Preface

Introduction to Data Mining
What Is Data Mining?
Why Data Mining?
Need for Human Direction of Data Mining
Cross-Industry Standard Process: CRISP-DM
Analyzing Automobile Warranty Claims: Example of the CRISP-DM Industry Standard Process in Action
Fallacies of Data Mining
What Tasks Can Data Mining Accomplish?
Description
Estimation
Prediction
Classification
Clustering
Association
Predicting Abnormal Stock Market Returns Using Neural Networks
Mining Association Rules from Legal Databases
Predicting Corporate Bankruptcies Using Decision Trees
Profiling the Tourism Market Using k-Means Clustering Analysis
References
Exercises

Data Preprocessing
Why Do We Need to Preprocess the Data?
Data Cleaning
Handling Missing Data
Identifying Misclassifications
Graphical Methods for Identifying Outliers
Data Transformation
Min-Max Normalization
Z-Score Standardization
Numerical Methods for Identifying Outliers
References
Exercises

Exploratory Data Analysis
Hypothesis Testing versus Exploratory Data Analysis
Getting to Know the Data Set
Dealing with Correlated Variables
Exploring Categorical Variables
Using EDA to Uncover Anomalous Fields
Exploring Numerical Variables
Exploring Multivariate Relationships
Selecting Interesting Subsets of the Data for Further Investigation
Binning
Summary
References
Exercises

Statistical Approaches to Estimation and Prediction
Data Mining Tasks in Discovering Knowledge in Data
Statistical Approaches to Estimation and Prediction
Univariate Methods: Measures of Center and Spread
Statistical Inference
How Confident Are We in Our Estimates?
Confidence Interval Estimation
Bivariate Methods: Simple Linear Regression
Dangers of Extrapolation
Confidence Intervals for the Mean Value of y Given x
Prediction Intervals for a Randomly Chosen Value of y Given x
Multiple Regression
Verifying Model Assumptions
References
Exercises

k-Nearest Neighbor Algorithm
Supervised versus Unsupervised Methods
Methodology for Supervised Modeling
Bias-Variance Trade-Off
Classification Task
k-Nearest Neighbor Algorithm
Distance Function
Combination Function
Simple Unweighted Voting
Weighted Voting
Quantifying Attribute Relevance: Stretching the Axes
Database Considerations
k-Nearest Neighbor Algorithm for Estimation and Prediction
Choosing k
Reference
Exercises

Decision Trees
Classification and Regression Trees
C4.5 Algorithm
Decision Rules
Comparison of the C5.0 and CART Algorithms Applied to Real Data
References
Exercises

Neural Networks
Input and Output Encoding
Neural Networks for Estimation and Prediction
Simple Example of a Neural Network
Sigmoid Activation Function
Back-Propagation
Gradient Descent Method
Back-Propagation Rules
Example of Back-Propagation
Termination Criteria
Learning Rate
Momentum Term
Sensitivity Analysis
Application of Neural Network Modeling
References
Exercises

Hierarchical and k-Means Clustering
Clustering Task
Hierarchical Clustering Methods
Single-Linkage Clustering
Complete-Linkage Clustering
k-Means Clustering
Example of k-Means Clustering at Work
Application of k-Means Clustering Using SAS Enterprise Miner
Using Cluster Membership to Predict Churn
References
Exercises

Kohonen Networks
Self-Organizing Maps
Kohonen Networks
Example of a Kohonen Network Study
Cluster Validity
Application of Clustering Using Kohonen Networks
Interpreting the Clusters
Cluster Profiles
Using Cluster Membership as Input to Downstream Data Mining Models
References
Exercises

Association Rules
Affinity Analysis and Market Basket Analysis
Data Representation for Market Basket Analysis
Support, Confidence, Frequent Itemsets, and the A Priori Property
How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules
Extension from Flag Data to General Categorical Data
Information-Theoretic Approach: Generalized Rule Induction Method
J-Measure
Application of Generalized Rule Induction
When Not to Use Association Rules
Do Association Rules Represent Supervised or Unsupervised Learning?
Local Patterns versus Global Models
References
Exercises

Model Evaluation Techniques
Model Evaluation Techniques for the Description Task
Model Evaluation Techniques for the Estimation and Prediction Tasks
Model Evaluation Techniques for the Classification Task
Error Rate, False Positives, and False Negatives
Misclassification Cost Adjustment to Reflect Real-World Concerns
Decision Cost/Benefit Analysis
Lift Charts and Gains Charts
Interweaving Model Evaluation with Model Building
Confluence of Results: Applying a Suite of Models
Reference
Exercises

Epilogue: "We've Only Just Begun"

Index