| |
| |
Foreword | |
| |
| |
| |
Preface | |
| |
| |
Acknowledgments | |
| |
| |
| |
Preliminaries | |
| |
| |
| |
Introduction | |
| |
| |
| |
What Is Data Mining? | |
| |
| |
| |
Where Is Data Mining Used? | |
| |
| |
| |
The Origins of Data Mining | |
| |
| |
| |
The Rapid Growth of Data Mining | |
| |
| |
| |
Why Are There So Many Different Methods? | |
| |
| |
| |
Terminology and Notation | |
| |
| |
| |
Road Maps to This Book | |
| |
| |
| |
Overview of the Data Mining Process | |
| |
| |
| |
Introduction | |
| |
| |
| |
Core Ideas in Data Mining | |
| |
| |
| |
Supervised and Unsupervised Learning | |
| |
| |
| |
The Steps in Data Mining | |
| |
| |
| |
Preliminary Steps | |
| |
| |
| |
Building a Model: Example with Linear Regression | |
| |
| |
| |
Using Excel for Data Mining | |
| |
| |
Problems | |
| |
| |
| |
Data Exploration and Dimension Reduction | |
| |
| |
| |
Data Visualization | |
| |
| |
| |
Uses of Data Visualization | |
| |
| |
| |
Data Examples | |
| |
| |
| |
Boston Housing Data | |
| |
| |
| |
Ridership on Amtrak Trains | |
| |
| |
| |
Basic Charts: bar charts, line graphs, and scatterplots | |
| |
| |
Distribution Plots | |
| |
| |
Heatmaps: visualizing correlations and missing values | |
| |
| |
| |
Multidimensional Visualization | |
| |
| |
Adding Variables: color, hue, size, shape, multiple panels, animation | |
| |
| |
Manipulations: rescaling, aggregation and hierarchies, zooming and panning, filtering | |
| |
| |
Reference: trend line and labels | |
| |
| |
Scaling up: large datasets | |
| |
| |
Multivariate plot: parallel coordinates plot | |
| |
| |
Interactive visualization | |
| |
| |
| |
Specialized Visualizations | |
| |
| |
Visualizing networked data | |
| |
| |
Visualizing hierarchical data: treemaps | |
| |
| |
Visualizing geographical data: maps | |
| |
| |
| |
Summary of major visualizations and operations, according to data mining goal | |
| |
| |
Prediction | |
| |
| |
Classification | |
| |
| |
Time series forecasting | |
| |
| |
Unsupervised learning | |
| |
| |
Problems | |
| |
| |
| |
Dimension Reduction | |
| |
| |
| |
Introduction | |
| |
| |
| |
Practical Considerations | |
| |
| |
| |
House Prices in Boston | |
| |
| |
| |
Data Summaries | |
| |
| |
| |
Correlation Analysis | |
| |
| |
| |
Reducing the Number of Categories in Categorical Variables | |
| |
| |
| |
Converting a Categorical Variable to a Numerical Variable | |
| |
| |
| |
Principal Components Analysis | |
| |
| |
| |
Breakfast Cereals | |
| |
| |
Principal Components | |
| |
| |
Normalizing the Data | |
| |
| |
Using Principal Components for Classification and Prediction | |
| |
| |
| |
Dimension Reduction Using Regression Models | |
| |
| |
| |
Dimension Reduction Using Classification and Regression Trees | |
| |
| |
Problems | |
| |
| |
| |
Performance Evaluation | |
| |
| |
| |
Evaluating Classification and Predictive Performance | |
| |
| |
| |
Introduction | |
| |
| |
| |
Judging Classification Performance | |
| |
| |
Benchmark: The Naive Rule | |
| |
| |
Class Separation | |
| |
| |
The Classification Matrix | |
| |
| |
Using the Validation Data | |
| |
| |
Accuracy Measures | |
| |
| |
Cutoff for Classification | |
| |
| |
Performance in Unequal Importance of Classes | |
| |
| |
Asymmetric Misclassification Costs | |
| |
| |
Oversampling and Asymmetric Costs | |
| |
| |
Classification Using a Triage Strategy | |
| |
| |
| |
Evaluating Predictive Performance | |
| |
| |
Benchmark: The Average | |
| |
| |
Prediction Accuracy Measures | |
| |
| |
Problems | |
| |
| |
| |
Prediction and Classification Methods | |
| |
| |
| |
Multiple Linear Regression | |
| |
| |
| |
Introduction | |
| |
| |
| |
Explanatory vs. Predictive Modeling | |
| |
| |
| |
Estimating the Regression Equation and Prediction | |
| |
| |
Example: Predicting the Price of Used Toyota Corolla Automobiles | |
| |
| |
| |
Variable Selection in Linear Regression | |
| |
| |
Reducing the Number of Predictors | |
| |
| |
How to Reduce the Number of Predictors | |
| |
| |
Problems | |
| |
| |
| |
k-Nearest Neighbors (kNN) | |
| |
| |
| |
The kNN Classifier | |
| |
| |
| |
Determining Neighbors | |
| |
| |
Classification Rule | |
| |
| |
Example: Riding Mowers | |
| |
| |
Choosing k | |
| |
| |
Setting the Cutoff Value | |
| |
| |
kNN with More Than 2 Classes | |
| |
| |
| |
kNN for a Numerical Response | |
| |
| |
| |
Advantages and Shortcomings of kNN | |
| |
| |
Algorithms | |
| |
| |
Problems | |
| |
| |
| |
Naive Bayes | |
| |
| |
| |
Introduction | |
| |
| |
| |
Predicting Fraudulent Financial Reporting | |
| |
| |
The Practical Difficulty with the Complete (Exact) Bayes Procedure | |
| |
| |
The Solution: Naive Bayes | |
| |
| |
| |
Predicting Fraudulent Financial Reports, 2 Predictors | |
| |
| |
| |
Predicting Delayed Flights | |
| |
| |
Advantages and Shortcomings of the naive Bayes Classifier | |
| |
| |
Problems | |
| |
| |
| |
Classification and Regression Trees | |
| |
| |
| |
Introduction | |
| |
| |
| |
Classification Trees | |
| |
| |
Recursive Partitioning | |
| |
| |
| |
Riding Mowers | |
| |
| |
| |
Measures of Impurity | |
| |
| |
| |
Evaluating the Performance of a Classification Tree | |
| |
| |
| |
Acceptance of Personal Loan | |
| |
| |
| |
Avoiding Overfitting | |
| |
| |
Stopping Tree Growth: CHAID | |
| |
| |
Pruning the Tree | |
| |
| |
| |
Classification Rules from Trees | |
| |
| |
| |
Classification Trees for More Than 2 Classes | |
| |
| |
| |
Regression Trees | |
| |
| |
Prediction | |
| |
| |
Measuring Impurity | |
| |
| |
Evaluating Performance | |
| |
| |
| |
Advantages, Weaknesses, and Extensions | |
| |
| |
Problems | |
| |
| |
| |
Logistic Regression | |
| |
| |
| |
Introduction | |
| |
| |
| |
The Logistic Regression Model | |
| |
| |
Example: Acceptance of Personal Loan | |
| |
| |
Model with a Single Predictor | |
| |
| |
Estimating the Logistic Model from Data: Computing Parameter Estimates | |
| |
| |
Interpreting Results in Terms of Odds | |
| |
| |
| |
Evaluating Classification Performance | |
| |
| |
Variable Selection | |
| |
| |
| |
Example of Complete Analysis: Predicting Delayed Flights | |
| |
| |
Data Preprocessing | |
| |
| |
Model Fitting and Estimation | |
| |
| |
Model Interpretation | |
| |
| |
Model Performance | |
| |
| |
Variable Selection | |
| |
| |
Appendix: Logistic Regression for Profiling | |
| |
| |
Appendix B: Evaluating Goodness of Fit | |
| |
| |
Appendix C: Logistic Regression for More Than Two Classes | |
| |
| |
Problems | |
| |
| |
| |
Neural Nets | |
| |
| |
| |
Introduction | |
| |
| |
| |
Concept and Structure of a Neural Network | |
| |
| |
| |
Fitting a Network to Data | |
| |
| |
| |
Tiny Dataset | |
| |
| |
Computing Output of Nodes | |
| |
| |
Preprocessing the Data | |
| |
| |
Training the Model | |
| |
| |
| |
Classifying Accident Severity | |
| |
| |
Avoiding Overfitting | |
| |
| |
Using the Output for Prediction and Classification | |
| |
| |
| |
Required User Input | |
| |
| |
| |
Exploring the Relationship Between Predictors and Response | |
| |
| |
| |
Advantages and Weaknesses of Neural Networks | |
| |
| |
Problems | |
| |
| |
| |
Discriminant Analysis | |
| |
| |
| |
Introduction | |
| |
| |
| |
Riding Mowers | |
| |
| |
| |
Personal Loan Acceptance | |
| |
| |
| |
Distance of an Observation from a Class | |
| |
| |
| |
Fisher's Linear Classification Functions | |
| |
| |
| |
Classification Performance of Discriminant Analysis | |
| |
| |
| |
Prior Probabilities | |
| |
| |
| |
Unequal Misclassification Costs | |
| |
| |
| |
Classifying More Than Two Classes | |
| |
| |
| |
Medical Dispatch to Accident Scenes | |
| |
| |
| |
Advantages and Weaknesses | |
| |
| |
Problems | |
| |
| |
| |
Mining Relationships Among Records | |
| |
| |
| |
Association Rules | |
| |
| |
| |
Introduction | |
| |
| |
| |
Discovering Association Rules in Transaction Databases | |
| |
| |
| |
Synthetic Data on Purchases of Phone Faceplates | |
| |
| |
| |
Generating Candidate Rules | |
| |
| |
The Apriori Algorithm | |
| |
| |
| |
Selecting Strong Rules | |
| |
| |
Support and Confidence | |
| |
| |
Lift Ratio | |
| |
| |
Data Format | |
| |
| |
The Process of Rule Selection | |
| |
| |
Interpreting the Results | |
| |
| |
Statistical Significance of Rules | |
| |
| |
| |
Rules for Similar Book Purchases | |
| |
| |
| |
Summary | |
| |
| |
Problems | |
| |
| |
| |
Cluster Analysis | |
| |
| |
| |
Introduction | |
| |
| |
Example: Public Utilities | |
| |
| |
| |
Measuring Distance Between Two Records | |
| |
| |
Euclidean Distance | |
| |
| |
Normalizing Numerical Measurements | |
| |
| |
Other Distance Measures for Numerical Data | |
| |
| |
Distance Measures for Categorical Data | |
| |
| |
Distance Measures for Mixed Data | |
| |
| |
| |
Measuring Distance Between Two Clusters | |
| |
| |
| |
Hierarchical (Agglomerative) Clustering | |
| |
| |
Minimum Distance (Single Linkage) | |
| |
| |
Maximum Distance (Complete Linkage) | |
| |
| |
Average Distance (Average Linkage) | |
| |
| |
Dendrograms: Displaying Clustering Process and Results | |
| |
| |
Validating Clusters | |
| |
| |
Limitations of Hierarchical Clustering | |
| |
| |
| |
Nonhierarchical Clustering: The k-Means Algorithm | |
| |
| |
Initial Partition into k Clusters | |
| |
| |
Problems | |
| |
| |
| |
Forecasting Time Series | |
| |
| |
| |
Handling Time Series | |
| |
| |
| |
Introduction | |
| |
| |
| |
Explanatory vs. Predictive Modeling | |
| |
| |
| |
Popular Forecasting Methods in Business | |
| |
| |
Combining Methods | |
| |
| |
| |
Time Series Components | |
| |
| |
Example: Ridership on Amtrak Trains | |
| |
| |
| |
Data Partitioning | |
| |
| |
Problems | |
| |
| |
| |
Regression-Based Forecasting | |
| |
| |
| |
A Model with Trend | |
| |
| |
Linear Trend | |
| |
| |
Exponential Trend | |
| |
| |
Polynomial Trend | |
| |
| |
| |
A Model with Seasonality | |
| |
| |
| |
A Model with Trend and Seasonality | |
| |
| |
| |
Autocorrelation and ARIMA Models | |
| |
| |
| |
Computing Autocorrelation | |
| |
| |
| |
Improving Forecasts by Integrating Autocorrelation Information | |
| |
| |
| |
Evaluating Predictability | |
| |
| |
Problems | |
| |
| |
| |
Smoothing Methods | |
| |
| |
| |
Introduction | |
| |
| |
| |
Moving Average | |
| |
| |
Centered Moving Average for Visualization | |
| |
| |
Trailing Moving Average for Forecasting | |
| |
| |
Choosing Window Width | |
| |
| |
| |
Simple Exponential Smoothing | |
| |
| |
Choosing Smoothing Parameter | |
| |
| |
Relation Between Moving Average and Simple Exponential Smoothing | |
| |
| |
| |
Advanced Exponential Smoothing | |
| |
| |
Series with a trend | |
| |
| |
Series with a trend and seasonality | |
| |
| |
Series with seasonality | |
| |
| |
| |
Problems | |
| |
| |
| |
Cases | |
| |
| |
| |
Cases | |
| |
| |
| |
Charles Book Club | |
| |
| |
| |
German Credit | |
| |
| |
| |
Tayko Software Cataloger | |
| |
| |
| |
Segmenting Consumers of Bath Soap | |
| |
| |
| |
Direct-Mail Fundraising | |
| |
| |
| |
Catalog Cross-Selling | |
| |
| |
| |
Predicting Bankruptcy | |
| |
| |
| |
Time Series Case: Forecasting Public Transportation Demand | |
| |
| |
References | |
| |
| |
Index | |