Preface
Introduction
Data Exploration as a Process
The Data Exploration Process
Stage 1: Exploring the Problem Space
Stage 2: Exploring the Solution Space
Stage 3: Specifying the Implementation Method
Stage 4: Mining the Data
Exploration: Mining and Modeling
Data Mining, Modeling, and Modeling Tools
Ten Golden Rules
Introducing Modeling Tools
Types of Models
Active and Passive Models
Explanatory and Predictive Models
Static and Continuously Learning Models
Summary
Supplemental Material
A Continuously Learning Model Application
How the Continuously Learning Model Worked
The Nature of the World and Its Impact on Data Preparation
Measuring the World
Objects
Capturing Measurements
Errors of Measurement
Tying Measurements to the Real World
Types of Measurements
Scalar Measurements
Nonscalar Measurements
Continua of Attributes of Variables
The Qualitative-Quantitative Continuum
The Discrete-Continuous Continuum
Scale Measurement Example
Transformations and Difficulties--Variables, Data, and Information
Building Mineable Data Representations
Data Representation
Building Data--Dealing with Variables
Building Mineable Data Sets
Summary
Supplemental Material
Combinations
Data Preparation as a Process
Data Preparation: Inputs, Outputs, Models, and Decisions
Step 1: Prepare the Data
Step 2: Survey the Data
Step 3: Model the Data
Use the Model
Modeling Tools and Data Preparation
How Modeling Tools Drive Data Preparation
Decision Trees
Decision Lists
Neural Networks
Evolution Programs
Modeling Data with the Tools
Predictions and Rules
Choosing Techniques
Missing Data and Modeling Tools
Stages of Data Preparation
Stage 1: Accessing the Data
Stage 2: Auditing the Data
Stage 3: Enhancing and Enriching the Data
Stage 4: Looking for Sampling Bias
Stage 5: Determining Data Structure (Super-, Macro-, and Micro-)
Stage 6: Building the PIE
Stage 7: Surveying the Data
Stage 8: Modeling the Data
And the Result Is ...?
Getting the Data: Basic Preparation
Data Discovery
Data Access Issues
Data Characterization
Detail/Aggregation Level (Granularity)
Consistency
Pollution
Objects
Relationship
Domain
Defaults
Integrity
Concurrency
Duplicate or Redundant Variables
Data Set Assembly
Reverse Pivoting
Feature Extraction
Physical or Behavioral Data Sets
Explanatory Structure
Data Enhancement or Enrichment
Sampling Bias
Example 1: Credit
Looking at the Variables
Relationships between Variables
Example 2: Shoe
Looking at the Variables
Relationships between Variables
The Data Assay
Sampling, Variability, and Confidence
Sampling, or First Catch Your Hare!
How Much Data?
Variability
Converging on a Representative Sample
Measuring Variability
Variability and Deviation
Confidence
Variability of Numeric Variables
Variability and Sampling
Variability and Convergence
Variability and Confidence in Alpha Variables
Ordering and Rate of Discovery
Measuring Confidence
Modeling and Confidence with the Whole Population
Testing for Confidence
Confidence Tests and Variability
Confidence in Capturing Variability
A Brief Introduction to the Normal Distribution
Normally Distributed Probabilities
Capturing Normally Distributed Probabilities: An Example
Capturing Confidence, Capturing Variance
Problems and Shortcomings of Taking Samples Using Variability
Missing Values
Constants (Variables with Only One Value)
Problems with Sampling
Monotonic Variable Detection
Interstitial Linearity
Rate of Discovery
Confidence and Instance Count
Summary
Supplemental Material
Confidence Samples
Handling Nonnumerical Variables
Representing Alphas and Remapping
One-of-n Remapping
m-of-n Remapping
Remapping to Eliminate Ordering
Remapping One-to-Many Patterns, or Ill-Formed Problems
Remapping Circular Discontinuity
State Space
Unit State Space
Pythagoras in State Space
Position in State Space
Neighbors and Associates
Density and Sparsity
Nearby and Distant Nearest Neighbors
Normalizing Measured Point Separation
Contours, Peaks, and Valleys
Mapping State Space
Objects in State Space
Phase Space
Mapping Alpha Values
Location, Location, Location!
Numerics, Alphas, and the Montreal Canadiens
Joint Distribution Tables
Two-Way Tables
More Values, More Variables, and Meaning of the Numeration
Dealing with Low-Frequency Alpha Labels and Other Problems
Dimensionality
Multidimensional Scaling
Squashing a Triangle
Projecting Alpha Values
Scree Plots
Practical Consideration--Implementing Alpha Numeration in the Demonstration Code
Implementing Neighborhoods
Implementing Numeration in All Alpha Data Sets
Implementing Dimensionality Reduction for Variables
Summary
Normalizing and Redistributing Variables
Normalizing a Variable's Range
Review of Data Preparation and Modeling (Training, Testing, and Execution)
The Nature and Scope of the Out-of-Range Values Problem
Discovering the Range of Values When Building the PIE
Out-of-Range Values When Training
Out-of-Range Values When Testing
Out-of-Range Values When Executing
Scaling Transformations
Softmax Scaling
Normalizing Ranges
Redistributing Variable Values
The Nature of Distributions
Distributive Difficulties
Adjusting Distributions
Modified Distributions
Summary
Supplemental Material
The Logistic Function
Modifying the Linear Part of the Logistic Function Range
Replacing Missing and Empty Values
Retaining Information about Missing Values
Missing-Value Patterns
Capturing Patterns
Replacing Missing Values
Unbiased Estimators
Variability Relationships
Relationships between Variables
Preserving Between-Variable Relationships
Summary
Supplemental Material
Using Regression to Find Least Information-Damaging Missing Values
Alternative Methods of Missing-Value Replacement
Series Variables
Here There Be Dragons!
Types of Series
Describing Series Data
Constructing a Series
Features of a Series
Describing a Series--Fourier
Describing a Series--Spectrum
Describing a Series--Trend, Seasonality, Cycles, Noise
Describing a Series--Autocorrelation
Modeling Series Data
Repairing Series Data Problems
Missing Values
Outliers
Nonuniform Displacement
Trend
Tools
Filtering
Moving Averages
Smoothing 1--PVM Smoothing
Smoothing 2--Median Smoothing, Resmoothing, and Hanning
Extraction
Differencing
Other Problems
Numerating Alpha Values
Distribution
Normalization
Preparing Series Data
Looking at the Data
Signposts on the Rocky Road
Implementation Notes
Preparing the Data Set
Using Sparsely Populated Variables
Increasing Information Density Using Sparsely Populated Variables
Binning Sparse Numerical Values
Present-Value Patterns (PVPs)
Problems with High-Dimensionality Data Sets
Information Representation
Representing High-Dimensionality Data in Fewer Dimensions
Introducing the Neural Network
Training a Neural Network
Neurons
Reshaping the Logistic Curve
Single-Input Neurons
Multiple-Input Neurons
Networking Neurons to Estimate a Function
Network Learning
Network Prediction--Hidden Layer
Network Prediction--Output Layer
Stochastic Network Performance
Network Architecture 1--The Autoassociative Network
Network Architecture 2--The Sparsely Connected Network
Compressing Variables
Using Compressed Dimensionality Data
Removing Variables
Estimating Variable Importance 1: What Doesn't Work
Estimating Variable Importance 2: Clues
Estimating Variable Importance 3: Configuring and Training the Network
How Much Data Is Enough?
Joint Distribution
Capturing Joint Variability
Degrees of Freedom
Beyond Joint Distribution
Enhancing the Data Set
Data Sets in Perspective
Implementation Notes
Collapsing Extremely Sparsely Populated Variables
Reducing Excessive Dimensionality
Measuring Variable Importance
Feature Enhancement
Where Next?
The Data Survey
Introduction to the Data Survey
Information and Communication
Measuring Information: Signals and Dictionaries
Measuring Information: Signals
Measuring Information: Bits of Information
Measuring Information: Surprise
Measuring Information: Entropy
Measuring Information: Dictionaries
Mapping Using Entropy
Whole Data Set Entropy
Conditional Entropy between Inputs and Outputs
Mutual Information
Other Survey Uses for Entropy and Information
Looking for Information
Identifying Problems with a Data Survey
Confidence and Sufficient Data
Detecting Sparsity
Manifold Definition
Clusters
Sampling Bias
Making the Data Survey
Novelty Detection
Other Directions
Supplemental Material
Entropic Analysis--Example
Surveying Data Sets
Using Prepared Data
Modeling Data
Assumptions
Models
Data Mining vs. Exploratory Data Analysis
Characterizing Data
Decision Trees
Clusters
Nearest Neighbor
Neural Networks and Regression
Prepared Data and Modeling Algorithms
Neural Networks and the CREDIT Data Set
Decision Trees and the CREDIT Data Set
Practical Use of Data Preparation and Prepared Data
Looking at Present Modeling Tools and Future Directions
Near Future
Farther Out
Using the Demonstration Code on the CD-ROM
Further Reading
Index
About the Author
About the CD-ROM