Data Preparation for Data Mining

ISBN-10: 1558605290

ISBN-13: 9781558605299

Edition: 1999

Authors: Dorian Pyle

List price: $86.95

Description:

This book focuses on the importance of clean, well-structured data as the first step to successful data mining. It shows how data should be prepared prior to mining in order to maximize mining performance.

Book details

Copyright year: 1999
Publisher: Elsevier Science & Technology
Publication date: 4/15/1999
Binding: Paperback
Pages: 560
Size: 7.39" wide x 9.26" long x 1.24" tall
Weight: 2.090
Language: English

Dorian Pyle is Chief Scientist and Founder of PTI (www.pti.com), which develops and markets Powerhouse™ predictive and explanatory analytics software. Dorian has over 20 years of experience in artificial intelligence and machine learning techniques, which are used in what is known today as "data mining" or "predictive analytics". He has applied this knowledge as a consultant with Knowledge Stream Partners, Xchange, Naviant, Thinking Machines, and Data Miners, and with various companies directly involved in credit card marketing for banks, as well as with manufacturing companies using industrial automation. In 1976 he was involved in building artificially intelligent machine learning systems utilizing…

Preface
Introduction
Data Exploration as a Process
The Data Exploration Process
Stage 1: Exploring the Problem Space
Stage 2: Exploring the Solution Space
Stage 3: Specifying the Implementation Method
Stage 4: Mining the Data
Exploration: Mining and Modeling
Data Mining, Modeling, and Modeling Tools
Ten Golden Rules
Introducing Modeling Tools
Types of Models
Active and Passive Models
Explanatory and Predictive Models
Static and Continuously Learning Models
Summary
Supplemental Material
A Continuously Learning Model Application
How the Continuously Learning Model Worked
The Nature of the World and Its Impact on Data Preparation
Measuring the World
Objects
Capturing Measurements
Errors of Measurement
Tying Measurements to the Real World
Types of Measurements
Scalar Measurements
Nonscalar Measurements
Continua of Attributes of Variables
The Qualitative-Quantitative Continuum
The Discrete-Continuous Continuum
Scale Measurement Example
Transformations and Difficulties--Variables, Data, and Information
Building Mineable Data Representations
Data Representation
Building Data--Dealing with Variables
Building Mineable Data Sets
Summary
Supplemental Material
Combinations
Data Preparation as a Process
Data Preparation: Inputs, Outputs, Models, and Decisions
Step 1: Prepare the Data
Step 2: Survey the Data
Step 3: Model the Data
Use the Model
Modeling Tools and Data Preparation
How Modeling Tools Drive Data Preparation
Decision Trees
Decision Lists
Neural Networks
Evolution Programs
Modeling Data with the Tools
Predictions and Rules
Choosing Techniques
Missing Data and Modeling Tools
Stages of Data Preparation
Stage 1: Accessing the Data
Stage 2: Auditing the Data
Stage 3: Enhancing and Enriching the Data
Stage 4: Looking for Sampling Bias
Stage 5: Determining Data Structure (Super-, Macro-, and Micro-)
Stage 6: Building the PIE
Stage 7: Surveying the Data
Stage 8: Modeling the Data
And the Result Is ...?
Getting the Data: Basic Preparation
Data Discovery
Data Access Issues
Data Characterization
Detail/Aggregation Level (Granularity)
Consistency
Pollution
Objects
Relationship
Domain
Defaults
Integrity
Concurrency
Duplicate or Redundant Variables
Data Set Assembly
Reverse Pivoting
Feature Extraction
Physical or Behavioral Data Sets
Explanatory Structure
Data Enhancement or Enrichment
Sampling Bias
Example 1: Credit
Looking at the Variables
Relationships between Variables
Example 2: Shoe
Looking at the Variables
Relationships between Variables
The Data Assay
Sampling, Variability, and Confidence
Sampling, or First Catch Your Hare!
How Much Data?
Variability
Converging on a Representative Sample
Measuring Variability
Variability and Deviation
Confidence
Variability of Numeric Variables
Variability and Sampling
Variability and Convergence
Variability and Confidence in Alpha Variables
Ordering and Rate of Discovery
Measuring Confidence
Modeling and Confidence with the Whole Population
Testing for Confidence
Confidence Tests and Variability
Confidence in Capturing Variability
A Brief Introduction to the Normal Distribution
Normally Distributed Probabilities
Capturing Normally Distributed Probabilities: An Example
Capturing Confidence, Capturing Variance
Problems and Shortcomings of Taking Samples Using Variability
Missing Values
Constants (Variables with Only One Value)
Problems with Sampling
Monotonic Variable Detection
Interstitial Linearity
Rate of Discovery
Confidence and Instance Count
Summary
Supplemental Material
Confidence Samples
Handling Nonnumerical Variables
Representing Alphas and Remapping
One-of-n Remapping
m-of-n Remapping
Remapping to Eliminate Ordering
Remapping One-to-Many Patterns, or Ill-Formed Problems
Remapping Circular Discontinuity
State Space
Unit State Space
Pythagoras in State Space
Position in State Space
Neighbors and Associates
Density and Sparsity
Nearby and Distant Nearest Neighbors
Normalizing Measured Point Separation
Contours, Peaks, and Valleys
Mapping State Space
Objects in State Space
Phase Space
Mapping Alpha Values
Location, Location, Location!
Numerics, Alphas, and the Montreal Canadiens
Joint Distribution Tables
Two-Way Tables
More Values, More Variables, and Meaning of the Numeration
Dealing with Low-Frequency Alpha Labels and Other Problems
Dimensionality
Multidimensional Scaling
Squashing a Triangle
Projecting Alpha Values
Scree Plots
Practical Consideration--Implementing Alpha Numeration in the Demonstration Code
Implementing Neighborhoods
Implementing Numeration in All Alpha Data Sets
Implementing Dimensionality Reduction for Variables
Summary
Normalizing and Redistributing Variables
Normalizing a Variable's Range
Review of Data Preparation and Modeling (Training, Testing, and Execution)
The Nature and Scope of the Out-of-Range Values Problem
Discovering the Range of Values When Building the PIE
Out-of-Range Values When Training
Out-of-Range Values When Testing
Out-of-Range Values When Executing
Scaling Transformations
Softmax Scaling
Normalizing Ranges
Redistributing Variable Values
The Nature of Distributions
Distributive Difficulties
Adjusting Distributions
Modified Distributions
Summary
Supplemental Material
The Logistic Function
Modifying the Linear Part of the Logistic Function Range
Replacing Missing and Empty Values
Retaining Information about Missing Values
Missing-Value Patterns
Capturing Patterns
Replacing Missing Values
Unbiased Estimators
Variability Relationships
Relationships between Variables
Preserving Between-Variable Relationships
Summary
Supplemental Material
Using Regression to Find Least Information-Damaging Missing Values
Alternative Methods of Missing-Value Replacement
Series Variables
Here There Be Dragons!
Types of Series
Describing Series Data
Constructing a Series
Features of a Series
Describing a Series--Fourier
Describing a Series--Spectrum
Describing a Series--Trend, Seasonality, Cycles, Noise
Describing a Series--Autocorrelation
Modeling Series Data
Repairing Series Data Problems
Missing Values
Outliers
Nonuniform Displacement
Trend
Tools
Filtering
Moving Averages
Smoothing 1--PVM Smoothing
Smoothing 2--Median Smoothing, Resmoothing, and Hanning
Extraction
Differencing
Other Problems
Numerating Alpha Values
Distribution
Normalization
Preparing Series Data
Looking at the Data
Signposts on the Rocky Road
Implementation Notes
Preparing the Data Set
Using Sparsely Populated Variables
Increasing Information Density Using Sparsely Populated Variables
Binning Sparse Numerical Values
Present-Value Patterns (PVPs)
Problems with High-Dimensionality Data Sets
Information Representation
Representing High-Dimensionality Data in Fewer Dimensions
Introducing the Neural Network
Training a Neural Network
Neurons
Reshaping the Logistic Curve
Single-Input Neurons
Multiple-Input Neurons
Networking Neurons to Estimate a Function
Network Learning
Network Prediction--Hidden Layer
Network Prediction--Output Layer
Stochastic Network Performance
Network Architecture 1--The Autoassociative Network
Network Architecture 2--The Sparsely Connected Network
Compressing Variables
Using Compressed Dimensionality Data
Removing Variables
Estimating Variable Importance 1: What Doesn't Work
Estimating Variable Importance 2: Clues
Estimating Variable Importance 3: Configuring and Training the Network
How Much Data Is Enough?
Joint Distribution
Capturing Joint Variability
Degrees of Freedom
Beyond Joint Distribution
Enhancing the Data Set
Data Sets in Perspective
Implementation Notes
Collapsing Extremely Sparsely Populated Variables
Reducing Excessive Dimensionality
Measuring Variable Importance
Feature Enhancement
Where Next?
The Data Survey
Introduction to the Data Survey
Information and Communication
Measuring Information: Signals and Dictionaries
Measuring Information: Signals
Measuring Information: Bits of Information
Measuring Information: Surprise
Measuring Information: Entropy
Measuring Information: Dictionaries
Mapping Using Entropy
Whole Data Set Entropy
Conditional Entropy between Inputs and Outputs
Mutual Information
Other Survey Uses for Entropy and Information
Looking for Information
Identifying Problems with a Data Survey
Confidence and Sufficient Data
Detecting Sparsity
Manifold Definition
Clusters
Sampling Bias
Making the Data Survey
Novelty Detection
Other Directions
Supplemental Material
Entropic Analysis--Example
Surveying Data Sets
Using Prepared Data
Modeling Data
Assumptions
Models
Data Mining vs. Exploratory Data Analysis
Characterizing Data
Decision Trees
Clusters
Nearest Neighbor
Neural Networks and Regression
Prepared Data and Modeling Algorithms
Neural Networks and the CREDIT Data Set
Decision Trees and the CREDIT Data Set
Practical Use of Data Preparation and Prepared Data
Looking at Present Modeling Tools and Future Directions
Near Future
Farther Out
Using the Demonstration Code on the CD-ROM
Further Reading
Index
About the Author
About the CD-ROM