Data Preparation for Data Mining

ISBN-10: 1558605290

ISBN-13: 9781558605299

Edition: 1999

Authors: Dorian Pyle

List price: $86.95


This book focuses on the importance of clean, well-structured data as the first step to successful data mining. It shows how data should be prepared prior to mining in order to maximize mining performance.

Book details

Copyright year: 1999
Publisher: Elsevier Science & Technology Books
Publication date: 3/22/1999
Binding: Paperback
Pages: 560
Size: 7.25" wide x 9.25" long x 1.25" tall
Weight: 2.090
Language: English

Dorian Pyle is Chief Scientist and Founder of PTI, which develops and markets Powerhouse™ predictive and explanatory analytics software. Dorian has over 20 years' experience in artificial intelligence and machine learning techniques, which are used in what is known today as "data mining" or "predictive analytics". He has applied this knowledge as a consultant with Knowledge Stream Partners, Xchange, Naviant, Thinking Machines, and Data Miners, with various companies directly involved in credit card marketing for banks, and with manufacturing companies using industrial automation. In 1976 he was involved in building artificially intelligent machine learning systems using the pioneering technologies now known as neural computing and associative memories. He is familiar with the most advanced technologies in data mining, including entropic analysis (information theory), chaotic and fractal decomposition, neural technologies, evolution and genetic optimization, algebra evolvers, case-based reasoning, concept induction, and other advanced statistical techniques.

Data Exploration as a Process
The Data Exploration Process
Stage 1: Exploring the Problem Space
Stage 2: Exploring the Solution Space
Stage 3: Specifying the Implementation Method
Stage 4: Mining the Data
Exploration: Mining and Modeling
Data Mining, Modeling, and Modeling Tools
Ten Golden Rules
Introducing Modeling Tools
Types of Models
Active and Passive Models
Explanatory and Predictive Models
Static and Continuously Learning Models
Supplemental Material
A Continuously Learning Model Application
How the Continuously Learning Model Worked
The Nature of the World and Its Impact on Data Preparation
Measuring the World
Capturing Measurements
Errors of Measurement
Tying Measurements to the Real World
Types of Measurements
Scalar Measurements
Nonscalar Measurements
Continua of Attributes of Variables
The Qualitative-Quantitative Continuum
The Discrete-Continuous Continuum
Scale Measurement Example
Transformations and Difficulties--Variables, Data, and Information
Building Mineable Data Representations
Data Representation
Building Data--Dealing with Variables
Building Mineable Data Sets
Supplemental Material
Data Preparation as a Process
Data Preparation: Inputs, Outputs, Models, and Decisions
Step 1: Prepare the Data
Step 2: Survey the Data
Step 3: Model the Data
Use the Model
Modeling Tools and Data Preparation
How Modeling Tools Drive Data Preparation
Decision Trees
Decision Lists
Neural Networks
Evolution Programs
Modeling Data with the Tools
Predictions and Rules
Choosing Techniques
Missing Data and Modeling Tools
Stages of Data Preparation
Stage 1: Accessing the Data
Stage 2: Auditing the Data
Stage 3: Enhancing and Enriching the Data
Stage 4: Looking for Sampling Bias
Stage 5: Determining Data Structure (Super-, Macro-, and Micro-)
Stage 6: Building the PIE
Stage 7: Surveying the Data
Stage 8: Modeling the Data
And the Result Is ...?
Getting the Data: Basic Preparation
Data Discovery
Data Access Issues
Data Characterization
Detail/Aggregation Level (Granularity)
Duplicate or Redundant Variables
Data Set Assembly
Reverse Pivoting
Feature Extraction
Physical or Behavioral Data Sets
Explanatory Structure
Data Enhancement or Enrichment
Sampling Bias
Example 1: Credit
Looking at the Variables
Relationships between Variables
Example 2: Shoe
Looking at the Variables
Relationships between Variables
The Data Assay
Sampling, Variability, and Confidence
Sampling, or First Catch Your Hare!
How Much Data?
Converging on a Representative Sample
Measuring Variability
Variability and Deviation
Variability of Numeric Variables
Variability and Sampling
Variability and Convergence
Variability and Confidence in Alpha Variables
Ordering and Rate of Discovery
Measuring Confidence
Modeling and Confidence with the Whole Population
Testing for Confidence
Confidence Tests and Variability
Confidence in Capturing Variability
A Brief Introduction to the Normal Distribution
Normally Distributed Probabilities
Capturing Normally Distributed Probabilities: An Example
Capturing Confidence, Capturing Variance
Problems and Shortcomings of Taking Samples Using Variability
Missing Values
Constants (Variables with Only One Value)
Problems with Sampling
Monotonic Variable Detection
Interstitial Linearity
Rate of Discovery
Confidence and Instance Count
Supplemental Material
Confidence Samples
Handling Nonnumerical Variables
Representing Alphas and Remapping
One-of-n Remapping
m-of-n Remapping
Remapping to Eliminate Ordering
Remapping One-to-Many Patterns, or Ill-Formed Problems
Remapping Circular Discontinuity
State Space
Unit State Space
Pythagoras in State Space
Position in State Space
Neighbors and Associates
Density and Sparsity
Nearby and Distant Nearest Neighbors
Normalizing Measured Point Separation
Contours, Peaks, and Valleys
Mapping State Space
Objects in State Space
Phase Space
Mapping Alpha Values
Location, Location, Location!
Numerics, Alphas, and the Montreal Canadiens
Joint Distribution Tables
Two-Way Tables
More Values, More Variables, and Meaning of the Numeration
Dealing with Low-Frequency Alpha Labels and Other Problems
Multidimensional Scaling
Squashing a Triangle
Projecting Alpha Values
Scree Plots
Practical Consideration--Implementing Alpha Numeration in the Demonstration Code
Implementing Neighborhoods
Implementing Numeration in All Alpha Data Sets
Implementing Dimensionality Reduction for Variables
Normalizing and Redistributing Variables
Normalizing a Variable's Range
Review of Data Preparation and Modeling (Training, Testing, and Execution)
The Nature and Scope of the Out-of-Range Values Problem
Discovering the Range of Values When Building the PIE
Out-of-Range Values When Training
Out-of-Range Values When Testing
Out-of-Range Values When Executing
Scaling Transformations
Softmax Scaling
Normalizing Ranges
Redistributing Variable Values
The Nature of Distributions
Distributive Difficulties
Adjusting Distributions
Modified Distributions
Supplemental Material
The Logistic Function
Modifying the Linear Part of the Logistic Function Range
Replacing Missing and Empty Values
Retaining Information about Missing Values
Missing-Value Patterns
Capturing Patterns
Replacing Missing Values
Unbiased Estimators
Variability Relationships
Relationships between Variables
Preserving Between-Variable Relationships
Supplemental Material
Using Regression to Find Least Information-Damaging Missing Values
Alternative Methods of Missing-Value Replacement
Series Variables
Here There Be Dragons!
Types of Series
Describing Series Data
Constructing a Series
Features of a Series
Describing a Series--Fourier
Describing a Series--Spectrum
Describing a Series--Trend, Seasonality, Cycles, Noise
Describing a Series--Autocorrelation
Modeling Series Data
Repairing Series Data Problems
Missing Values
Nonuniform Displacement
Moving Averages
Smoothing 1--PVM Smoothing
Smoothing 2--Median Smoothing, Resmoothing, and Hanning
Other Problems
Numerating Alpha Values
Preparing Series Data
Looking at the Data
Signposts on the Rocky Road
Implementation Notes
Preparing the Data Set
Using Sparsely Populated Variables
Increasing Information Density Using Sparsely Populated Variables
Binning Sparse Numerical Values
Present-Value Patterns (PVPs)
Problems with High-Dimensionality Data Sets
Information Representation
Representing High-Dimensionality Data in Fewer Dimensions
Introducing the Neural Network
Training a Neural Network
Reshaping the Logistic Curve
Single-Input Neurons
Multiple-Input Neurons
Networking Neurons to Estimate a Function
Network Learning
Network Prediction--Hidden Layer
Network Prediction--Output Layer
Stochastic Network Performance
Network Architecture 1--The Autoassociative Network
Network Architecture 2--The Sparsely Connected Network
Compressing Variables
Using Compressed Dimensionality Data
Removing Variables
Estimating Variable Importance 1: What Doesn't Work
Estimating Variable Importance 2: Clues
Estimating Variable Importance 3: Configuring and Training the Network
How Much Data Is Enough?
Joint Distribution
Capturing Joint Variability
Degrees of Freedom
Beyond Joint Distribution
Enhancing the Data Set
Data Sets in Perspective
Implementation Notes
Collapsing Extremely Sparsely Populated Variables
Reducing Excessive Dimensionality
Measuring Variable Importance
Feature Enhancement
Where Next?
The Data Survey
Introduction to the Data Survey
Information and Communication
Measuring Information: Signals and Dictionaries
Measuring Information: Signals
Measuring Information: Bits of Information
Measuring Information: Surprise
Measuring Information: Entropy
Measuring Information: Dictionaries
Mapping Using Entropy
Whole Data Set Entropy
Conditional Entropy between Inputs and Outputs
Mutual Information
Other Survey Uses for Entropy and Information
Looking for Information
Identifying Problems with a Data Survey
Confidence and Sufficient Data
Detecting Sparsity
Manifold Definition
Sampling Bias
Making the Data Survey
Novelty Detection
Other Directions
Supplemental Material
Entropic Analysis--Example
Surveying Data Sets
Using Prepared Data
Modeling Data
Data Mining vs. Exploratory Data Analysis
Characterizing Data
Decision Trees
Nearest Neighbor
Neural Networks and Regression
Prepared Data and Modeling Algorithms
Neural Networks and the CREDIT Data Set
Decision Trees and the CREDIT Data Set
Practical Use of Data Preparation and Prepared Data
Looking at Present Modeling Tools and Future Directions
Near Future
Farther Out
Using the Demonstration Code on the CD-ROM
Further Reading
About the Author
About the CD-ROM