Data Preparation for Data Mining

Price: 7.74 USD
Availability: InStock

ISBN-10: 1558605290

ISBN-13: 9781558605299

Edition: 1999

Authors: Dorian Pyle

List price: $86.95

Description:

This book focuses on the importance of clean, well-structured data as the first step to successful data mining. It shows how data should be prepared prior to mining in order to maximize mining performance.

Book details

Copyright year: 1999
Publisher: Elsevier Science & Technology
Publication date: 4/15/1999
Binding: Paperback
Pages: 560
Size: 7.39" wide x 9.26" long x 1.24" tall
Weight: 2.090
Language: English

Dorian Pyle is Chief Scientist and Founder of PTI (www.pti.com), which develops and markets Powerhouse™ predictive and explanatory analytics software. Dorian has over 20 years of experience in artificial intelligence and machine learning techniques, which are used in what is known today as "data mining" or "predictive analytics." He has applied this knowledge as a consultant with Knowledge Stream Partners, Xchange, Naviant, Thinking Machines, and Data Miners, with various companies directly involved in credit card marketing for banks, and with manufacturing companies using industrial automation. In 1976 he was involved in building artificially intelligent machine learning systems utilizing…



Preface


Introduction



Data Exploration as a Process



The Data Exploration Process



Stage 1: Exploring the Problem Space



Stage 2: Exploring the Solution Space



Stage 3: Specifying the Implementation Method



Stage 4: Mining the Data



Exploration: Mining and Modeling



Data Mining, Modeling, and Modeling Tools



Ten Golden Rules



Introducing Modeling Tools



Types of Models



Active and Passive Models



Explanatory and Predictive Models



Static and Continuously Learning Models



Summary


Supplemental Material


A Continuously Learning Model Application


How the Continuously Learning Model Worked



The Nature of the World and Its Impact on Data Preparation



Measuring the World



Objects



Capturing Measurements



Errors of Measurement



Tying Measurements to the Real World



Types of Measurements



Scalar Measurements



Nonscalar Measurements



Continua of Attributes of Variables



The Qualitative-Quantitative Continuum



The Discrete-Continuous Continuum



Scale Measurement Example



Transformations and Difficulties--Variables, Data, and Information



Building Mineable Data Representations



Data Representation



Building Data--Dealing with Variables



Building Mineable Data Sets



Summary


Supplemental Material


Combinations



Data Preparation as a Process



Data Preparation: Inputs, Outputs, Models, and Decisions



Step 1: Prepare the Data



Step 2: Survey the Data



Step 3: Model the Data



Use the Model



Modeling Tools and Data Preparation



How Modeling Tools Drive Data Preparation



Decision Trees



Decision Lists



Neural Networks



Evolution Programs



Modeling Data with the Tools



Predictions and Rules



Choosing Techniques



Missing Data and Modeling Tools



Stages of Data Preparation



Stage 1: Accessing the Data



Stage 2: Auditing the Data



Stage 3: Enhancing and Enriching the Data



Stage 4: Looking for Sampling Bias



Stage 5: Determining Data Structure (Super-, Macro-, and Micro-)



Stage 6: Building the PIE



Stage 7: Surveying the Data



Stage 8: Modeling the Data



And the Result Is ...?



Getting the Data: Basic Preparation



Data Discovery



Data Access Issues



Data Characterization



Detail/Aggregation Level (Granularity)



Consistency



Pollution



Objects



Relationship



Domain



Defaults



Integrity



Concurrency



Duplicate or Redundant Variables



Data Set Assembly



Reverse Pivoting



Feature Extraction



Physical or Behavioral Data Sets



Explanatory Structure



Data Enhancement or Enrichment



Sampling Bias



Example 1: Credit



Looking at the Variables



Relationships between Variables



Example 2: Shoe



Looking at the Variables



Relationships between Variables



The Data Assay



Sampling, Variability, and Confidence



Sampling, or First Catch Your Hare!



How Much Data?



Variability



Converging on a Representative Sample



Measuring Variability



Variability and Deviation



Confidence



Variability of Numeric Variables



Variability and Sampling



Variability and Convergence



Variability and Confidence in Alpha Variables



Ordering and Rate of Discovery



Measuring Confidence



Modeling and Confidence with the Whole Population



Testing for Confidence



Confidence Tests and Variability



Confidence in Capturing Variability



A Brief Introduction to the Normal Distribution



Normally Distributed Probabilities



Capturing Normally Distributed Probabilities: An Example



Capturing Confidence, Capturing Variance



Problems and Shortcomings of Taking Samples Using Variability



Missing Values



Constants (Variables with Only One Value)



Problems with Sampling



Monotonic Variable Detection



Interstitial Linearity



Rate of Discovery



Confidence and Instance Count



Summary


Supplemental Material


Confidence Samples



Handling Nonnumerical Variables



Representing Alphas and Remapping



One-of-n Remapping



m-of-n Remapping



Remapping to Eliminate Ordering



Remapping One-to-Many Patterns, or Ill-Formed Problems



Remapping Circular Discontinuity



State Space



Unit State Space



Pythagoras in State Space



Position in State Space



Neighbors and Associates



Density and Sparsity



Nearby and Distant Nearest Neighbors



Normalizing Measured Point Separation



Contours, Peaks, and Valleys



Mapping State Space



Objects in State Space



Phase Space



Mapping Alpha Values



Location, Location, Location!



Numerics, Alphas, and the Montreal Canadiens



Joint Distribution Tables



Two-Way Tables



More Values, More Variables, and Meaning of the Numeration



Dealing with Low-Frequency Alpha Labels and Other Problems



Dimensionality



Multidimensional Scaling



Squashing a Triangle



Projecting Alpha Values



Scree Plots



Practical Consideration--Implementing Alpha Numeration in the Demonstration Code



Implementing Neighborhoods



Implementing Numeration in All Alpha Data Sets



Implementing Dimensionality Reduction for Variables



Summary



Normalizing and Redistributing Variables



Normalizing a Variable's Range



Review of Data Preparation and Modeling (Training, Testing, and Execution)



The Nature and Scope of the Out-of-Range Values Problem



Discovering the Range of Values When Building the PIE



Out-of-Range Values When Training



Out-of-Range Values When Testing



Out-of-Range Values When Executing



Scaling Transformations



Softmax Scaling



Normalizing Ranges



Redistributing Variable Values



The Nature of Distributions



Distributive Difficulties



Adjusting Distributions



Modified Distributions



Summary


Supplemental Material


The Logistic Function


Modifying the Linear Part of the Logistic Function Range



Replacing Missing and Empty Values



Retaining Information about Missing Values



Missing-Value Patterns



Capturing Patterns



Replacing Missing Values



Unbiased Estimators



Variability Relationships



Relationships between Variables



Preserving Between-Variable Relationships



Summary


Supplemental Material


Using Regression to Find Least Information-Damaging Missing Values


Alternative Methods of Missing-Value Replacement



Series Variables



Here There Be Dragons!



Types of Series



Describing Series Data



Constructing a Series



Features of a Series



Describing a Series--Fourier



Describing a Series--Spectrum



Describing a Series--Trend, Seasonality, Cycles, Noise



Describing a Series--Autocorrelation



Modeling Series Data



Repairing Series Data Problems



Missing Values



Outliers



Nonuniform Displacement



Trend



Tools



Filtering



Moving Averages



Smoothing 1--PVM Smoothing



Smoothing 2--Median Smoothing, Resmoothing, and Hanning



Extraction



Differencing



Other Problems



Numerating Alpha Values



Distribution



Normalization



Preparing Series Data



Looking at the Data



Signposts on the Rocky Road



Implementation Notes



Preparing the Data Set



Using Sparsely Populated Variables



Increasing Information Density Using Sparsely Populated Variables



Binning Sparse Numerical Values



Present-Value Patterns (PVPs)



Problems with High-Dimensionality Data Sets



Information Representation



Representing High-Dimensionality Data in Fewer Dimensions



Introducing the Neural Network



Training a Neural Network



Neurons



Reshaping the Logistic Curve



Single-Input Neurons



Multiple-Input Neurons



Networking Neurons to Estimate a Function



Network Learning



Network Prediction--Hidden Layer



Network Prediction--Output Layer



Stochastic Network Performance



Network Architecture 1--The Autoassociative Network



Network Architecture 2--The Sparsely Connected Network



Compressing Variables



Using Compressed Dimensionality Data



Removing Variables



Estimating Variable Importance 1: What Doesn't Work



Estimating Variable Importance 2: Clues



Estimating Variable Importance 3: Configuring and Training the Network



How Much Data Is Enough?



Joint Distribution



Capturing Joint Variability



Degrees of Freedom



Beyond Joint Distribution



Enhancing the Data Set



Data Sets in Perspective



Implementation Notes



Collapsing Extremely Sparsely Populated Variables



Reducing Excessive Dimensionality



Measuring Variable Importance



Feature Enhancement



Where Next?



The Data Survey



Introduction to the Data Survey



Information and Communication



Measuring Information: Signals and Dictionaries



Measuring Information: Signals



Measuring Information: Bits of Information



Measuring Information: Surprise



Measuring Information: Entropy



Measuring Information: Dictionaries



Mapping Using Entropy



Whole Data Set Entropy



Conditional Entropy between Inputs and Outputs



Mutual Information



Other Survey Uses for Entropy and Information



Looking for Information



Identifying Problems with a Data Survey



Confidence and Sufficient Data



Detecting Sparsity



Manifold Definition



Clusters



Sampling Bias



Making the Data Survey



Novelty Detection



Other Directions


Supplemental Material


Entropic Analysis--Example


Surveying Data Sets



Using Prepared Data



Modeling Data



Assumptions



Models



Data Mining vs. Exploratory Data Analysis



Characterizing Data



Decision Trees



Clusters



Nearest Neighbor



Neural Networks and Regression



Prepared Data and Modeling Algorithms



Neural Networks and the CREDIT Data Set



Decision Trees and the CREDIT Data Set



Practical Use of Data Preparation and Prepared Data



Looking at Present Modeling Tools and Future Directions



Near Future



Farther Out



Using the Demonstration Code on the CD-ROM


Further Reading


Index


About the Author


About the CD-ROM