Preface

Introduction
Machine Perception
An Example
Related Fields
Pattern Recognition Systems
Sensing
Segmentation and Grouping
Feature Extraction
Classification
Post Processing
The Design Cycle
Data Collection
Feature Choice
Model Choice
Training
Evaluation
Computational Complexity
Learning and Adaptation
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Conclusion
Summary by Chapters
Bibliographical and Historical Remarks
Bibliography
Bayesian Decision Theory

Introduction
Bayesian Decision Theory--Continuous Features
Two-Category Classification
Minimum-Error-Rate Classification
Minimax Criterion
Neyman-Pearson Criterion
Classifiers, Discriminant Functions, and Decision Surfaces
The Multicategory Case
The Two-Category Case
The Normal Density
Univariate Density
Multivariate Density
Discriminant Functions for the Normal Density
Case 1: Σ_i = σ²I
Case 2: Σ_i = Σ
Case 3: Σ_i = arbitrary
Decision Regions for Two-Dimensional Gaussian Data
Error Probabilities and Integrals
Error Bounds for Normal Densities
Chernoff Bound
Bhattacharyya Bound
Error Bounds for Gaussian Distributions
Signal Detection Theory and Operating Characteristics
Bayes Decision Theory--Discrete Features
Independent Binary Features
Bayesian Decisions for Three-Dimensional Binary Data
Missing and Noisy Features
Missing Features
Noisy Features
Bayesian Belief Networks
Belief Network for Fish
Compound Bayesian Decision Theory and Context
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography
Maximum-Likelihood and Bayesian Parameter Estimation

Introduction
Maximum-Likelihood Estimation
The General Principle
The Gaussian Case: Unknown μ
The Gaussian Case: Unknown μ and Σ
Bias
Bayesian Estimation
The Class-Conditional Densities
The Parameter Distribution
Bayesian Parameter Estimation: Gaussian Case
The Univariate Case: p(μ | D)
The Univariate Case: p(x | D)
The Multivariate Case
Bayesian Parameter Estimation: General Theory
Recursive Bayes Learning
When Do Maximum-Likelihood and Bayes Methods Differ?
Noninformative Priors and Invariance
Gibbs Algorithm
Sufficient Statistics
Sufficient Statistics and the Exponential Family
Problems of Dimensionality
Accuracy, Dimension, and Training Sample Size
Computational Complexity
Overfitting
Component Analysis and Discriminants
Principal Component Analysis (PCA)
Fisher Linear Discriminant
Multiple Discriminant Analysis
Expectation-Maximization (EM)
Expectation-Maximization for a 2D Normal Model
Hidden Markov Models
First-Order Markov Models
First-Order Hidden Markov Models
Hidden Markov Model Computation
Evaluation
Hidden Markov Model
Decoding
HMM Decoding
Learning
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography
Nonparametric Techniques

Introduction
Density Estimation
Parzen Windows
Convergence of the Mean
Convergence of the Variance
Illustrations
Classification Example
Probabilistic Neural Networks (PNNs)
Choosing the Window Function
k_n-Nearest-Neighbor Estimation
k_n-Nearest-Neighbor and Parzen-Window Estimation
Estimation of A Posteriori Probabilities
The Nearest-Neighbor Rule
Convergence of the Nearest Neighbor
Error Rate for the Nearest-Neighbor Rule
Error Bounds
The k-Nearest-Neighbor Rule
Computational Complexity of the k-Nearest-Neighbor Rule
Metrics and Nearest-Neighbor Classification
Properties of Metrics
Tangent Distance
Fuzzy Classification
Reduced Coulomb Energy Networks
Approximations by Series Expansions
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography
Linear Discriminant Functions

Introduction
Linear Discriminant Functions and Decision Surfaces
The Two-Category Case
The Multicategory Case
Generalized Linear Discriminant Functions
The Two-Category Linearly Separable Case
Geometry and Terminology
Gradient Descent Procedures
Minimizing the Perceptron Criterion Function
The Perceptron Criterion Function
Convergence Proof for Single-Sample Correction
Some Direct Generalizations
Relaxation Procedures
The Descent Algorithm
Convergence Proof
Nonseparable Behavior
Minimum Squared-Error Procedures
Minimum Squared-Error and the Pseudoinverse
Constructing a Linear Classifier by Matrix Pseudoinverse
Relation to Fisher's Linear Discriminant
Asymptotic Approximation to an Optimal Discriminant
The Widrow-Hoff or LMS Procedure
Stochastic Approximation Methods
The Ho-Kashyap Procedures
The Descent Procedure
Convergence Proof
Nonseparable Behavior
Some Related Procedures
Linear Programming Algorithms
Linear Programming
The Linearly Separable Case
Minimizing the Perceptron Criterion Function
Support Vector Machines
SVM Training
SVM for the XOR Problem
Multicategory Generalizations
Kesler's Construction
Convergence of the Fixed-Increment Rule
Generalizations for MSE Procedures
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography
Multilayer Neural Networks

Introduction
Feedforward Operation and Classification
General Feedforward Operation
Expressive Power of Multilayer Networks
Backpropagation Algorithm
Network Learning
Training Protocols
Learning Curves
Error Surfaces
Some Small Networks
The Exclusive-OR (XOR)
Larger Networks
How Important Are Multiple Minima?
Backpropagation as Feature Mapping
Representations at the Hidden Layer--Weights
Backpropagation, Bayes Theory and Probability
Bayes Discriminants and Neural Networks
Outputs as Probabilities
Related Statistical Techniques
Practical Techniques for Improving Backpropagation
Activation Function
Parameters for the Sigmoid
Scaling Input
Target Values
Training with Noise
Manufacturing Data
Number of Hidden Units
Initializing Weights
Learning Rates
Momentum
Weight Decay
Hints
On-Line, Stochastic or Batch Training?
Stopped Training
Number of Hidden Layers
Criterion Function
Second-Order Methods
Hessian Matrix
Newton's Method
Quickprop
Conjugate Gradient Descent
Conjugate Gradient Descent
Additional Networks and Training Methods
Radial Basis Function Networks (RBFs)
Special Bases
Matched Filters
Convolutional Networks
Recurrent Networks
Cascade-Correlation
Regularization, Complexity Adjustment and Pruning
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography
Stochastic Methods

Introduction
Stochastic Search
Simulated Annealing
The Boltzmann Factor
Deterministic Simulated Annealing
Boltzmann Learning
Stochastic Boltzmann Learning of Visible States
Missing Features and Category Constraints
Deterministic Boltzmann Learning
Initialization and Setting Parameters
Boltzmann Networks and Graphical Models
Other Graphical Models
Evolutionary Methods
Genetic Algorithms
Further Heuristics
Why Do They Work?
Genetic Programming
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography
Nonmetric Methods

Introduction
Decision Trees
CART
Number of Splits
Query Selection and Node Impurity
When to Stop Splitting
Pruning
Assignment of Leaf Node Labels
A Simple Tree
Computational Complexity
Feature Choice
Multivariate Decision Trees
Priors and Costs
Missing Attributes
Surrogate Splits and Missing Attributes
Other Tree Methods
ID3
C4.5
Which Tree Classifier Is Best?
Recognition with Strings
String Matching
Edit Distance
Computational Complexity
String Matching with Errors
String Matching with the "Don't-Care" Symbol
Grammatical Methods
Grammars
Types of String Grammars
A Grammar for Pronouncing Numbers
Recognition Using Grammars
Grammatical Inference
Grammatical Inference
Rule-Based Methods
Learning Rules
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography
Algorithm-Independent Machine Learning

Introduction
Lack of Inherent Superiority of Any Classifier
No Free Lunch Theorem
No Free Lunch for Binary Data
Ugly Duckling Theorem
Minimum Description Length (MDL)
Minimum Description Length Principle
Overfitting Avoidance and Occam's Razor
Bias and Variance
Bias and Variance for Regression
Bias and Variance for Classification
Resampling for Estimating Statistics
Jackknife
Jackknife Estimate of Bias and Variance of the Mode
Bootstrap
Resampling for Classifier Design
Bagging
Boosting
Learning with Queries
Arcing, Learning with Queries, Bias and Variance
Estimating and Comparing Classifiers
Parametric Models
Cross-Validation
Jackknife and Bootstrap Estimation of Classification Accuracy
Maximum-Likelihood Model Comparison
Bayesian Model Comparison
The Problem-Average Error Rate
Predicting Final Performance from Learning Curves
The Capacity of a Separating Plane
Combining Classifiers
Component Classifiers with Discriminant Functions
Component Classifiers without Discriminant Functions
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography
Unsupervised Learning and Clustering

Introduction
Mixture Densities and Identifiability
Maximum-Likelihood Estimates
Application to Normal Mixtures
Case 1: Unknown Mean Vectors
Case 2: All Parameters Unknown
k-Means Clustering
Fuzzy k-Means Clustering
Unsupervised Bayesian Learning
The Bayes Classifier
Learning the Parameter Vector
Unsupervised Learning of Gaussian Data
Decision-Directed Approximation
Data Description and Clustering
Similarity Measures
Criterion Functions for Clustering
The Sum-of-Squared-Error Criterion
Related Minimum Variance Criteria
Scatter Criteria
Clustering Criteria
Iterative Optimization
Hierarchical Clustering
Definitions
Agglomerative Hierarchical Clustering
Stepwise-Optimal Hierarchical Clustering
Hierarchical Clustering and Induced Metrics
The Problem of Validity
On-Line Clustering
Unknown Number of Clusters
Adaptive Resonance
Learning with a Critic
Graph-Theoretic Methods
Component Analysis
Principal Component Analysis (PCA)
Nonlinear Component Analysis (NLCA)
Independent Component Analysis (ICA)
Low-Dimensional Representations and Multidimensional Scaling (MDS)
Self-Organizing Feature Maps
Clustering and Dimensionality Reduction
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography
Mathematical Foundations

Notation
Linear Algebra
Notation and Preliminaries
Inner Product
Outer Product
Derivatives of Matrices
Determinant and Trace
Matrix Inversion
Eigenvectors and Eigenvalues
Lagrange Optimization
Probability Theory
Discrete Random Variables
Expected Values
Pairs of Discrete Random Variables
Statistical Independence
Expected Values of Functions of Two Variables
Conditional Probability
The Law of Total Probability and Bayes' Rule
Vector Random Variables
Expectations, Mean Vectors and Covariance Matrices
Continuous Random Variables
Distributions of Sums of Independent Random Variables
Normal Distributions
Gaussian Derivatives and Integrals
Multivariate Normal Densities
Bivariate Normal Densities
Hypothesis Testing
Chi-Squared Test
Information Theory
Entropy and Information
Relative Entropy
Mutual Information
Computational Complexity
Bibliography

Index