Preface

Introduction
Machine Perception
An Example
Related Fields
Pattern Recognition Systems
Sensing
Segmentation and Grouping
Feature Extraction
Classification
Post Processing
The Design Cycle
Data Collection
Feature Choice
Model Choice
Training
Evaluation
Computational Complexity
Learning and Adaptation
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Conclusion
Summary by Chapters
Bibliographical and Historical Remarks
Bibliography

Bayesian Decision Theory
Introduction
Bayesian Decision Theory--Continuous Features
Two-Category Classification
Minimum-Error-Rate Classification
Minimax Criterion
Neyman-Pearson Criterion
Classifiers, Discriminant Functions, and Decision Surfaces
The Multicategory Case
The Two-Category Case
The Normal Density
Univariate Density
Multivariate Density
Discriminant Functions for the Normal Density
Case 1: Σ_i = σ²I
Case 2: Σ_i = Σ
Case 3: Σ_i = arbitrary
Decision Regions for Two-Dimensional Gaussian Data
Error Probabilities and Integrals
Error Bounds for Normal Densities
Chernoff Bound
Bhattacharyya Bound
Error Bounds for Gaussian Distributions
Signal Detection Theory and Operating Characteristics
Bayes Decision Theory--Discrete Features
Independent Binary Features
Bayesian Decisions for Three-Dimensional Binary Data
Missing and Noisy Features
Missing Features
Noisy Features
Bayesian Belief Networks
Belief Network for Fish
Compound Bayesian Decision Theory and Context
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography

Maximum-Likelihood and Bayesian Parameter Estimation
Introduction
Maximum-Likelihood Estimation
The General Principle
The Gaussian Case: Unknown μ
The Gaussian Case: Unknown μ and Σ
Bias
Bayesian Estimation
The Class-Conditional Densities
The Parameter Distribution
Bayesian Parameter Estimation: Gaussian Case
The Univariate Case: p(μ|D)
The Univariate Case: p(x|D)
The Multivariate Case
Bayesian Parameter Estimation: General Theory
Recursive Bayes Learning
When Do Maximum-Likelihood and Bayes Methods Differ?
Noninformative Priors and Invariance
Gibbs Algorithm
Sufficient Statistics
Sufficient Statistics and the Exponential Family
Problems of Dimensionality
Accuracy, Dimension, and Training Sample Size
Computational Complexity
Overfitting
Component Analysis and Discriminants
Principal Component Analysis (PCA)
Fisher Linear Discriminant
Multiple Discriminant Analysis
Expectation-Maximization (EM)
Expectation-Maximization for a 2D Normal Model
Hidden Markov Models
First-Order Markov Models
First-Order Hidden Markov Models
Hidden Markov Model Computation
Evaluation
Hidden Markov Model
Decoding
HMM Decoding
Learning
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography

Nonparametric Techniques
Introduction
Density Estimation
Parzen Windows
Convergence of the Mean
Convergence of the Variance
Illustrations
Classification Example
Probabilistic Neural Networks (PNNs)
Choosing the Window Function
k_n-Nearest-Neighbor Estimation
k_n-Nearest-Neighbor and Parzen-Window Estimation
Estimation of A Posteriori Probabilities
The Nearest-Neighbor Rule
Convergence of the Nearest Neighbor
Error Rate for the Nearest-Neighbor Rule
Error Bounds
The k-Nearest-Neighbor Rule
Computational Complexity of the k-Nearest-Neighbor Rule
Metrics and Nearest-Neighbor Classification
Properties of Metrics
Tangent Distance
Fuzzy Classification
Reduced Coulomb Energy Networks
Approximations by Series Expansions
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography

Linear Discriminant Functions
Introduction
Linear Discriminant Functions and Decision Surfaces
The Two-Category Case
The Multicategory Case
Generalized Linear Discriminant Functions
The Two-Category Linearly Separable Case
Geometry and Terminology
Gradient Descent Procedures
Minimizing the Perceptron Criterion Function
The Perceptron Criterion Function
Convergence Proof for Single-Sample Correction
Some Direct Generalizations
Relaxation Procedures
The Descent Algorithm
Convergence Proof
Nonseparable Behavior
Minimum Squared-Error Procedures
Minimum Squared-Error and the Pseudoinverse
Constructing a Linear Classifier by Matrix Pseudoinverse
Relation to Fisher's Linear Discriminant
Asymptotic Approximation to an Optimal Discriminant
The Widrow-Hoff or LMS Procedure
Stochastic Approximation Methods
The Ho-Kashyap Procedures
The Descent Procedure
Convergence Proof
Nonseparable Behavior
Some Related Procedures
Linear Programming Algorithms
Linear Programming
The Linearly Separable Case
Minimizing the Perceptron Criterion Function
Support Vector Machines
SVM Training
SVM for the XOR Problem
Multicategory Generalizations
Kesler's Construction
Convergence of the Fixed-Increment Rule
Generalizations for MSE Procedures
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography

Multilayer Neural Networks
Introduction
Feedforward Operation and Classification
General Feedforward Operation
Expressive Power of Multilayer Networks
Backpropagation Algorithm
Network Learning
Training Protocols
Learning Curves
Error Surfaces
Some Small Networks
The Exclusive-OR (XOR)
Larger Networks
How Important Are Multiple Minima?
Backpropagation as Feature Mapping
Representations at the Hidden Layer--Weights
Backpropagation, Bayes Theory and Probability
Bayes Discriminants and Neural Networks
Outputs as Probabilities
Related Statistical Techniques
Practical Techniques for Improving Backpropagation
Activation Function
Parameters for the Sigmoid
Scaling Input
Target Values
Training with Noise
Manufacturing Data
Number of Hidden Units
Initializing Weights
Learning Rates
Momentum
Weight Decay
Hints
On-Line, Stochastic or Batch Training?
Stopped Training
Number of Hidden Layers
Criterion Function
Second-Order Methods
Hessian Matrix
Newton's Method
Quickprop
Conjugate Gradient Descent
Conjugate Gradient Descent
Additional Networks and Training Methods
Radial Basis Function Networks (RBFs)
Special Bases
Matched Filters
Convolutional Networks
Recurrent Networks
Cascade-Correlation
Regularization, Complexity Adjustment and Pruning
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography

Stochastic Methods
Introduction
Stochastic Search
Simulated Annealing
The Boltzmann Factor
Deterministic Simulated Annealing
Boltzmann Learning
Stochastic Boltzmann Learning of Visible States
Missing Features and Category Constraints
Deterministic Boltzmann Learning
Initialization and Setting Parameters
Boltzmann Networks and Graphical Models
Other Graphical Models
Evolutionary Methods
Genetic Algorithms
Further Heuristics
Why Do They Work?
Genetic Programming
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography

Nonmetric Methods
Introduction
Decision Trees
CART
Number of Splits
Query Selection and Node Impurity
When to Stop Splitting
Pruning
Assignment of Leaf Node Labels
A Simple Tree
Computational Complexity
Feature Choice
Multivariate Decision Trees
Priors and Costs
Missing Attributes
Surrogate Splits and Missing Attributes
Other Tree Methods
ID3
C4.5
Which Tree Classifier Is Best?
Recognition with Strings
String Matching
Edit Distance
Computational Complexity
String Matching with Errors
String Matching with the "Don't-Care" Symbol
Grammatical Methods
Grammars
Types of String Grammars
A Grammar for Pronouncing Numbers
Recognition Using Grammars
Grammatical Inference
Grammatical Inference
Rule-Based Methods
Learning Rules
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography

Algorithm-Independent Machine Learning
Introduction
Lack of Inherent Superiority of Any Classifier
No Free Lunch Theorem
No Free Lunch for Binary Data
Ugly Duckling Theorem
Minimum Description Length (MDL)
Minimum Description Length Principle
Overfitting Avoidance and Occam's Razor
Bias and Variance
Bias and Variance for Regression
Bias and Variance for Classification
Resampling for Estimating Statistics
Jackknife
Jackknife Estimate of Bias and Variance of the Mode
Bootstrap
Resampling for Classifier Design
Bagging
Boosting
Learning with Queries
Arcing, Learning with Queries, Bias and Variance
Estimating and Comparing Classifiers
Parametric Models
Cross-Validation
Jackknife and Bootstrap Estimation of Classification Accuracy
Maximum-Likelihood Model Comparison
Bayesian Model Comparison
The Problem-Average Error Rate
Predicting Final Performance from Learning Curves
The Capacity of a Separating Plane
Combining Classifiers
Component Classifiers with Discriminant Functions
Component Classifiers without Discriminant Functions
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography

Unsupervised Learning and Clustering
Introduction
Mixture Densities and Identifiability
Maximum-Likelihood Estimates
Application to Normal Mixtures
Case 1: Unknown Mean Vectors
Case 2: All Parameters Unknown
k-Means Clustering
Fuzzy k-Means Clustering
Unsupervised Bayesian Learning
The Bayes Classifier
Learning the Parameter Vector
Unsupervised Learning of Gaussian Data
Decision-Directed Approximation
Data Description and Clustering
Similarity Measures
Criterion Functions for Clustering
The Sum-of-Squared-Error Criterion
Related Minimum Variance Criteria
Scatter Criteria
Clustering Criteria
Iterative Optimization
Hierarchical Clustering
Definitions
Agglomerative Hierarchical Clustering
Stepwise-Optimal Hierarchical Clustering
Hierarchical Clustering and Induced Metrics
The Problem of Validity
On-line Clustering
Unknown Number of Clusters
Adaptive Resonance
Learning with a Critic
Graph-Theoretic Methods
Component Analysis
Principal Component Analysis (PCA)
Nonlinear Component Analysis (NLCA)
Independent Component Analysis (ICA)
Low-Dimensional Representations and Multidimensional Scaling (MDS)
Self-Organizing Feature Maps
Clustering and Dimensionality Reduction
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography

Mathematical Foundations
Notation
Linear Algebra
Notation and Preliminaries
Inner Product
Outer Product
Derivatives of Matrices
Determinant and Trace
Matrix Inversion
Eigenvectors and Eigenvalues
Lagrange Optimization
Probability Theory
Discrete Random Variables
Expected Values
Pairs of Discrete Random Variables
Statistical Independence
Expected Values of Functions of Two Variables
Conditional Probability
The Law of Total Probability and Bayes' Rule
Vector Random Variables
Expectations, Mean Vectors and Covariance Matrices
Continuous Random Variables
Distributions of Sums of Independent Random Variables
Normal Distributions
Gaussian Derivatives and Integrals
Multivariate Normal Densities
Bivariate Normal Densities
Hypothesis Testing
Chi-Squared Test
Information Theory
Entropy and Information
Relative Entropy
Mutual Information
Computational Complexity
Bibliography

Index