| |
| |
List of Figures | |
| |
| |
Series Foreword | |
| |
| |
Foreword | |
| |
| |
Preface | |
| |
| |
| |
Introductory Material | |
| |
| |
| |
Learning, Regularity, and Compression | |
| |
| |
| |
Regularity and Learning | |
| |
| |
| |
Regularity and Compression | |
| |
| |
| |
Solomonoff's Breakthrough - Kolmogorov Complexity | |
| |
| |
| |
Making the Idea Applicable | |
| |
| |
| |
Crude MDL, Refined MDL and Universal Coding | |
| |
| |
| |
From Crude to Refined MDL | |
| |
| |
| |
Universal Coding and Refined MDL | |
| |
| |
| |
Refined MDL for Model Selection | |
| |
| |
| |
Refined MDL for Prediction and Hypothesis Selection | |
| |
| |
| |
Some Remarks on Model Selection | |
| |
| |
| |
Model Selection among Non-Nested Models | |
| |
| |
| |
Goals of Model vs. Point Hypothesis Selection | |
| |
| |
| |
The MDL Philosophy | |
| |
| |
| |
MDL, Occam's Razor, and the "True Model" | |
| |
| |
| |
Answer to Criticism No. 1 | |
| |
| |
| |
Answer to Criticism No. 2 | |
| |
| |
| |
History and Forms of MDL | |
| |
| |
| |
What Is MDL? | |
| |
| |
| |
MDL Literature | |
| |
| |
| |
Summary and Outlook | |
| |
| |
| |
Probabilistic and Statistical Preliminaries | |
| |
| |
| |
General Mathematical Preliminaries | |
| |
| |
| |
Probabilistic Preliminaries | |
| |
| |
| |
Definitions; Notational Conventions | |
| |
| |
| |
Probabilistic Sources | |
| |
| |
| |
Limit Theorems and Statements | |
| |
| |
| |
Probabilistic Models | |
| |
| |
| |
Probabilistic Model Classes | |
| |
| |
| |
Kinds of Probabilistic Models | |
| |
| |
| |
Terminological Preliminaries | |
| |
| |
| |
Modeling Preliminaries: Goals and Methods for Inductive Inference | |
| |
| |
| |
Consistency | |
| |
| |
| |
Basic Concepts of Bayesian Statistics | |
| |
| |
| |
Summary and Outlook | |
| |
| |
| |
Information-Theoretic Preliminaries | |
| |
| |
| |
Coding Preliminaries | |
| |
| |
| |
Restriction to Prefix Coding Systems; Descriptions as Messages | |
| |
| |
| |
Different Kinds of Codes | |
| |
| |
| |
Assessing the Efficiency of Description Methods | |
| |
| |
| |
The Most Important Section of This Book: Probabilities and Code Lengths | |
| |
| |
| |
The Kraft Inequality | |
| |
| |
| |
Code Lengths "Are" Probabilities | |
| |
| |
| |
Immediate Insights and Consequences | |
| |
| |
| |
Probabilities and Code Lengths, Part II | |
| |
| |
| |
(Relative) Entropy and the Information Inequality | |
| |
| |
| |
Uniform Codes, Maximum Entropy, and Minimax Codelength | |
| |
| |
| |
Summary, Outlook, Further Reading | |
| |
| |
| |
Information-Theoretic Properties of Statistical Models | |
| |
| |
| |
Introduction | |
| |
| |
| |
Likelihood and Observed Fisher Information | |
| |
| |
| |
KL Divergence and Expected Fisher Information | |
| |
| |
| |
Maximum Likelihood: Data vs. Parameters | |
| |
| |
| |
Summary and Outlook | |
| |
| |
| |
Crude Two-Part Code MDL | |
| |
| |
| |
Introduction: Making Two-Part MDL Precise | |
| |
| |
| |
Two-Part Code MDL for Markov Chain Selection | |
| |
| |
| |
The Code C[subscript 2] | |
| |
| |
| |
The Code C[subscript 1] | |
| |
| |
| |
Crude Two-Part Code MDL for Markov Chains | |
| |
| |
| |
Simplistic Two-Part Code MDL Hypothesis Selection | |
| |
| |
| |
Two-Part MDL for Tasks Other Than Hypothesis Selection | |
| |
| |
| |
Behavior of Two-Part Code MDL | |
| |
| |
| |
Two-Part Code MDL and Maximum Likelihood | |
| |
| |
| |
The Maximum Likelihood Principle | |
| |
| |
| |
MDL vs. ML | |
| |
| |
| |
MDL as a Maximum Probability Principle | |
| |
| |
| |
Computing and Approximating Two-Part MDL in Practice | |
| |
| |
| |
Justifying Crude MDL: Consistency and Code Design | |
| |
| |
| |
A General Consistency Result | |
| |
| |
| |
Code Design for Two-Part Code MDL | |
| |
| |
| |
Summary and Outlook | |
| |
| |
| |
Appendix: Proof of Theorem 5.1 | |
| |
| |
| |
Universal Coding | |
| |
| |
| |
Universal Coding with Countable Models | |
| |
| |
| |
Universal Coding: The Basic Idea | |
| |
| |
| |
Two-part Codes as Simple Universal Codes | |
| |
| |
| |
From Universal Codes to Universal Models | |
| |
| |
| |
Formal Definition of Universality | |
| |
| |
| |
The Finite Case | |
| |
| |
| |
Minimax Regret and Normalized ML | |
| |
| |
| |
NML vs. Two-Part vs. Bayes | |
| |
| |
| |
The Countably Infinite Case | |
| |
| |
| |
The Two-Part and Bayesian Codes | |
| |
| |
| |
The NML Code | |
| |
| |
| |
Prequential Universal Models | |
| |
| |
| |
Distributions as Prediction Strategies | |
| |
| |
| |
Bayes Is Prequential; NML and Two-part Are Not | |
| |
| |
| |
The Prequential Plug-In Model | |
| |
| |
| |
Individual vs. Stochastic Universality | |
| |
| |
| |
Stochastic Redundancy | |
| |
| |
| |
Uniformly Universal Models | |
| |
| |
| |
Summary, Outlook and Further Reading | |
| |
| |
| |
Parametric Models: Normalized Maximum Likelihood | |
| |
| |
| |
Introduction | |
| |
| |
| |
Preliminaries | |
| |
| |
| |
Asymptotic Expansion of Parametric Complexity | |
| |
| |
| |
The Meaning of [Characters not reproducible] | |
| |
| |
| |
Complexity and Functional Form | |
| |
| |
| |
KL Divergence and Distinguishability | |
| |
| |
| |
Complexity and Volume | |
| |
| |
| |
Complexity and the Number of Distinguishable Distributions | |
| |
| |
| |
Explicit and Simplified Computations | |
| |
| |
| |
Parametric Models: Bayes | |
| |
| |
| |
The Bayesian Regret | |
| |
| |
| |
Basic Interpretation of Theorem 8.1 | |
| |
| |
| |
Bayes Meets Minimax - Jeffreys' Prior | |
| |
| |
| |
Jeffreys' Prior and the Boundary | |
| |
| |
| |
How to Prove the Bayesian and NML Regret Theorems | |
| |
| |
| |
Proof Sketch of Theorem 8.1 | |
| |
| |
| |
Beyond Exponential Families | |
| |
| |
| |
Proof Sketch of Theorem 7.1 | |
| |
| |
| |
Stochastic Universality | |
| |
| |
| |
Appendix: Proofs of Theorem 8.1 and Theorem 8.2 | |
| |
| |
| |
Parametric Models: Prequential Plug-in | |
| |
| |
| |
Prequential Plug-in for Exponential Families | |
| |
| |
| |
The Plug-in vs. the Bayes Universal Model | |
| |
| |
| |
More Precise Asymptotics | |
| |
| |
| |
Summary | |
| |
| |
| |
Parametric Models: Two-Part | |
| |
| |
| |
The Ordinary Two-Part Universal Model | |
| |
| |
| |
Derivation of the Two-Part Code Regret | |
| |
| |
| |
Proof Sketch of Theorem 10.1 | |
| |
| |
| |
Discussion | |
| |
| |
| |
The Conditional Two-Part Universal Code | |
| |
| |
| |
Conditional Two-Part Codes for Discrete Exponential Families | |
| |
| |
| |
Distinguishability and the Phase Transition | |
| |
| |
| |
Summary and Outlook | |
| |
| |
| |
NML With Infinite Complexity | |
| |
| |
| |
Introduction | |
| |
| |
| |
Examples of Undefined NML Distribution | |
| |
| |
| |
Examples of Undefined Jeffreys' Prior | |
| |
| |
| |
Metauniversal Codes | |
| |
| |
| |
Constrained Parametric Complexity | |
| |
| |
| |
Meta-Two-Part Coding | |
| |
| |
| |
Renormalized Maximum Likelihood | |
| |
| |
| |
NML with Luckiness | |
| |
| |
| |
Asymptotic Expansion of LNML | |
| |
| |
| |
Conditional Universal Models | |
| |
| |
| |
Bayesian Approach with Jeffreys' Prior | |
| |
| |
| |
Conditional NML | |
| |
| |
| |
Liang and Barron's Approach | |
| |
| |
| |
Summary and Remarks | |
| |
| |
| |
Appendix: Proof of Theorem 11.4 | |
| |
| |
| |
Linear Regression | |
| |
| |
| |
Introduction | |
| |
| |
| |
Prelude: The Normal Location Family | |
| |
| |
| |
Least-Squares Estimation | |
| |
| |
| |
The Normal Equations | |
| |
| |
| |
Composition of Experiments | |
| |
| |
| |
Penalized Least-Squares | |
| |
| |
| |
The Linear Model | |
| |
| |
| |
Bayesian Linea Model M[superscript X] with Gaussian Prior | |
| |
| |
| |
Bayesian Linear Models M[superscript X] and S[superscript X] with Noninformative Priors | |
| |
| |
| |
Universal Models for Linear Regression | |
| |
| |
| |
NML | |
| |
| |
| |
Bayes and LNML | |
| |
| |
| |
Bayes-Jeffreys and CNML | |
| |
| |
| |
Beyond Parametrics | |
| |
| |
| |
Introduction | |
| |
| |
| |
CUP: Unions of Parametric Models | |
| |
| |
| |
CUP vs. Parametric Models | |
| |
| |
| |
Universal Codes Based on Histograms | |
| |
| |
| |
Redundancy of Universal CUP Histogram Codes | |
| |
| |
| |
Nonparametric Redundancy | |
| |
| |
| |
Standard CUP Universal Codes | |
| |
| |
| |
Minimax Nonparametric Redundancy | |
| |
| |
| |
Gaussian Process Regression | |
| |
| |
| |
Kernelization of Bayesian Linear Regression | |
| |
| |
| |
Gaussian Processes | |
| |
| |
| |
Gaussian Processes as Universal Models | |
| |
| |
| |
Conclusion and Further Reading | |
| |
| |
| |
Refined MDL | |
| |
| |
| |
MDL Model Selection | |
| |
| |
| |
Introduction | |
| |
| |
| |
Simple Refined MDL Model Selection | |
| |
| |
| |
Compression Interpretation | |
| |
| |
| |
Counting Interpretation | |
| |
| |
| |
Bayesian Interpretation | |
| |
| |
| |
Prequential Interpretation | |
| |
| |
| |
General Parametric Model Selection | |
| |
| |
| |
Models with Infinite Complexities | |
| |
| |
| |
Comparing Many or Infinitely Many Models | |
| |
| |
| |
The General Picture | |
| |
| |
| |
Practical Issues in MDL Model Selection | |
| |
| |
| |
Calculating Universal Codelengths | |
| |
| |
| |
Computational Efficiency and Practical Quality of Non-NML Universal Codes | |
| |
| |
| |
Model Selection with Conditional NML and Plug-in Codes | |
| |
| |
| |
General Warnings about Model Selection | |
| |
| |
| |
MDL Model Selection for Linear Regression | |
| |
| |
| |
Rissanen's RNML Approach | |
| |
| |
| |
Hansen and Yu's gMDL Approach | |
| |
| |
| |
Liang and Barron's Approach | |
| |
| |
| |
Discussion | |
| |
| |
| |
Worst Case vs. Average Case | |
| |
| |
| |
MDL Prediction and Estimation | |
| |
| |
| |
Introduction | |
| |
| |
| |
MDL for Prediction and Predictive Estimation | |
| |
| |
| |
Prequential MDL Estimators | |
| |
| |
| |
Prequential MDL Estimators Are Consistent | |
| |
| |
| |
Parametric and Nonparametric Examples | |
| |
| |
| |
Cesaro KL consistency vs. KL consistency | |
| |
| |
| |
Two-Part Code MDL for Point Hypothesis Selection | |
| |
| |
| |
Discussion of Two-Part Consistency Theorem | |
| |
| |
| |
MDL Parameter Estimation | |
| |
| |
| |
MDL Estimators vs. Luckiness ML Estimators | |
| |
| |
| |
What Estimator To Use? | |
| |
| |
| |
Comparison to Bayesian Estimators | |
| |
| |
| |
Summary and Outlook | |
| |
| |
| |
Appendix: Proof of Theorem 15.3 | |
| |
| |
| |
MDL Consistency and Convergence | |
| |
| |
| |
Introduction | |
| |
| |
| |
The Scenarios Considered | |
| |
| |
| |
Consistency: Prequential and Two-Part MDL Estimators | |
| |
| |
| |
Consistency: MDL Model Selection | |
| |
| |
| |
Selection between a Union of Parametric Models | |
| |
| |
| |
Nonparametric Model Selection Based on CUP Model Class | |
| |
| |
| |
MDL Consistency Peculiarities | |
| |
| |
| |
Risks and Rates | |
| |
| |
| |
Relations between Divergences and Risk Measures | |
| |
| |
| |
Minimax Rates | |
| |
| |
| |
MDL Rates of Convergence | |
| |
| |
| |
Prequential and Two-Part MDL Estimators | |
| |
| |
| |
MDL Model Selection | |
| |
| |
| |
MDL in Context | |
| |
| |
| |
MDL and Frequentist Paradigms | |
| |
| |
| |
Sanity Check or Design Principle? | |
| |
| |
| |
The Weak Prequential Principle | |
| |
| |
| |
MDL vs. Frequentist Principles: Remaining Issues | |
| |
| |
| |
MDL and Bayesian Inference | |
| |
| |
| |
Luckiness Functions vs. Prior Distributions | |
| |
| |
| |
MDL, Bayes, and Occam | |
| |
| |
| |
MDL and Brands of Bayesian Statistics | |
| |
| |
| |
Conclusion: a Common Future after All? | |
| |
| |
| |
MDL, AIC and BIC | |
| |
| |
| |
BIC | |
| |
| |
| |
AIC | |
| |
| |
| |
Combining the Best of AIC and BIC | |
| |
| |
| |
MDL and MML | |
| |
| |
| |
Strict Minimum Message Length | |
| |
| |
| |
Comparison to MDL | |
| |
| |
| |
The Wallace-Freeman Estimator | |
| |
| |
| |
MDL and Prequential Analysis | |
| |
| |
| |
MDL and Cross-Validation | |
| |
| |
| |
MDL and Maximum Entropy | |
| |
| |
| |
Kolmogorov Complexity and Structure Function | |
| |
| |
| |
MDL and Individual Sequence Prediction | |
| |
| |
| |
MDL and Statistical Learning Theory | |
| |
| |
| |
Structural Risk Minimization | |
| |
| |
| |
PAC-Bayesian Approaches | |
| |
| |
| |
PAC-Bayesand MDL | |
| |
| |
| |
The Road Ahead | |
| |
| |
| |
Additional Background | |
| |
| |
| |
The Exponential or "Maximum Entropy" Families | |
| |
| |
| |
Introduction | |
| |
| |
| |
Definition and Overview | |
| |
| |
| |
Basic Properties | |
| |
| |
| |
Mean-Value, Canonical, and Other Parameterizations | |
| |
| |
| |
The Mean Value Parameterization | |
| |
| |
| |
Other Parameterizations | |
| |
| |
| |
Relating Mean-Value and Canonical Parameters | |
| |
| |
| |
Exponential Families of General Probabilistic Sources | |
| |
| |
| |
Fisher Information Definitions and Characterizations | |
| |
| |
| |
Information-Theoretic Properties of Exponential Families | |
| |
| |
| |
Introduction | |
| |
| |
| |
Robustness of Exponential Family Codes | |
| |
| |
| |
If [Characters not reproducible][subscript mean] Does Not Contain the Mean | |
| |
| |
| |
Behavior at the ML Estimate [Beta] | |
| |
| |
| |
Behavior of the ML Estimate [Beta] | |
| |
| |
| |
Central Limit Theorem | |
| |
| |
| |
Large Deviations | |
| |
| |
| |
Maximum Entropy and Minimax Codelength | |
| |
| |
| |
Exponential Families and Maximum Entropy | |
| |
| |
| |
Exponential Families and Minimax Codelength | |
| |
| |
| |
The Compression Game | |
| |
| |
| |
Likelihood Ratio Families and Renyi Divergences | |
| |
| |
| |
The Likelihood Ratio Family | |
| |
| |
| |
Summary | |
| |
| |
References | |
| |
| |
List of Symbols | |
| |
| |
Subject Index | |