| |

| |

List of Figures | |

| |

| |

Series Foreword | |

| |

| |

Foreword | |

| |

| |

Preface | |

| |

| |

| |

Introductory Material | |

| |

| |

| |

Learning, Regularity, and Compression | |

| |

| |

| |

Regularity and Learning | |

| |

| |

| |

Regularity and Compression | |

| |

| |

| |

Solomonoff's Breakthrough - Kolmogorov Complexity | |

| |

| |

| |

Making the Idea Applicable | |

| |

| |

| |

Crude MDL, Refined MDL and Universal Coding | |

| |

| |

| |

From Crude to Refined MDL | |

| |

| |

| |

Universal Coding and Refined MDL | |

| |

| |

| |

Refined MDL for Model Selection | |

| |

| |

| |

Refined MDL for Prediction and Hypothesis Selection | |

| |

| |

| |

Some Remarks on Model Selection | |

| |

| |

| |

Model Selection among Non-Nested Models | |

| |

| |

| |

Goals of Model vs. Point Hypothesis Selection | |

| |

| |

| |

The MDL Philosophy | |

| |

| |

| |

MDL, Occam's Razor, and the "True Model" | |

| |

| |

| |

Answer to Criticism No. 1 | |

| |

| |

| |

Answer to Criticism No. 2 | |

| |

| |

| |

History and Forms of MDL | |

| |

| |

| |

What Is MDL? | |

| |

| |

| |

MDL Literature | |

| |

| |

| |

Summary and Outlook | |

| |

| |

| |

Probabilistic and Statistical Preliminaries | |

| |

| |

| |

General Mathematical Preliminaries | |

| |

| |

| |

Probabilistic Preliminaries | |

| |

| |

| |

Definitions; Notational Conventions | |

| |

| |

| |

Probabilistic Sources | |

| |

| |

| |

Limit Theorems and Statements | |

| |

| |

| |

Probabilistic Models | |

| |

| |

| |

Probabilistic Model Classes | |

| |

| |

| |

Kinds of Probabilistic Models | |

| |

| |

| |

Terminological Preliminaries | |

| |

| |

| |

Modeling Preliminaries: Goals and Methods for Inductive Inference | |

| |

| |

| |

Consistency | |

| |

| |

| |

Basic Concepts of Bayesian Statistics | |

| |

| |

| |

Summary and Outlook | |

| |

| |

| |

Information-Theoretic Preliminaries | |

| |

| |

| |

Coding Preliminaries | |

| |

| |

| |

Restriction to Prefix Coding Systems; Descriptions as Messages | |

| |

| |

| |

Different Kinds of Codes | |

| |

| |

| |

Assessing the Efficiency of Description Methods | |

| |

| |

| |

The Most Important Section of This Book: Probabilities and Code Lengths | |

| |

| |

| |

The Kraft Inequality | |

| |

| |

| |

Code Lengths "Are" Probabilities | |

| |

| |

| |

Immediate Insights and Consequences | |

| |

| |

| |

Probabilities and Code Lengths, Part II | |

| |

| |

| |

(Relative) Entropy and the Information Inequality | |

| |

| |

| |

Uniform Codes, Maximum Entropy, and Minimax Codelength | |

| |

| |

| |

Summary, Outlook, Further Reading | |

| |

| |

| |

Information-Theoretic Properties of Statistical Models | |

| |

| |

| |

Introduction | |

| |

| |

| |

Likelihood and Observed Fisher Information | |

| |

| |

| |

KL Divergence and Expected Fisher Information | |

| |

| |

| |

Maximum Likelihood: Data vs. Parameters | |

| |

| |

| |

Summary and Outlook | |

| |

| |

| |

Crude Two-Part Code MDL | |

| |

| |

| |

Introduction: Making Two-Part MDL Precise | |

| |

| |

| |

Two-Part Code MDL for Markov Chain Selection | |

| |

| |

| |

The Code C[subscript 2] | |

| |

| |

| |

The Code C[subscript 1] | |

| |

| |

| |

Crude Two-Part Code MDL for Markov Chains | |

| |

| |

| |

Simplistic Two-Part Code MDL Hypothesis Selection | |

| |

| |

| |

Two-Part MDL for Tasks Other Than Hypothesis Selection | |

| |

| |

| |

Behavior of Two-Part Code MDL | |

| |

| |

| |

Two-Part Code MDL and Maximum Likelihood | |

| |

| |

| |

The Maximum Likelihood Principle | |

| |

| |

| |

MDL vs. ML | |

| |

| |

| |

MDL as a Maximum Probability Principle | |

| |

| |

| |

Computing and Approximating Two-Part MDL in Practice | |

| |

| |

| |

Justifying Crude MDL: Consistency and Code Design | |

| |

| |

| |

A General Consistency Result | |

| |

| |

| |

Code Design for Two-Part Code MDL | |

| |

| |

| |

Summary and Outlook | |

| |

| |

| |

Appendix: Proof of Theorem 5.1 | |

| |

| |

| |

Universal Coding | |

| |

| |

| |

Universal Coding with Countable Models | |

| |

| |

| |

Universal Coding: The Basic Idea | |

| |

| |

| |

Two-part Codes as Simple Universal Codes | |

| |

| |

| |

From Universal Codes to Universal Models | |

| |

| |

| |

Formal Definition of Universality | |

| |

| |

| |

The Finite Case | |

| |

| |

| |

Minimax Regret and Normalized ML | |

| |

| |

| |

NML vs. Two-Part vs. Bayes | |

| |

| |

| |

The Countably Infinite Case | |

| |

| |

| |

The Two-Part and Bayesian Codes | |

| |

| |

| |

The NML Code | |

| |

| |

| |

Prequential Universal Models | |

| |

| |

| |

Distributions as Prediction Strategies | |

| |

| |

| |

Bayes Is Prequential; NML and Two-part Are Not | |

| |

| |

| |

The Prequential Plug-In Model | |

| |

| |

| |

Individual vs. Stochastic Universality | |

| |

| |

| |

Stochastic Redundancy | |

| |

| |

| |

Uniformly Universal Models | |

| |

| |

| |

Summary, Outlook and Further Reading | |

| |

| |

| |

Parametric Models: Normalized Maximum Likelihood | |

| |

| |

| |

Introduction | |

| |

| |

| |

Preliminaries | |

| |

| |

| |

Asymptotic Expansion of Parametric Complexity | |

| |

| |

| |

The Meaning of [Characters not reproducible] | |

| |

| |

| |

Complexity and Functional Form | |

| |

| |

| |

KL Divergence and Distinguishability | |

| |

| |

| |

Complexity and Volume | |

| |

| |

| |

Complexity and the Number of Distinguishable Distributions | |

| |

| |

| |

Explicit and Simplified Computations | |

| |

| |

| |

Parametric Models: Bayes | |

| |

| |

| |

The Bayesian Regret | |

| |

| |

| |

Basic Interpretation of Theorem 8.1 | |

| |

| |

| |

Bayes Meets Minimax - Jeffreys' Prior | |

| |

| |

| |

Jeffreys' Prior and the Boundary | |

| |

| |

| |

How to Prove the Bayesian and NML Regret Theorems | |

| |

| |

| |

Proof Sketch of Theorem 8.1 | |

| |

| |

| |

Beyond Exponential Families | |

| |

| |

| |

Proof Sketch of Theorem 7.1 | |

| |

| |

| |

Stochastic Universality | |

| |

| |

| |

Appendix: Proofs of Theorem 8.1 and Theorem 8.2 | |

| |

| |

| |

Parametric Models: Prequential Plug-in | |

| |

| |

| |

Prequential Plug-in for Exponential Families | |

| |

| |

| |

The Plug-in vs. the Bayes Universal Model | |

| |

| |

| |

More Precise Asymptotics | |

| |

| |

| |

Summary | |

| |

| |

| |

Parametric Models: Two-Part | |

| |

| |

| |

The Ordinary Two-Part Universal Model | |

| |

| |

| |

Derivation of the Two-Part Code Regret | |

| |

| |

| |

Proof Sketch of Theorem 10.1 | |

| |

| |

| |

Discussion | |

| |

| |

| |

The Conditional Two-Part Universal Code | |

| |

| |

| |

Conditional Two-Part Codes for Discrete Exponential Families | |

| |

| |

| |

Distinguishability and the Phase Transition | |

| |

| |

| |

Summary and Outlook | |

| |

| |

| |

NML With Infinite Complexity | |

| |

| |

| |

Introduction | |

| |

| |

| |

Examples of Undefined NML Distribution | |

| |

| |

| |

Examples of Undefined Jeffreys' Prior | |

| |

| |

| |

Metauniversal Codes | |

| |

| |

| |

Constrained Parametric Complexity | |

| |

| |

| |

Meta-Two-Part Coding | |

| |

| |

| |

Renormalized Maximum Likelihood | |

| |

| |

| |

NML with Luckiness | |

| |

| |

| |

Asymptotic Expansion of LNML | |

| |

| |

| |

Conditional Universal Models | |

| |

| |

| |

Bayesian Approach with Jeffreys' Prior | |

| |

| |

| |

Conditional NML | |

| |

| |

| |

Liang and Barron's Approach | |

| |

| |

| |

Summary and Remarks | |

| |

| |

| |

Appendix: Proof of Theorem 11.4 | |

| |

| |

| |

Linear Regression | |

| |

| |

| |

Introduction | |

| |

| |

| |

Prelude: The Normal Location Family | |

| |

| |

| |

Least-Squares Estimation | |

| |

| |

| |

The Normal Equations | |

| |

| |

| |

Composition of Experiments | |

| |

| |

| |

Penalized Least-Squares | |

| |

| |

| |

The Linear Model | |

| |

| |

| |

Bayesian Linea Model M[superscript X] with Gaussian Prior | |

| |

| |

| |

Bayesian Linear Models M[superscript X] and S[superscript X] with Noninformative Priors | |

| |

| |

| |

Universal Models for Linear Regression | |

| |

| |

| |

NML | |

| |

| |

| |

Bayes and LNML | |

| |

| |

| |

Bayes-Jeffreys and CNML | |

| |

| |

| |

Beyond Parametrics | |

| |

| |

| |

Introduction | |

| |

| |

| |

CUP: Unions of Parametric Models | |

| |

| |

| |

CUP vs. Parametric Models | |

| |

| |

| |

Universal Codes Based on Histograms | |

| |

| |

| |

Redundancy of Universal CUP Histogram Codes | |

| |

| |

| |

Nonparametric Redundancy | |

| |

| |

| |

Standard CUP Universal Codes | |

| |

| |

| |

Minimax Nonparametric Redundancy | |

| |

| |

| |

Gaussian Process Regression | |

| |

| |

| |

Kernelization of Bayesian Linear Regression | |

| |

| |

| |

Gaussian Processes | |

| |

| |

| |

Gaussian Processes as Universal Models | |

| |

| |

| |

Conclusion and Further Reading | |

| |

| |

| |

Refined MDL | |

| |

| |

| |

MDL Model Selection | |

| |

| |

| |

Introduction | |

| |

| |

| |

Simple Refined MDL Model Selection | |

| |

| |

| |

Compression Interpretation | |

| |

| |

| |

Counting Interpretation | |

| |

| |

| |

Bayesian Interpretation | |

| |

| |

| |

Prequential Interpretation | |

| |

| |

| |

General Parametric Model Selection | |

| |

| |

| |

Models with Infinite Complexities | |

| |

| |

| |

Comparing Many or Infinitely Many Models | |

| |

| |

| |

The General Picture | |

| |

| |

| |

Practical Issues in MDL Model Selection | |

| |

| |

| |

Calculating Universal Codelengths | |

| |

| |

| |

Computational Efficiency and Practical Quality of Non-NML Universal Codes | |

| |

| |

| |

Model Selection with Conditional NML and Plug-in Codes | |

| |

| |

| |

General Warnings about Model Selection | |

| |

| |

| |

MDL Model Selection for Linear Regression | |

| |

| |

| |

Rissanen's RNML Approach | |

| |

| |

| |

Hansen and Yu's gMDL Approach | |

| |

| |

| |

Liang and Barron's Approach | |

| |

| |

| |

Discussion | |

| |

| |

| |

Worst Case vs. Average Case | |

| |

| |

| |

MDL Prediction and Estimation | |

| |

| |

| |

Introduction | |

| |

| |

| |

MDL for Prediction and Predictive Estimation | |

| |

| |

| |

Prequential MDL Estimators | |

| |

| |

| |

Prequential MDL Estimators Are Consistent | |

| |

| |

| |

Parametric and Nonparametric Examples | |

| |

| |

| |

Cesaro KL consistency vs. KL consistency | |

| |

| |

| |

Two-Part Code MDL for Point Hypothesis Selection | |

| |

| |

| |

Discussion of Two-Part Consistency Theorem | |

| |

| |

| |

MDL Parameter Estimation | |

| |

| |

| |

MDL Estimators vs. Luckiness ML Estimators | |

| |

| |

| |

What Estimator To Use? | |

| |

| |

| |

Comparison to Bayesian Estimators | |

| |

| |

| |

Summary and Outlook | |

| |

| |

| |

Appendix: Proof of Theorem 15.3 | |

| |

| |

| |

MDL Consistency and Convergence | |

| |

| |

| |

Introduction | |

| |

| |

| |

The Scenarios Considered | |

| |

| |

| |

Consistency: Prequential and Two-Part MDL Estimators | |

| |

| |

| |

Consistency: MDL Model Selection | |

| |

| |

| |

Selection between a Union of Parametric Models | |

| |

| |

| |

Nonparametric Model Selection Based on CUP Model Class | |

| |

| |

| |

MDL Consistency Peculiarities | |

| |

| |

| |

Risks and Rates | |

| |

| |

| |

Relations between Divergences and Risk Measures | |

| |

| |

| |

Minimax Rates | |

| |

| |

| |

MDL Rates of Convergence | |

| |

| |

| |

Prequential and Two-Part MDL Estimators | |

| |

| |

| |

MDL Model Selection | |

| |

| |

| |

MDL in Context | |

| |

| |

| |

MDL and Frequentist Paradigms | |

| |

| |

| |

Sanity Check or Design Principle? | |

| |

| |

| |

The Weak Prequential Principle | |

| |

| |

| |

MDL vs. Frequentist Principles: Remaining Issues | |

| |

| |

| |

MDL and Bayesian Inference | |

| |

| |

| |

Luckiness Functions vs. Prior Distributions | |

| |

| |

| |

MDL, Bayes, and Occam | |

| |

| |

| |

MDL and Brands of Bayesian Statistics | |

| |

| |

| |

Conclusion: a Common Future after All? | |

| |

| |

| |

MDL, AIC and BIC | |

| |

| |

| |

BIC | |

| |

| |

| |

AIC | |

| |

| |

| |

Combining the Best of AIC and BIC | |

| |

| |

| |

MDL and MML | |

| |

| |

| |

Strict Minimum Message Length | |

| |

| |

| |

Comparison to MDL | |

| |

| |

| |

The Wallace-Freeman Estimator | |

| |

| |

| |

MDL and Prequential Analysis | |

| |

| |

| |

MDL and Cross-Validation | |

| |

| |

| |

MDL and Maximum Entropy | |

| |

| |

| |

Kolmogorov Complexity and Structure Function | |

| |

| |

| |

MDL and Individual Sequence Prediction | |

| |

| |

| |

MDL and Statistical Learning Theory | |

| |

| |

| |

Structural Risk Minimization | |

| |

| |

| |

PAC-Bayesian Approaches | |

| |

| |

| |

PAC-Bayesand MDL | |

| |

| |

| |

The Road Ahead | |

| |

| |

| |

Additional Background | |

| |

| |

| |

The Exponential or "Maximum Entropy" Families | |

| |

| |

| |

Introduction | |

| |

| |

| |

Definition and Overview | |

| |

| |

| |

Basic Properties | |

| |

| |

| |

Mean-Value, Canonical, and Other Parameterizations | |

| |

| |

| |

The Mean Value Parameterization | |

| |

| |

| |

Other Parameterizations | |

| |

| |

| |

Relating Mean-Value and Canonical Parameters | |

| |

| |

| |

Exponential Families of General Probabilistic Sources | |

| |

| |

| |

Fisher Information Definitions and Characterizations | |

| |

| |

| |

Information-Theoretic Properties of Exponential Families | |

| |

| |

| |

Introduction | |

| |

| |

| |

Robustness of Exponential Family Codes | |

| |

| |

| |

If [Characters not reproducible][subscript mean] Does Not Contain the Mean | |

| |

| |

| |

Behavior at the ML Estimate [Beta] | |

| |

| |

| |

Behavior of the ML Estimate [Beta] | |

| |

| |

| |

Central Limit Theorem | |

| |

| |

| |

Large Deviations | |

| |

| |

| |

Maximum Entropy and Minimax Codelength | |

| |

| |

| |

Exponential Families and Maximum Entropy | |

| |

| |

| |

Exponential Families and Minimax Codelength | |

| |

| |

| |

The Compression Game | |

| |

| |

| |

Likelihood Ratio Families and Renyi Divergences | |

| |

| |

| |

The Likelihood Ratio Family | |

| |

| |

| |

Summary | |

| |

| |

References | |

| |

| |

List of Symbols | |

| |

| |

Subject Index | |