Contents

Series Foreword
Preface
The Problem

    Introduction
        Reinforcement Learning
        Examples
        Elements of Reinforcement Learning
        An Extended Example: Tic-Tac-Toe
        Summary
        History of Reinforcement Learning
        Bibliographical Remarks
    Evaluative Feedback
        An n-Armed Bandit Problem
        Action-Value Methods
        Softmax Action Selection
        Evaluation Versus Instruction
        Incremental Implementation
        Tracking a Nonstationary Problem
        Optimistic Initial Values
        Reinforcement Comparison
        Pursuit Methods
        Associative Search
        Conclusions
        Bibliographical and Historical Remarks
    The Reinforcement Learning Problem
        The Agent-Environment Interface
        Goals and Rewards
        Returns
        Unified Notation for Episodic and Continuing Tasks
        The Markov Property
        Markov Decision Processes
        Value Functions
        Optimal Value Functions
        Optimality and Approximation
        Summary
        Bibliographical and Historical Remarks
Elementary Solution Methods

    Dynamic Programming
        Policy Evaluation
        Policy Improvement
        Policy Iteration
        Value Iteration
        Asynchronous Dynamic Programming
        Generalized Policy Iteration
        Efficiency of Dynamic Programming
        Summary
        Bibliographical and Historical Remarks
    Monte Carlo Methods
        Monte Carlo Policy Evaluation
        Monte Carlo Estimation of Action Values
        Monte Carlo Control
        On-Policy Monte Carlo Control
        Evaluating One Policy While Following Another
        Off-Policy Monte Carlo Control
        Incremental Implementation
        Summary
        Bibliographical and Historical Remarks
    Temporal-Difference Learning
        TD Prediction
        Advantages of TD Prediction Methods
        Optimality of TD(0)
        Sarsa: On-Policy TD Control
        Q-Learning: Off-Policy TD Control
        Actor-Critic Methods
        R-Learning for Undiscounted Continuing Tasks
        Games, Afterstates, and Other Special Cases
        Summary
        Bibliographical and Historical Remarks
A Unified View

    Eligibility Traces
        n-Step TD Prediction
        The Forward View of TD(λ)
        The Backward View of TD(λ)
        Equivalence of Forward and Backward Views
        Sarsa(λ)
        Q(λ)
        Eligibility Traces for Actor-Critic Methods
        Replacing Traces
        Implementation Issues
        Variable λ
        Conclusions
        Bibliographical and Historical Remarks
    Generalization and Function Approximation
        Value Prediction with Function Approximation
        Gradient-Descent Methods
        Linear Methods
        Control with Function Approximation
        Off-Policy Bootstrapping
        Should We Bootstrap?
        Summary
        Bibliographical and Historical Remarks
    Planning and Learning
        Models and Planning
        Integrating Planning, Acting, and Learning
        When the Model Is Wrong
        Prioritized Sweeping
        Full vs. Sample Backups
        Trajectory Sampling
        Heuristic Search
        Summary
        Bibliographical and Historical Remarks
    Dimensions of Reinforcement Learning
        The Unified View
        Other Frontier Dimensions
    Case Studies
        TD-Gammon
        Samuel's Checkers Player
        The Acrobot
        Elevator Dispatching
        Dynamic Channel Allocation
        Job-Shop Scheduling
References
Summary of Notation
Index