Preface to the First Edition

Preface to the Second Edition

List of Abbreviations

1  Human Speech Communication
    Value of speech for human-machine communication
    Ideas and language
    Relationship between written and spoken language
    Phonetics and phonology
    The acoustic signal
    Phonemes, phones and allophones
    Vowels, consonants and syllables
    Phonemes and spelling
    Prosodic features
    Language, accent and dialect
    Supplementing the acoustic signal
    The complexity of speech processing
    Chapter 1 summary
    Chapter 1 exercises

2  Mechanisms and Models of Human Speech Production
    Introduction
    Sound sources
    The resonant system
    Interaction of laryngeal and vocal tract functions
    Radiation
    Waveforms and spectrograms
    Speech production models
    Excitation models
    Vocal tract models
    Chapter 2 summary
    Chapter 2 exercises

3  Mechanisms and Models of the Human Auditory System
    Introduction
    Physiology of the outer and middle ears
    Structure of the cochlea
    Neural response
    Psychophysical measurements
    Analysis of simple and complex signals
    Models of the auditory system
    Mechanical filtering
    Models of neural transduction
    Higher-level neural processing
    Chapter 3 summary
    Chapter 3 exercises

4  Digital Coding of Speech
    Introduction
    Simple waveform coders
    Pulse code modulation
    Delta modulation
    Analysis/synthesis systems (vocoders)
    Channel vocoders
    Sinusoidal coders
    LPC vocoders
    Formant vocoders
    Efficient parameter coding
    Vocoders based on segmental/phonetic structure
    Intermediate systems
    Sub-band coding
    Linear prediction with simple coding of the residual
    Adaptive predictive coding
    Multipulse LPC
    Code-excited linear prediction
    Evaluating speech coding algorithms
    Subjective speech intelligibility measures
    Subjective speech quality measures
    Objective speech quality measures
    Choosing a coder
    Chapter 4 summary
    Chapter 4 exercises

5  Message Synthesis from Stored Human Speech Components
    Introduction
    Concatenation of whole words
    Simple waveform concatenation
    Concatenation of vocoded words
    Limitations of concatenating word-size units
    Concatenation of sub-word units: general principles
    Choice of sub-word unit
    Recording and selecting data for the units
    Varying durations of concatenative units
    Synthesis by concatenating vocoded sub-word units
    Synthesis by concatenating waveform segments
    Pitch modification
    Timing modification
    Performance of waveform concatenation
    Variants of concatenative waveform synthesis
    Hardware requirements
    Chapter 5 summary
    Chapter 5 exercises

6  Phonetic Synthesis by Rule
    Introduction
    Acoustic-phonetic rules
    Rules for formant synthesizers
    Table-driven phonetic rules
    Simple transition calculation
    Overlapping transitions
    Using the tables to generate utterances
    Optimizing phonetic rules
    Automatic adjustment of phonetic rules
    Rules for different speaker types
    Incorporating intensity rules
    Current capabilities of phonetic synthesis by rule
    Chapter 6 summary
    Chapter 6 exercises

7  Speech Synthesis from Textual or Conceptual Input
    Introduction
    Emulating the human speaking process
    Converting from text to speech
    TTS system architecture
    Overview of tasks required for TTS conversion
    Text analysis
    Text pre-processing
    Morphological analysis
    Phonetic transcription
    Syntactic analysis and prosodic phrasing
    Assignment of lexical stress and pattern of word accents
    Prosody generation
    Timing pattern
    Fundamental frequency contour
    Implementation issues
    Current TTS synthesis capabilities
    Speech synthesis from concept
    Chapter 7 summary
    Chapter 7 exercises

8  Introduction to Automatic Speech Recognition: Template Matching
    Introduction
    General principles of pattern matching
    Distance metrics
    Filter-bank analysis
    Level normalization
    End-point detection for isolated words
    Allowing for timescale variations
    Dynamic programming for time alignment
    Refinements to isolated-word DP matching
    Score pruning
    Allowing for end-point errors
    Dynamic programming for connected words
    Continuous speech recognition
    Syntactic constraints
    Training a whole-word recognizer
    Chapter 8 summary
    Chapter 8 exercises

9  Introduction to Stochastic Modelling
    Feature variability in pattern matching
    Introduction to hidden Markov models
    Probability calculations in hidden Markov models
    The Viterbi algorithm
    Parameter estimation for hidden Markov models
    Forward and backward probabilities
    Parameter re-estimation with forward and backward probabilities
    Viterbi training
    Vector quantization
    Multi-variate continuous distributions
    Use of normal distributions with HMMs
    Probability calculations
    Estimating the parameters of a normal distribution
    Baum-Welch re-estimation
    Viterbi training
    Model initialization
    Gaussian mixtures
    Calculating emission probabilities
    Baum-Welch re-estimation
    Re-estimation using the most likely state sequence
    Initialization of Gaussian mixture distributions
    Tied mixture distributions
    Extension of stochastic models to word sequences
    Implementing probability calculations
    Using the Viterbi algorithm with probabilities in logarithmic form
    Adding probabilities when they are in logarithmic form
    Relationship between DTW and a simple HMM
    State durational characteristics of HMMs
    Chapter 9 summary
    Chapter 9 exercises

10  Introduction to Front-End Analysis for Automatic Speech Recognition
    Introduction
    Pre-emphasis
    Frames and windowing
    Filter banks, Fourier analysis and the mel scale
    Cepstral analysis
    Analysis based on linear prediction
    Dynamic features
    Capturing the perceptually relevant information
    General feature transformations
    Variable-frame-rate analysis
    Chapter 10 summary
    Chapter 10 exercises

11  Practical Techniques for Improving Speech Recognition Performance
    Introduction
    Robustness to environment and channel effects
    Feature-based techniques
    Model-based techniques
    Dealing with unknown or unpredictable noise corruption
    Speaker-independent recognition
    Speaker normalization
    Model adaptation
    Bayesian methods for training and adaptation of HMMs
    Adaptation methods based on linear transforms
    Discriminative training methods
    Maximum mutual information training
    Training criteria based on reducing recognition errors
    Robustness of recognizers to vocabulary variation
    Chapter 11 summary
    Chapter 11 exercises

12  Automatic Speech Recognition for Large Vocabularies
    Introduction
    Historical perspective
    Speech transcription and speech understanding
    Speech transcription
    Challenges posed by large vocabularies
    Acoustic modelling
    Context-dependent phone modelling
    Training issues for context-dependent models
    Parameter tying
    Training procedure
    Methods for clustering model parameters
    Constructing phonetic decision trees
    Extensions beyond triphone modelling
    Language modelling
    N-grams
    Perplexity and evaluating language models
    Data sparsity in language modelling
    Discounting
    Backing off in language modelling
    Interpolation of language models
    Choice of more general distribution for smoothing
    Improving on simple N-grams
    Decoding
    Efficient one-pass Viterbi decoding for large vocabularies
    Multiple-pass Viterbi decoding
    Depth-first decoding
    Evaluating LVCSR performance
    Measuring errors
    Controlling word insertion errors
    Performance evaluations
    Speech understanding
    Measuring and evaluating speech understanding performance
    Chapter 12 summary
    Chapter 12 exercises

13  Neural Networks for Speech Recognition
    Introduction
    The human brain
    Connectionist models
    Properties of ANNs
    ANNs for speech recognition
    Hybrid HMM/ANN methods
    Chapter 13 summary
    Chapter 13 exercises

14  Recognition of Speaker Characteristics
    Characteristics of speakers
    Verification versus identification
    Assessing performance
    Measures of verification performance
    Speaker recognition
    Text dependence
    Methods for text-dependent/text-prompted speaker recognition
    Methods for text-independent speaker recognition
    Acoustic features for speaker recognition
    Evaluations of speaker recognition performance
    Language recognition
    Techniques for language recognition
    Acoustic features for language recognition
    Chapter 14 summary
    Chapter 14 exercises

15  Applications and Performance of Current Technology
    Introduction
    Why use speech technology?
    Speech synthesis technology
    Examples of speech synthesis applications
    Aids for the disabled
    Spoken warning signals, instructions and user feedback
    Education, toys and games
    Telecommunications
    Speech recognition technology
    Characterizing speech recognizers and recognition tasks
    Typical recognition performance for different tasks
    Achieving success with ASR in an application
    Examples of ASR applications
    Command and control
    Education, toys and games
    Dictation
    Data entry and retrieval
    Telecommunications
    Applications of speaker and language recognition
    The future of speech technology applications
    Chapter 15 summary
    Chapter 15 exercises

16  Future Research Directions in Speech Synthesis and Recognition
    Introduction
    Speech synthesis
    Speech sound generation
    Prosody generation and higher-level linguistic processing
    Automatic speech recognition
    Advantages of statistical pattern-matching methods
    Limitations of HMMs for speech recognition
    Developing improved recognition models
    Relationship between synthesis and recognition
    Automatic speech understanding
    Chapter 16 summary
    Chapter 16 exercises

Further Reading
    Books
    Journals
    Conferences and workshops
    The Internet
    Reading for individual chapters

References

Solutions to Exercises

Glossary

Index