IJCNN 2000 - Techniques for Combining Hidden Markov Models and 
Neural Networks for Speech Recognition: A Tutorial

 

Speaker: Edmondo Trentin (trentin@fbk.eu)

Abstract
Hidden Markov models (HMMs) represent the state-of-the-art approach to Automatic Speech Recognition (ASR). HMMs are effective in laboratory tests, but their applicability in real-world environments is often constrained by intrinsic limitations of the models, e.g. non-discriminative training, a priori assumptions on their underlying statistical properties, the requirement of a pre-defined feature space, etc. In this respect, Artificial Neural Networks (ANNs) are a promising alternative. Applied to ASR over the course of a decade, ANNs yielded interesting performance on reduced-scale tasks, but they substantially failed in dealing with long time-sequences of speech signals, due to the difficulty of modeling long-term time dependencies with "conventional" ANNs. To overcome such problems, hybrid architectures were proposed that combine HMMs and ANNs within unifying frameworks and exploit the advantages of both. Radically different techniques were introduced, according to the specific role that the ANN plays within the hybrid architecture. This tutorial reviews some basic concepts of ASR, HMMs and conventional ANNs for ASR. Major HMM/ANN models for ASR are then surveyed, discussing several architectures, training algorithms and experimental results from the literature and from our experience. The main classes of combined HMM/ANN systems include: (1) connectionist emulation of HMMs; (2) connectionist probability estimation for HMMs; (3) ANNs as acoustic front-ends for HMMs; (4) connectionist feature extraction with joint HMM/ANN optimization; (5) vector quantization for discrete HMMs via ANNs; (6) ANNs for "rescoring" the N-best HMM hypotheses.

Outline of the Tutorial Technical Content

1. Introduction and overview

2. The ASR problem:
   2.1 Qualitative definition of the problem
   2.2 Application-oriented examples and open questions (review of
       basic concepts like speaker (in)dependence, continuous speech
       vs. isolated words, vocabulary size, noise tolerance, etc.)
   2.3 Formal definition as a classification
       problem in terms of Bayes' decision theory
   2.4 Feature extraction
       2.4.1 Example: Mel Frequency Scaled Cepstral Coefficients
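
As a pointer for item 2.3, the Bayes decision rule for recognition is usually written as below. This is a standard textbook form, not taken from the tutorial material itself; the symbols W (a candidate word sequence) and X (the sequence of acoustic feature vectors) are notation assumed here for illustration.

```latex
\[
\begin{aligned}
\hat{W} &= \arg\max_{W} P(W \mid X) \\
        &= \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)} \\
        &= \arg\max_{W} p(X \mid W)\, P(W)
\end{aligned}
\]
```

The last step holds because p(X) does not depend on W. In this reading, p(X | W) is the acoustic model (the HMM of section 3) and P(W) is the prior over word sequences.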

3. Acoustic modeling via HMM:
   3.1 Informal introduction to HMMs
   3.2 Formal definition of HMM (states, transitions, emission
       and initial probabilities, etc.)
   3.3 Discrete vs. Continuous-density HMMs
   3.4 HMMs: the "training" and "decoding" problems
       (solution to these problems based on the Baum-Welch and on
       the Viterbi algorithms is summarized, relying on the "Trellis"
       structure)
   3.5 Intrinsic limitations of HMMs (non-discriminative
       training/decoding, maximum likelihood criterion, fixed form of the
       emission probability densities, stochastic independence among acoustic
       frames, markovian assumption on the stochastic process involved,
       requirement of a pre-defined feature space, etc.).
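
For readers unfamiliar with the "decoding" problem of item 3.4, the following is a minimal sketch of Viterbi decoding over the trellis for a discrete HMM, working in the log domain. It is an illustration only, not code from the tutorial: the function names (`safe_log`, `viterbi`), the argument layout and the toy two-state left-to-right model are all assumptions made here.

```python
import math

def safe_log(p):
    """Log that maps probability 0 to -inf instead of raising an error."""
    return math.log(p) if p > 0.0 else float("-inf")

def viterbi(obs, log_init, log_trans, log_emit):
    """Return (best_state_path, best_log_prob) for a discrete HMM.

    obs       : list of observation symbol indices, one per frame
    log_init  : log_init[i]     = log P(first state = i)
    log_trans : log_trans[i][j] = log P(next state = j | current state = i)
    log_emit  : log_emit[i][k]  = log P(symbol k | state i)
    """
    n_states = len(log_init)
    # delta[j]: best log score of any path ending in state j at the current frame
    delta = [log_init[j] + log_emit[j][obs[0]] for j in range(n_states)]
    backptr = []  # one column of the trellis per frame after the first
    for symbol in obs[1:]:
        new_delta, pointers = [], []
        for j in range(n_states):
            # Best predecessor of state j at this frame
            best_i = max(range(n_states), key=lambda i: delta[i] + log_trans[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] + log_trans[best_i][j] + log_emit[j][symbol])
        delta = new_delta
        backptr.append(pointers)
    # Backtrack from the best final state
    state = max(range(n_states), key=lambda j: delta[j])
    best_log_prob = delta[state]
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_log_prob

# Toy two-state left-to-right model (hypothetical numbers)
init = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 0, 1, 1]

path, log_prob = viterbi(
    obs,
    [safe_log(p) for p in init],
    [[safe_log(p) for p in row] for row in trans],
    [[safe_log(p) for p in row] for row in emit],
)
print(path, math.exp(log_prob))  # best path is [0, 0, 1, 1]
```

Working with log probabilities avoids numerical underflow on long observation sequences, which is why sums replace the products of the textbook recursion.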

4. Brief review of "conventional" ANNs for ASR
   4.1 ANNs as labeled graphs
   4.2 Learning as optimization of a criterion; generalization
   4.3 Summary of major connectionist architectures for ASR
   4.4 The problem of dealing with long-term time dependencies
       in conventional ANNs

5. Combining HMMs and ANNs
   5.1 Motivations and basic ideas
   5.2 Classes of HMM/ANN hybrid systems for ASR:
       5.2.1 ANNs that emulate HMMs (in a historical perspective, we
                 start by reviewing the Viterbi Net and the Alpha Net, two
                 recurrent architectures that attempted to emulate simple
                 left-to-right HMMs for isolated word recognition).
       5.2.2 connectionist probability estimation for HMMs (basically
                Bourlard and Morgan's approach, where MLPs are used to
                estimate the posterior probability of HMM states instead
                of the usual Gaussian emission probabilities; variants on
                this approach are also discussed).
       5.2.3 ANNs as acoustic front-ends for HMMs (in speaker
                 normalization and channel compensation, ANNs are trained
                 to perform a transformation of the feature vectors to be
                 fed into the HMM; particular attention is paid to a spectral
                 mapping approach based on a mixture of recurrent ANNs).
       5.2.4 connectionist feature extraction with joint HMM/ANN
                 optimization (basically Y. Bengio's approach, where the
                 ANN is used as a feature extractor for an HMM, but both
                 models are jointly trained on a global optimization criterion;
                 a possible extension of this important, novel algorithm to
                 Bourlard's model is introduced, too).
       5.2.5 vector quantization for discrete HMMs via ANNs
                (unsupervised, e.g. competitive, ANNs are used to discretize
                the acoustic space in order to obtain a finite codebook
                of prototypes for discrete HMMs).
       5.2.6 Other approaches (other, non-homogeneous architectures
                 are briefly reviewed, in particular ANNs for "rescoring"
                 the N-best hypotheses yielded by a standard HMM)
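
To make item 5.2.2 more concrete: in the posterior-estimation approach, a common way to plug network outputs into an HMM decoder is to divide each state posterior by the state's prior, which by Bayes' rule gives the emission likelihood up to a factor that is the same for all states. The sketch below illustrates only that conversion; the function name `scaled_log_likelihoods`, the `floor` parameter and the example numbers are assumptions made here, not part of the tutorial material.

```python
import math

def scaled_log_likelihoods(posteriors, priors, floor=1e-8):
    """Convert per-state posteriors P(q | x) into scaled log-likelihoods.

    By Bayes' rule, p(x | q) / p(x) = P(q | x) / P(q), so the result equals
    log p(x | q) minus a term that does not depend on the state q.
    Probabilities are floored to avoid taking log(0).
    """
    return [
        math.log(max(post, floor)) - math.log(max(prior, floor))
        for post, prior in zip(posteriors, priors)
    ]

# Hypothetical MLP outputs for one acoustic frame over three HMM states,
# and state priors estimated from relative frequencies in the training labels.
frame_posteriors = [0.7, 0.2, 0.1]
state_priors = [0.5, 0.25, 0.25]

scores = scaled_log_likelihoods(frame_posteriors, state_priors)
print(scores)
```

These scores can stand in for the log emission probabilities in Viterbi decoding, since a per-frame constant shared by all states does not change which path is best.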

6. Conclusions
   6.1 Summary of the tutorial
   6.2 Emphasis on major topics
   6.3 Some guidelines for future research
   6.4 Conclusions

Schedule
This Tutorial was selected as part of the Preliminary Technical Program of IJCNN 2000, to be held in Como (Italy) from 24 to 27 July 2000. It is scheduled as Tutorial #5 on Saturday afternoon, 22 July 2000, and will last about four hours, including breaks and questions. Please refer to the IJCNN 2000 Official Site (follow the "Technical program" link on the left side of that page) for up-to-date details and news.

Registration
To attend the Tutorial you first need a regular registration for the Conference. In addition, a separate registration is required for each Tutorial. You can register for any tutorial you like by using the conference registration form. The Tutorial will be held only if a minimum number of registered attendees is reached by 1 July 2000. Please refer to the IJCNN 2000 Registration page for detailed information, deadlines, fees and student grants, as well as for the registration itself (the same page also offers tourist information and hotel reservation forms).



(Last updated: Mar 28, 2000)