Articulatory features describe properties of speech production, i.e. how the basic sounds we make are produced. Phonemes, on the other hand, are the smallest units of sound used to form meaningful speech. In Finnish, nearly every phoneme corresponds to a single letter, whereas in English the mapping is far less regular. However, phonemes are used to model pronunciation, and they are therefore cross- and multilingual.
The authors ran experiments on their model using the TIMIT corpus, which contains speech from American English speakers of different sexes and dialects. The corpus also contains the correct phoneme labels for the speech. The following methods for phoneme recognition were applied:
- Independent MLP (multilayer perceptron)
- Multitask MLP
- Phoneme MLP
The independent MLP is a standard method, whereas the multitask MLP and the phoneme MLP are novel methods presented in the paper. In each method, articulatory features were learned from the audio, and an MLP network was trained to predict the phonemes. In the independent MLP, the classifiers for the individual features are independent of each other. However, since the features are in fact interrelated, the authors considered multitask learning to be needed. The prediction accuracies (speech to phoneme) for the independent, multitask and phoneme MLPs were 67.4%, 68.9% and 70.2%, respectively.
Additionally, a hierarchical version was presented for each method. The hierarchical versions performed better than their original counterparts, and the order of performance stayed the same.
Rasipuram said the work would be continued with:
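To make the contrast between independent and multitask classifiers concrete, here is a minimal sketch of the forward pass only. It assumes the usual reading of multitask learning, where the feature classifiers share one hidden layer and each has its own output layer. The layer sizes, the feature names (`manner`, `place`, `voicing`) and all function names are illustrative assumptions, not taken from the paper, and the phoneme MLP is not modelled here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and feature groups, chosen only for illustration.
N_INPUT = 39     # acoustic feature vector per frame
N_HIDDEN = 64    # hidden units
FEATURE_CLASSES = {"manner": 6, "place": 8, "voicing": 3}


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def init_layer(n_in, n_out):
    """Small random weights and zero biases for one dense layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


def make_independent_params():
    """One complete MLP per articulatory feature: nothing is shared."""
    return {
        name: (init_layer(N_INPUT, N_HIDDEN), init_layer(N_HIDDEN, k))
        for name, k in FEATURE_CLASSES.items()
    }


def independent_forward(params, x):
    out = {}
    for name, ((w1, b1), (w2, b2)) in params.items():
        h = np.tanh(x @ w1 + b1)          # hidden layer private to this feature
        out[name] = softmax(h @ w2 + b2)
    return out


def make_multitask_params():
    """One shared hidden layer plus one output layer (head) per feature."""
    return {
        "shared": init_layer(N_INPUT, N_HIDDEN),
        "heads": {name: init_layer(N_HIDDEN, k) for name, k in FEATURE_CLASSES.items()},
    }


def multitask_forward(params, x):
    w1, b1 = params["shared"]
    h = np.tanh(x @ w1 + b1)              # representation shared by all features
    return {name: softmax(h @ w + b) for name, (w, b) in params["heads"].items()}


# Five random "frames" stand in for real acoustic input.
frames = rng.normal(size=(5, N_INPUT))
indep_out = independent_forward(make_independent_params(), frames)
multi_out = multitask_forward(make_multitask_params(), frames)
```

Both functions return one posterior distribution per feature and frame. The difference is that in the multitask variant an error on any feature would update the shared hidden layer during training, which is how interrelated features can inform each other.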
- Automatic speech recognition studies
- Different importance weights for features
- Adding gender and rate of speech as features
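One possible reading of the importance-weights item is a weighted sum of per-feature losses in the multitask objective. The sketch below assumes that reading. The function names, feature names and weights are hypothetical and were not shown in the talk.

```python
import numpy as np


def cross_entropy(probs, targets):
    """Mean negative log-likelihood of integer targets under row-wise probabilities."""
    rows = np.arange(len(targets))
    return float(-np.log(probs[rows, targets]).mean())


def weighted_multitask_loss(outputs, targets, weights):
    """Sum of per-feature cross-entropies, each scaled by its importance weight."""
    return sum(
        weights[name] * cross_entropy(outputs[name], targets[name])
        for name in outputs
    )


# Toy example: uniform predictions over 2 and 4 classes for three frames.
outputs = {
    "voicing": np.full((3, 2), 0.5),
    "manner": np.full((3, 4), 0.25),
}
targets = {"voicing": np.array([0, 1, 0]), "manner": np.array([2, 0, 3])}
weights = {"voicing": 1.0, "manner": 0.5}   # hypothetical importance weights

loss = weighted_multitask_loss(outputs, targets, weights)
# ln 2 + 0.5 * ln 4 = 2 * ln 2, roughly 1.386
```

With all weights set to 1.0 this reduces to an unweighted multitask loss, so the weights only change how strongly each feature influences training.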
The talk drew some criticism: one researcher in the audience stated that better performance than this had already been achieved years ago. The speaker did not really address this.