Reducing the computational complexity for whole word models
2017 IEEE Automatic Speech Recognition and Understanding Workshop …, 2017 (ieeexplore.ieee.org)
In a previous study, we demonstrated the feasibility of building a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. In that system, we model about 100,000 words directly using deep bi-directional LSTM RNNs. To alleviate the data sparsity problem for word models, we train the model on 125,000 hours of semi-supervised acoustic training data. The resulting model works very well as an end-to-end all-neural speech recognition model without the use of any language model, removing the need to decode. However, the very large output layer increases the computational cost substantially. In this work we address this issue by adding TDNN (Time Delay Neural Network) layers that reduce the frame rate to 120 ms for the output layer. The TDNN layers are interspersed with the LSTM layers, gradually reducing the frame rate from 10 ms to 120 ms. The new model reduces the computational cost by 60% while improving the word error rate by 6% relative. Compared to a traditional LVCSR system, the whole word speech recognizer uses about the same CPU cycles and can easily be parallelized across CPU cores or run on GPUs.
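To make the architecture concrete, the following PyTorch-style code is an illustrative sketch, not the authors' implementation: the layer widths, number of blocks, stride schedule (2 × 2 × 3 = 12, taking 10 ms frames to 120 ms), activation, and the WholeWordAcousticModel class name are assumptions made for the example. Only the interleaving of bidirectional LSTMs with strided TDNN-style layers and the roughly 100,000-word output layer follow the abstract.

```python
# Illustrative sketch (assumed configuration, not the paper's exact one):
# bidirectional LSTM layers interspersed with strided TDNN-style 1-D
# convolutions that reduce the frame rate from 10 ms to 120 ms, so the large
# whole-word softmax is evaluated 12x less often.
import torch
import torch.nn as nn


class WholeWordAcousticModel(nn.Module):
    def __init__(self, num_words=100_000, feat_dim=80, hidden=512):
        super().__init__()
        # Frame-rate reduction factors; 2 * 2 * 3 = 12 (10 ms -> 120 ms).
        strides = (2, 2, 3)
        blocks = []
        in_dim = feat_dim
        for s in strides:
            blocks.append(nn.ModuleDict({
                # Bidirectional LSTM operating at the current frame rate.
                "lstm": nn.LSTM(in_dim, hidden, batch_first=True,
                                bidirectional=True),
                # TDNN-style strided convolution: subsamples frames in time.
                "tdnn": nn.Conv1d(2 * hidden, hidden, kernel_size=s, stride=s),
            }))
            in_dim = hidden
        self.blocks = nn.ModuleList(blocks)
        # Final LSTM at the reduced (120 ms) frame rate.
        self.final_lstm = nn.LSTM(hidden, hidden, batch_first=True,
                                  bidirectional=True)
        # Very large output layer over whole-word units (~100k words);
        # running it every 120 ms instead of every 10 ms is the main saving.
        self.output = nn.Linear(2 * hidden, num_words)

    def forward(self, feats):            # feats: (batch, frames, feat_dim)
        x = feats
        for block in self.blocks:
            x, _ = block["lstm"](x)      # (batch, frames, 2*hidden)
            x = x.transpose(1, 2)        # Conv1d expects (batch, channels, frames)
            x = torch.relu(block["tdnn"](x))
            x = x.transpose(1, 2)        # back to (batch, frames/stride, hidden)
        x, _ = self.final_lstm(x)
        return self.output(x)            # (batch, frames/12, num_words)


# Example: 3 seconds of 10 ms frames -> 300 input frames -> 25 output frames.
model = WholeWordAcousticModel()
logits = model(torch.randn(1, 300, 80))
print(logits.shape)                      # torch.Size([1, 25, 100000])
```

Under these assumed strides, the 100,000-way output layer runs on one twelfth of the original frames, which is where most of the claimed reduction in computational cost would come from.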