This paper develops a speech-recognition system that converts voice audio into text using recent deep learning techniques. The system was trained on the vVISWa dataset, which provides a large number of video and audio samples of spoken words from the fruits, cities, and numbers categories. The videos were converted into WAV audio files and then cleaned by noise removal, silence trimming, normalization, and band-pass filtering to enhance clarity. Mel-Frequency Cepstral Coefficients (MFCCs), which capture the salient characteristics of human speech, were extracted from the cleaned audio. Two models, a Support Vector Machine (SVM) and a Deep Neural Network (DNN), were trained and tested on these features. The SVM achieved an accuracy of 95.12%, while the DNN reached an even higher 98.46%, indicating that it can capture complex speech patterns. The system also featured an Automatic Speech Recognition (ASR) component that converted the audio into text with a low Word Error Rate (WER) of 6.8% and a Character Error Rate (CER) of 3.1%. Overall, the findings indicate that deep learning, and the DNN-based approach in particular, is highly accurate, robust, and reliable for real-world speech and audio processing tasks.
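The preprocessing and feature-extraction stage can be sketched as follows in Python with librosa and SciPy; the band-pass cut-offs, trim threshold, and the choice of 13 mean MFCCs are illustrative assumptions rather than the paper's exact configuration.

import numpy as np
import librosa
from scipy.signal import butter, filtfilt

def preprocess(path, sr=16000, low=300.0, high=3400.0):
    # Load the WAV track extracted from the video at a fixed sampling rate.
    y, sr = librosa.load(path, sr=sr)
    # Band-pass filter to keep the typical speech band (cut-offs assumed).
    b, a = butter(4, [low / (sr / 2), high / (sr / 2)], btype="band")
    y = filtfilt(b, a, y)
    # Trim leading/trailing silence, then peak-normalize the waveform.
    y, _ = librosa.effects.trim(y, top_db=25)
    return y / (np.max(np.abs(y)) + 1e-9), sr

def mfcc_features(y, sr, n_mfcc=13):
    # Frame-level MFCCs averaged over time, giving one fixed-length
    # feature vector per clip for the downstream classifiers.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

Averaging over time is only one common way to obtain fixed-length inputs for the SVM and DNN; the paper may equally have used frame-level or statistics-augmented features.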
Keywords
Speech recognition; Deep Neural Network; Support Vector Machine; MFCC; Automatic Speech Recognition; Word Error Rate
Conclusion
This paper has presented a complete audio-classification and speech-recognition system comprising video-to-audio extraction, preprocessing, MFCC feature analysis, and machine learning models. The comparative analysis of the models revealed that the DNN was more effective than the SVM, reaching 98.46% accuracy against the SVM's 95.12%, along with higher precision and recall. The speech-to-text component also performed well, with a Word Error Rate (WER) of 6.8% and a Character Error Rate (CER) of 3.1%, which constitutes a strong transcription result.
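Both error rates are normalized edit distances, so reported figures such as 6.8% WER and 3.1% CER can be reproduced from transcripts with a few lines of plain Python; the sketch below is a standard formulation, not the evaluation script used in the paper.

def levenshtein(ref, hyp):
    # Dynamic-programming edit distance; prev_row[j] holds the distance
    # between the first i-1 reference items and the first j hypothesis items.
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev_row[j] + 1,              # deletion
                           cur[j - 1] + 1,               # insertion
                           prev_row[j - 1] + (r != h)))  # substitution
        prev_row = cur
    return prev_row[-1]

def wer(reference, hypothesis):
    # Word Error Rate: word-level edits divided by reference word count.
    ref = reference.split()
    return levenshtein(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    # Character Error Rate: character-level edits over reference length.
    return levenshtein(reference, hypothesis) / len(reference)

print(wer("open the door", "open door"))  # one deletion over three words -> 0.33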
These results highlight that deep-learning models handle variations in speech, background noise, and nonlinear audio characteristics more effectively than traditional models. The presented system is thus both efficient and precise, and demonstrates that deep learning is a strong foundation for modern audio and speech-recognition applications.
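A comparison of this kind can be set up in a few lines with scikit-learn; the hyper-parameters below (RBF kernel, two hidden layers of 256 and 128 units) and the synthetic placeholder data are assumptions for illustration, not the paper's actual models or dataset.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 13))    # placeholder: 600 clips x 13 mean MFCCs
y = rng.integers(0, 3, size=600)  # placeholder labels for 3 word categories

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0)),
    "DNN": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(256, 128),
                                       max_iter=500, random_state=0)),
}
for name, model in models.items():
    # Fit each classifier on the same split and compare held-out accuracy.
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))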
References
[1] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., … & Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82-97.
[2] Graves, A., Mohamed, A. R., & Hinton, G. (2013, May). Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 6645-6649). IEEE.
[3] Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
[4] Abdel-Hamid, O., Mohamed, A. R., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10), 1533-1545.
[5] Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. (2015). Listen, attend and spell. arXiv preprint arXiv:1508.01211.
[6] Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., … & Zhu, Z. (2016, June). Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning (pp. 173-182). PMLR.
[7] Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., … & Ng, A. Y. (2014). Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
[8] Graves, A., & Jaitly, N. (2014, June). Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning (pp. 1764-1772). PMLR.
[9] Hershey, S., Chaudhuri, S., Ellis, D. P., Gemmeke, J. F., Jansen, A., Moore, R. C., … & Wilson, K. (2017, March). CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 131-135). IEEE.
[10] Sainath, T. N., Vinyals, O., Senior, A., & Sak, H. (2015, April). Convolutional, long short-term memory, fully connected deep neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4580-4584). IEEE.
[11] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
[12] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., … & Vesely, K. (2011, December). The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.
[13] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
[14] Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449-12460.
[15] Borde, P., Manza, R., Gawali, B., & Yannawar, P. (2016). vVISWa – A multilingual multi-pose audio visual database for robust human computer interaction. International Journal of Computer Applications, 137(4), 25-31.