ANALYSIS AND FINE-TUNING OF THE Wav2vec 2.0 MODEL FOR SERBIAN SPEECH RECOGNITION
DOI:
https://doi.org/10.24867/30BE40Banovac
Keywords:
Speech recognition, self-supervised learning, Wav2vec model
Abstract
This paper focuses on the application of the Wav2vec 2.0 model to speech recognition in Serbian. It analyzes the original model, pre-trained with a self-supervised learning technique on unlabeled data, and describes the fine-tuning phase for Serbian. The original implementation, pre-trained on 53,000 hours of unlabeled audio and fine-tuned on only 10 minutes of labeled data, achieved WERs of 4.8%/8.2% on the LibriSpeech test-clean/test-other sets [1], demonstrating the effectiveness of the approach. The goal is to explore how well these methods transfer to Serbian audio data. The fine-tuned Serbian model achieves a WER of 10.3% and a CER of 3.4%, leaving room for further improvement.
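To make the fine-tuning phase concrete, below is a minimal sketch of preparing wav2vec 2.0 for CTC fine-tuning on Serbian with the Hugging Face transformers library. The checkpoint name, the "vocab.json" file, and the special tokens are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch of CTC fine-tuning setup for Serbian (assumed configuration).
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# Character-level vocabulary built from the Serbian transcripts in a
# preprocessing step ("vocab.json" is a hypothetical artifact).
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)

# wav2vec 2.0 consumes raw 16 kHz mono waveforms.
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16_000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=True,
)
processor = Wav2Vec2Processor(
    feature_extractor=feature_extractor, tokenizer=tokenizer
)

# Load a self-supervised pre-trained checkpoint and attach a randomly
# initialized CTC head sized to the Serbian character vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",  # assumed multilingual checkpoint
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

# Freeze the convolutional feature encoder, as in the original recipe [1];
# only the Transformer layers and the CTC head are updated.
model.freeze_feature_encoder()
```

A full training run would additionally need a padding data collator and a training loop over labeled Serbian speech, such as the corpora cited in [2]–[4]; those details are omitted here. The reported WER and CER metrics can be computed with a standard edit-distance package such as jiwer (an assumed tooling choice; the strings below are placeholders, not data from the paper):

```python
import jiwer

reference = "dobar dan svima"    # ground-truth transcript (placeholder)
hypothesis = "dobar dan svi ma"  # model output (placeholder)

print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")  # word-level edit distance
print(f"CER: {jiwer.cer(reference, hypothesis):.3f}")  # character-level edit distance
```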
References
[1] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” Oct. 22, 2020, arXiv: arXiv:2006.11477. Accessed: Sep. 24, 2024. [Online]. Available: http://arxiv.org/abs/2006.11477
[2] “Common Voice.” Accessed: Sep. 24, 2024. [Online]. Available: https://commonvoice.mozilla.org/en
[3] “ASR training dataset for Serbian JuzneVesti-SR v1.0.” Accessed: Sep. 24, 2024. [Online]. Available: https://www.clarin.si/repository/xmlui/handle/11356/1679
[4] “Parliamentary spoken corpus of Serbian ParlaSpeech-RS 1.0.” Accessed: Sep. 24, 2024. [Online]. Available: https://www.clarin.si/repository/xmlui/handle/11356/1834
[5] A. Vaswani et al., “Attention Is All You Need,” 2017, arXiv. doi: 10.48550/ARXIV.1706.03762.
[6] A. van den Oord, Y. Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding,” Jan. 22, 2019, arXiv: arXiv:1807.03748. Accessed: Sep. 24, 2024. [Online]. Available: http://arxiv.org/abs/1807.03748
[7] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning - ICML ’06, Pittsburgh, Pennsylvania: ACM Press, 2006, pp. 369–376. doi: 10.1145/1143844.1143891.
Published
2025-04-04
Section
Electrotechnical and Computer Engineering