ANALYSIS AND FINE-TUNING OF THE Wav2vec 2.0 MODEL FOR SERBIAN SPEECH RECOGNITION

Authors

  • Тамара Бановац

DOI:

https://doi.org/10.24867/30BE40Banovac

Keywords:

Speech recognition, self-supervised learning, Wav2vec model

Abstract

This paper focuses on the application of the Wav2vec 2.0 model to speech recognition in Serbian. It analyzes the original model, pretrained with a self-supervised learning technique on unlabeled data, and describes the fine-tuning phase for Serbian. The original implementation, pretrained on 53,000 hours of unlabeled audio and fine-tuned on only 10 minutes of labeled data, achieved a WER of 4.8%/8.2% on the LibriSpeech clean/other test sets, demonstrating the effectiveness of these methods. The goal is to explore their applicability to Serbian audio data. The results for Serbian show a WER of 10.3% and a CER of 3.4%, leaving room for further improvement.
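As an illustration of the fine-tuning setup the abstract describes, the sketch below attaches a character-level CTC head to a pretrained wav2vec 2.0 checkpoint using the Hugging Face Transformers API. The checkpoint name, the Latin-script vocabulary, and the file paths are illustrative assumptions only; the paper does not publish its exact configuration.

import json
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# Character-level vocabulary for Serbian (Latin script; illustrative only --
# a real setup would derive it from the training transcripts).
chars = "abcdefghijklmnopqrstuvwxyzčćđšž"
vocab = {c: i for i, c in enumerate(chars)}
vocab["|"] = len(vocab)        # CTC word delimiter used in place of space
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)    # also serves as the CTC blank token
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# XLSR-53 is a public multilingual wav2vec 2.0 checkpoint (an assumed choice;
# any pretrained wav2vec 2.0 checkpoint could be substituted here).
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()   # common practice: keep the CNN feature encoder frozen

Freezing the convolutional feature encoder and training only the Transformer layers and the new CTC head follows the recipe of the original paper [1]; the randomly initialized output head is then trained on the labeled Serbian data.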
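The reported WER and CER can be computed for any transcript pair with a standard edit-distance tool. The snippet below uses the jiwer package as one assumed choice (the paper does not name its evaluation library), on made-up Serbian strings.

import jiwer

reference = "dobar dan svima"    # ground-truth transcript (made up)
hypothesis = "dobar dan svama"   # model output (made up)

wer = jiwer.wer(reference, hypothesis)   # (S + D + I) / number of reference words
cer = jiwer.cer(reference, hypothesis)   # same edit distance, over characters
print(f"WER: {wer:.1%}, CER: {cer:.1%}")  # here: WER 33.3%, CER 6.7%

WER counts word-level substitutions, deletions, and insertions against the reference, while CER applies the same formula over characters; a single wrong character penalizes a whole word under WER but only one symbol under CER, which is consistent with the Serbian CER (3.4%) being much lower than the WER (10.3%).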

References

[1] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” arXiv:2006.11477, Oct. 2020. Accessed: Sep. 24, 2024. [Online]. Available: http://arxiv.org/abs/2006.11477
[2] “Common Voice.” Accessed: Sep. 24, 2024. [Online]. Available: https://commonvoice.mozilla.org/en
[3] “ASR training dataset for Serbian JuzneVesti-SR v1.0.” Accessed: Sep. 24, 2024. [Online]. Available: https://www.clarin.si/repository/xmlui/handle/11356/1679
[4] “Parliamentary spoken corpus of Serbian ParlaSpeech-RS 1.0.” Accessed: Sep. 24, 2024. [Online]. Available: https://www.clarin.si/repository/xmlui/handle/11356/1834
[5] A. Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762, 2017. doi: 10.48550/arXiv.1706.03762.
[6] A. van den Oord, Y. Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding,” arXiv:1807.03748, Jan. 2019. Accessed: Sep. 24, 2024. [Online]. Available: http://arxiv.org/abs/1807.03748
[7] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning - ICML ’06, Pittsburgh, Pennsylvania: ACM Press, 2006, pp. 369–376. doi: 10.1145/1143844.1143891.

Published

2025-04-04

Section

Electrotechnical and Computer Engineering