ANALIZA I OBRADA TEKSTA POMOĆU RAZLIČITIH MODELA TEMA (Text Analysis and Processing Using Different Topic Models)

Olivera Hrnjaković

doi:10.24867/06BE29Hrnjakovic

Electrotechnical and Computer Engineering

Vol. 35 No. 01 (2020): Proceedings of the Faculty of Technical Sciences

TOPIC MODELS FOR TEXT ANALYSIS

06BE29Hrnjakovic.pdf (Serbian)

DOI: https://doi.org/10.24867/06BE29Hrnjakovic
Submitted: December 28, 2019
Published: December 28, 2019

Abstract

This paper describes the capabilities and limitations of existing topic modeling algorithms. It gives a theoretical overview of popular topic models, along with the text analysis and preprocessing steps that should be applied to the input data. In the practical part, topics are extracted from questions posted on Stack Overflow. The LSA, PLSA and LDA approaches were applied and evaluated using coherence, perplexity, topic naming techniques and visualization of the topics in space. To get the best topic modeling performance, we estimated the optimal number of topics. The results showed that the best model is LDA and the best number of topics is 6. To obtain a numerical evaluation of model performance, 30 questions were manually annotated with the names of the extracted topics and used as a test data set for the LDA model, which was thereby treated as a classification model. The accuracy of this classification model was 77%.
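The classification step described in the abstract can be illustrated with a minimal sketch. It assumes that a trained topic model yields a probability distribution over topics for each question, and that each question is assigned the name of its most probable topic. The topic names, the distributions, and the manual annotations below are hypothetical values chosen for illustration; they are not data from the paper, and the paper's own implementation may differ.

```python
from typing import Dict, List, Sequence


def dominant_topic(topic_probs: Sequence[float]) -> int:
    """Return the index of the topic with the highest probability."""
    return max(range(len(topic_probs)), key=lambda k: topic_probs[k])


def classify(doc_topic_probs: List[Sequence[float]],
             topic_names: Dict[int, str]) -> List[str]:
    """Label each document with the name of its dominant topic."""
    return [topic_names[dominant_topic(probs)] for probs in doc_topic_probs]


def accuracy(predicted: Sequence[str], annotated: Sequence[str]) -> float:
    """Fraction of documents whose predicted topic matches the manual label."""
    if len(predicted) != len(annotated):
        raise ValueError("predicted and annotated must have the same length")
    if not annotated:
        raise ValueError("at least one annotated document is required")
    correct = sum(p == a for p, a in zip(predicted, annotated))
    return correct / len(annotated)


# Hypothetical topic names (the paper reports 6 topics; 3 are used here for brevity).
TOPIC_NAMES = {0: "web", 1: "data", 2: "files"}

# Hypothetical per-question topic distributions, as a topic model might output them.
doc_topic_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
]

# Hypothetical manual annotations for the same four questions.
annotated = ["web", "data", "files", "data"]

predicted = classify(doc_topic_probs, TOPIC_NAMES)
print(predicted)
print(accuracy(predicted, annotated))
```

With a test set of 30 questions, an accuracy of 77% is consistent with 23 correct assignments, since 23/30 ≈ 0.767.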
