TOPIC MODELS FOR TEXT ANALYSIS

Authors

  • Olivera Hrnjaković (Author)

DOI:

https://doi.org/10.24867/06BE29Hrnjakovic

Keywords:

Topic modeling, text analysis, LDA

Abstract

This paper describes the current capabilities and limitations of existing topic modeling algorithms. It gives a theoretical overview of popular topic models, along with the analysis and text processing steps that should be performed on the input data. In the practical part, topics are extracted from questions posted on the Stack Overflow site. The LSA, PLSA and LDA approaches were applied and evaluated using coherence, perplexity, topic naming and visualization of the topics in space. To get the best topic modeling performance, the optimal number of topics was estimated. The results showed that the best model is LDA and the best number of topics is 6. To obtain a numerical evaluation of model performance, 30 questions were manually annotated with the names of the extracted topics and used as a test data set, with the LDA model treated as a classification model. The accuracy of this classification model was 77%.
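
The evaluation step described above, in which a topic model is treated as a classifier and scored against manual annotations, can be illustrated with a minimal sketch. The abstract does not state how a question is assigned to a topic, so the sketch assumes each document receives its highest-probability topic. The function names, the three-topic distributions and the labels are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def dominant_topic(topic_probs: Sequence[float]) -> int:
    """Return the index of the highest-probability topic for one document.

    Assumption: a document is "classified" as its dominant topic.
    Ties resolve to the lowest topic index.
    """
    return max(range(len(topic_probs)), key=lambda k: topic_probs[k])


def accuracy(predicted: Sequence[int], annotated: Sequence[int]) -> float:
    """Fraction of documents whose predicted topic matches the manual label."""
    if len(predicted) != len(annotated):
        raise ValueError("predicted and annotated must have the same length")
    if not annotated:
        raise ValueError("at least one annotated document is required")
    correct = sum(p == a for p, a in zip(predicted, annotated))
    return correct / len(annotated)


# Hypothetical per-document topic distributions, as an LDA model might
# produce them (one row per test question, one column per topic).
doc_topic: List[List[float]] = [
    [0.70, 0.10, 0.20],
    [0.05, 0.90, 0.05],
    [0.30, 0.30, 0.40],
    [0.60, 0.25, 0.15],
]

# Hypothetical manual annotations: the topic index a human assigned
# to each of the same questions.
manual_labels: List[int] = [0, 1, 2, 1]

predicted = [dominant_topic(row) for row in doc_topic]
acc = accuracy(predicted, manual_labels)
print(predicted, acc)
```

In this toy data three of the four questions receive the annotated topic. With a test set of 30 questions, as in the abstract, the same function would report the share of questions whose dominant topic matches the manual annotation.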

Published

2019-12-28

Issue

Section

Electrotechnical and Computer Engineering