AUTOMATIC BOOK TOPIC DETECTION USING NATURAL LANGUAGE PROCESSING
DOI:
https://doi.org/10.24867/06BE45Djurdjevic
Keywords:
Latent Dirichlet Allocation, Named Entity Recognition
Abstract
This paper presents a performance analysis of a Latent Dirichlet Allocation (LDA) model built to determine topics from a corpus of books. It analyzes in detail four crucial steps in implementing the model: data preprocessing, Named Entity Recognition (NER), determining the optimal number of topics, and choosing the best implementation algorithm. For each step, several methods for overcoming the problems that arise are demonstrated. The results obtained with each method are presented and discussed in detail. Finally, the optimal method for each step is selected for inclusion in the resulting model.
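To make the pipeline more concrete, the following is a minimal sketch of two of the steps named above, preprocessing and LDA topic inference, written in plain Python. It is not the implementation evaluated in the paper: the stop-word list, the toy book descriptions, the hyperparameters `alpha` and `beta`, the fixed choice of two topics, and the use of collapsed Gibbs sampling are all assumptions made for illustration, and the NER step is omitted entirely.

```python
import random
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "his", "her"}


def preprocess(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # total tokens in topic k
    assignments = []

    # Assign a random initial topic to every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + beta * vocab_size)
                    for j in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    theta = [
        [(n + alpha) / (len(doc) + alpha * num_topics) for n in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    # Up to three most frequent words per topic, ignoring words with zero count.
    top_words = [[w for w, c in tw.most_common(3) if c > 0] for tw in topic_word]
    return theta, top_words


# Toy stand-ins for book descriptions (invented for this example).
raw_books = [
    "The wizard cast a spell in the castle of the dragon",
    "A dragon and a wizard fought in the castle",
    "The detective found a clue to the murder",
    "A murder clue led the detective to the killer",
]
docs = [preprocess(text) for text in raw_books]
theta, top_words = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
```

In this sketch the number of topics is simply fixed at two. Selecting it in a principled way, for example by refitting over a range of values and comparing a topic coherence score, is one of the steps the paper analyzes and is not reproduced here.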
Published
2019-12-30
Section
Electrotechnical and Computer Engineering