AUTOMATIC BOOK TOPIC DETECTION USING NATURAL LANGUAGE PROCESSING

Authors

  • Vlada Đurđević (Author)

DOI:

https://doi.org/10.24867/06BE45Djurdjevic

Keywords:

Latent Dirichlet Allocation, Named Entity Recognition

Abstract

This paper presents a performance analysis of a Latent Dirichlet Allocation (LDA) model built to determine topics from a corpus of books. It analyzes in detail four crucial steps in implementing the model: data preprocessing, named entity recognition (NER), determining the optimal number of topics, and choosing the best implementation algorithm. For each step, several methods for overcoming the problems that arise are demonstrated, and the results obtained with each method are presented and discussed in detail. Finally, the optimal method for each step is chosen to be part of the resulting model.

Published

2019-12-30

Issue

Section

Electrotechnical and Computer Engineering