
Electrotechnical and Computer Engineering

Vol. 38 No. 03 (2023): Proceedings of Faculty of Technical Sciences

TOPIC MODELING IN TEXT BASED ON TITLES OF DOCUMENTS

DOI: https://doi.org/10.24867/22BE11Lepar
Submitted: October 26, 2022
Published: 2023-03-05

Abstract

The paper presents an approach to topic modeling and document classification. Concretely, it explores: 1) the application of LDA (Latent Dirichlet Allocation) to extract topics from text, evaluated qualitatively by inspecting the semantics of the topics found; 2) document classification in which documents are represented by tf-idf features together with topics extracted by LSA (Latent Semantic Analysis), and a Naive Bayes classifier is trained on that representation; 3) document classification in which the tf-idf representation alone is used to train SVM (Support Vector Machine) and RF (Random Forest) models. In both classification settings, the F-measure was used for evaluation.
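
The tf-idf representation and the F-measure mentioned in the abstract can be illustrated with a minimal standard-library sketch. The abstract does not state which tf-idf variant or which F-measure averaging the paper used, so the unsmoothed formula idf = log(N / df), the per-class F1 score, the function names `tfidf` and `f_measure`, and the example titles are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumes raw term frequency and idf = log(N / df); library
    implementations often add smoothing and normalization.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def f_measure(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical document titles, invented for illustration only.
titles = [
    "stock market falls",
    "team wins football match",
    "market prices rise",
]
vectors = tfidf(titles)
```

In this sketch a term that appears in fewer titles, such as "stock", receives a higher weight than one shared across titles, such as "market". A classifier such as Naive Bayes, SVM, or Random Forest would then be trained on these vectors and scored per class with `f_measure`.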
