TOPIC MODELING IN TEXT BASED ON TITLES OF DOCUMENTS
DOI:
https://doi.org/10.24867/22BE11Lepar
Keywords:
topic modeling, LDA, classification, SVM, Naive Bayes, Random Forest
Abstract
The paper presents an approach to topic modeling and document classification. Concretely, it explores: 1) the application of LDA (Latent Dirichlet Allocation) to extract topics from text, evaluated qualitatively based on the semantics of the discovered topics; 2) document classification in which documents were represented using tf-idf features and topics extracted by applying LSA (Latent Semantic Analysis), with a Naive Bayes classifier trained on the resulting representation and evaluated using the F-measure; and 3) document classification in which the tf-idf representation was used directly to train SVM (Support Vector Machine) and RF (Random Forest) models, again evaluated using the F-measure.
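Two building blocks recur across the classification experiments described above: the tf-idf document representation and the F-measure used for evaluation. The following is a minimal standard-library sketch of both, not the authors' implementation. The idf variant (natural log of N over document frequency), the macro averaging of per-class F-measures, the function names, and the toy titles are all assumptions made for illustration; the abstract does not specify these details.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: raw term count times ln(N / document frequency).
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def f1_for_label(y_true, y_pred, label):
    """F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F-measures (averaging is an assumption)."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


# Hypothetical toy titles, already tokenized.
titles = [
    "stock market falls".split(),
    "market rally continues".split(),
    "team wins final match".split(),
]
vectors = tfidf(titles)

# Hypothetical gold labels and predictions for four documents.
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

In this sketch, a term that appears in every document receives weight zero, while terms unique to one title receive the highest weight. In practice the resulting vectors would be passed to a classifier such as an SVM or Random Forest, or first reduced with LSA, as the abstract outlines.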