Evaluating Text Preprocessing Methods for Discovering Quality Topics to
Improve the Information Retrieval Mechanism
Lakshmi Sonkusale1, Krishna Kumar Chaturvedi2*, Anu Sharma2, Shashi Bhushan Lal2, Mohammad Samir Farooqi3, Achal Lama4, Dwijesh Chandra Mishra4, Pratibha Joshi5, Murari Kumar1
1Ph.D. Scholar, The Graduate School, ICAR-Indian Agricultural Research Institute,
New Delhi, India
2Principal Scientist, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
3Senior Scientist, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
4Scientist, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
5Scientist, ICAR-Indian Agricultural Research Institute, New Delhi, India
*Corresponding Author: Krishna Kumar Chaturvedi, Principal Scientist, ICAR-Indian
Agricultural Statistics Research Institute, New Delhi, India.
July 24, 2023; Published: August 14, 2023
Topic discovery extracts the underlying semantic structure from a large collection of unstructured text. It
is a convenient way to organize unclassified text into topic clusters that can be used to classify documents. A topic is
a set of words that frequently occur together and that assigns the text to a specific category. Topic discovery can group words
with similar meanings and distinguish between uses of words with multiple meanings. It is an important and challenging task in
the information retrieval process. This paper discusses the effect of different text preprocessing methods on determining the number
of topics with Latent Dirichlet Allocation (LDA), a statistical modelling approach for analysing unclassified text into useful topics.
This will help in developing new computational methods to identify topics from text datasets. In this study, the effect of text
preprocessing methods on a collection of research articles is explored by applying grid search for hyperparameter
optimization, and the resulting topics are evaluated using coherence score and topic score. The study suggests that preprocessing
affects both the number of topics and their quality. The findings will help in enhancing information retrieval mechanisms
based on the identified topics and will also be useful in recommending related research articles to researchers.
Keywords: Topic Model; Hyperparameters; Topic Discovery; Latent Dirichlet Allocation (LDA); Grid Search
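
The following minimal sketch illustrates how a preprocessing step and a coherence score fit together. It is not the study's implementation: the abstract does not say which coherence measure or preprocessing steps were used, so the sketch assumes a UMass-style coherence averaged over word pairs, and the stop-word list, the toy documents, and the function names `preprocess` and `umass_coherence` are hypothetical.

```python
import math
import re
from itertools import combinations

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "in", "and", "is", "a", "to", "for"}


def preprocess(text, remove_stop_words=True):
    """Lowercase the text, keep alphabetic tokens, optionally drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def umass_coherence(top_words, tokenized_docs):
    """UMass-style coherence (assumed measure): the mean over ordered word
    pairs of log((D(w_i, w_j) + 1) / D(w_j)), where D counts the documents
    containing the given words and w_j is ranked above w_i."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for j, i in combinations(range(len(top_words)), 2):  # j < i
        d_j = doc_freq(top_words[j])
        if d_j == 0:
            continue  # skip pairs whose conditioning word never occurs
        co = doc_freq(top_words[i], top_words[j])
        scores.append(math.log((co + 1) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy corpus standing in for a collection of research articles.
docs = [
    "Rice yield in the rainfed regions of India",
    "Wheat and rice yield forecasting",
    "Topic models for the retrieval of documents",
    "Retrieval of research documents",
]
tokenized = [preprocess(d) for d in docs]

# Words that co-occur score higher than words that never share a document.
coherent = umass_coherence(["rice", "yield"], tokenized)
mixed = umass_coherence(["rice", "retrieval"], tokenized)
print(coherent > mixed)
```

In a grid search, a score of this kind would be computed for the top words of every topic under each candidate setting (for example, each number of topics), and the setting with the highest average coherence would be retained.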