Acta Scientific COMPUTER SCIENCES

Research Article Volume 5 Issue 9

Evaluating Text Preprocessing Methods for Discovering Quality Topics to Improve the Information Retrieval Mechanism

Lakshmi Sonkusale1, Krishna Kumar Chaturvedi2*, Anu Sharma2, Shashi Bhushan Lal2, Mohammad Samir Farooqi3, Achal Lama4, Dwijesh Chandra Mishra4, Pratibha Joshi5, Murari Kumar1

1Ph.D. Scholar, The Graduate School, ICAR-Indian Agricultural Research Institute, New Delhi, India
2Principal Scientist, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
3Senior Scientist, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
4Scientist, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
5Scientist, ICAR-Indian Agricultural Research Institute, New Delhi, India

*Corresponding Author: Krishna Kumar Chaturvedi, Principal Scientist, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India.

Received: July 24, 2023; Published: August 14, 2023

Abstract

Topic discovery aims to extract the underlying semantic structure from large collections of unstructured text. It offers a convenient way to organise unclassified text into topic clusters that can be used to classify documents. A topic is a set of words that frequently occur together and that assigns a text to a specific category. Topic discovery can group words with similar meanings and distinguish between uses of words with multiple meanings. It is an important and challenging task that is useful in the information retrieval process. This paper discusses different text preprocessing methods and uses Latent Dirichlet Allocation (LDA) to determine the number of topics. This will help in developing new computational methods to identify topics from text datasets. LDA is a statistical modelling approach for analysing unclassified text into useful topics. In this study, the effect of text preprocessing methods on obtaining quality topics from collected research articles is explored by applying a grid search for hyperparameter optimization, and the results are evaluated using the coherence score and topic score. The study suggests that preprocessing affects both the number of topics and their quality. The findings will help enhance information retrieval mechanisms based on the identified topics and will also be useful for recommending related research articles to researchers.

Keywords: Topic Model; Hyperparameters; Topic Discovery; Latent Dirichlet Allocation (LDA); Grid Search
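
The workflow summarised in the abstract (preprocess the text, fit topics over a grid of hyperparameters, keep the setting with the best coherence) can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the stop-word list, the function names, the choice of the UMass coherence measure, and the `fit_topics` callable (a hypothetical stand-in for an LDA trainer from a topic modelling library) are all assumptions introduced here.

```python
import math
import re
from itertools import product

# Tiny illustrative stop-word list (an assumption; real studies use fuller lists).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "for", "on", "are"}


def preprocess(text, remove_stop_words=True, min_length=3):
    """Lowercase, keep alphabetic tokens, optionally drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return [t for t in tokens if len(t) >= min_length]


def umass_coherence(topic_words, tokenized_docs):
    """UMass coherence: sum over word pairs of log((D(w_i, w_j) + 1) / D(w_j)),
    where D counts the documents that contain all the given words."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            denom = doc_freq(w_j)
            if denom:  # skip pairs whose conditioning word never occurs
                score += math.log((doc_freq(w_i, w_j) + 1) / denom)
    return score


def grid_search(docs, grid, fit_topics):
    """Try every hyperparameter combination and return (best_score, best_params).

    `fit_topics(docs, **params)` is a hypothetical callable that returns a list
    of topics, each a list of top words. The score is the mean UMass coherence.
    """
    best = None
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        topics = fit_topics(docs, **params)
        score = sum(umass_coherence(t, docs) for t in topics) / len(topics)
        if best is None or score > best[0]:
            best = (score, params)
    return best
```

In use, `docs` would be the list of token lists returned by `preprocess` under one preprocessing variant, and `grid` a dictionary such as `{"num_topics": [5, 10, 15], "alpha": [0.01, 0.1]}`. Repeating the search for each preprocessing variant makes it possible to compare the number and coherence of the topics obtained under each one.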


Citation

Citation: Krishna Kumar Chaturvedi., et al. “Evaluating Text Preprocessing Methods for Discovering Quality Topics to Improve the Information Retrieval Mechanism”. Acta Scientific Computer Sciences 5.9 (2023): 03-08.

Copyright

Copyright: © 2023 Krishna Kumar Chaturvedi., et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.



