Acta Scientific Medical Sciences (ASMS)(ISSN: 2582-0931)

Research Article Volume 6 Issue 2

Assessing the Impact of Unbalance in Data on Predicting Breast Cancer Occurrence Using Machine Learning Models

Shuning Yin, Raji Sundararajan* and Gaurav Nanda

School of Engineering Technology, Purdue University, West Lafayette, USA

*Corresponding Author: Raji Sundararajan, School of Engineering Technology, Purdue University, West Lafayette, USA.

Received: December 03, 2021; Published: January 25, 2022


With more than 2 million new cases and over 600,000 deaths each year, breast cancer is the most common cancer among women worldwide. Early detection and treatment can save thousands of lives. To support this goal, we used the WEKA machine learning library and the Breast Cancer Surveillance Consortium (BCSC) dataset, which contains 154,899 screening records. The dataset has twelve variables, and we treated the variable “breast_cancer_history” as the target to be predicted. We examined four machine learning (ML) classifiers, Naïve Bayes, Logistic Regression, Multilayer Perceptron, and Support Vector Machine, under five different test conditions. Because the dataset is heavily unbalanced, with far fewer cancer cases (“Class 1”) than non-cancer cases (“Class 0”), we compared the prediction performance of these algorithms on “unbalanced” and “balanced” datasets, both created using stratified sampling. The unbalanced dataset included all the original screening data (154,899 cases), and the balanced dataset included the same number of cases for class 0 and class 1 (13,279 each). Of the four ML classifiers, Multilayer Perceptron had the best prediction performance on both the unbalanced and the balanced datasets. Overall, all four classifiers produced better prediction results on the balanced dataset than on the unbalanced dataset. For medical decision support, the prediction outputs of ML models trained on both balanced and unbalanced data can be used.

Keywords: Breast Cancer; Machine Learning; Data Analysis; WEKA; Prediction
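
As an illustration of how a balanced dataset with equal class counts could be derived from an unbalanced one, the following is a minimal Python sketch. It is not the authors' WEKA procedure, whose exact settings are not given here. It assumes random undersampling of the larger class down to the size of the smaller class; the function name `balance_by_undersampling`, the fixed seed, and the synthetic records are hypothetical, and only the class counts (154,899 total, 13,279 in class 1) come from the abstract.

```python
import random
from collections import Counter


def balance_by_undersampling(records, label_key="breast_cancer_history", seed=0):
    """Return a class-balanced list by randomly downsampling every class
    to the size of the smallest class (an assumed procedure, not WEKA's)."""
    rng = random.Random(seed)

    # Group the records by their class label.
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)

    # Every class is reduced to the size of the smallest class.
    n_min = min(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))  # sampling without replacement
    rng.shuffle(balanced)
    return balanced


# Synthetic stand-in records that only reproduce the reported class counts.
N_TOTAL, N_CLASS_1 = 154_899, 13_279
records = (
    [{"breast_cancer_history": 1} for _ in range(N_CLASS_1)]
    + [{"breast_cancer_history": 0} for _ in range(N_TOTAL - N_CLASS_1)]
)

balanced = balance_by_undersampling(records)
counts = Counter(rec["breast_cancer_history"] for rec in balanced)
print(counts[0], counts[1], len(balanced))
```

Under these assumptions, both classes end up with 13,279 records, matching the balanced dataset size described in the abstract. Other balancing strategies, such as oversampling the minority class, would yield a different dataset size.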




Citation: Shuning Yin., et al. “Assessing the Impact of Unbalance in Data on Predicting Breast Cancer Occurrence Using Machine Learning Models”. Acta Scientific Medical Sciences 6.2 (2022): 159-170.


Copyright: © 2022 Shuning Yin., et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

