Nannuru Jyothirmayee1, Reddy Swetha Harini1, Kaipa Chandana Sree2 and Potireddy Suvarnalatha Devi3
1Department of Applied Microbiology, Sri Padmavati Mahila Visvavidyalayam, Tirupati, Andhra Pradesh, India
2Department of Computer Science, Sri Padmavati Mahila Visvavidyalayam, Tirupati, Andhra Pradesh, India
3Professor, Department of Applied Microbiology, Sri Padmavati Mahila Visvavidyalayam, Tirupati, Andhra Pradesh, India
*Corresponding Author: Potireddy Suvarnalatha Devi, Professor, Department of Applied Microbiology, Sri Padmavati Mahila Visvavidyalayam, Tirupati, Andhra Pradesh, India.
Received: August 29, 2024; Published: September 12, 2024
Citation: Potireddy Suvarnalatha Devi., et al. “Artificial Intelligence-Driven Analysis of the Prevalence and Impact of Polycystic Ovary Syndrome (PCOS) Among University Students: A Comprehensive Survey”. Acta Scientific Women's Health 6.10 (2024):10-17.
Polycystic Ovary Syndrome (PCOS) is one of the most prevalent endocrine disorders affecting women during their reproductive years, with a prevalence of 6-20%. It is diagnosed when a woman presents with oligomenorrhea or amenorrhea, hyperandrogenism, hirsutism and polycystic ovaries. It is associated with obesity, type 2 diabetes, hypertension, cardiovascular disease, anxiety, depression, infertility and endometrial cancer. The disorder is often underdiagnosed in women who already have it, which leads to metabolic, hormonal and psychological complications. A survey on the prevalence and impact of PCOS among university students was conducted using a Google Forms questionnaire covering various parameters. The dataset used in this study includes clinical features such as age, height, menstrual cycle regularity, excessive hair growth, obesity, and other relevant health indicators. Before modeling, the data were preprocessed by handling missing values, encoding categorical variables and scaling numerical features. Since this is a binary classification problem (PCOS or not), we trained Logistic Regression and Random Forest models. Logistic Regression is a linear model that predicts the probability of a binary outcome from one or more variables. Random Forest, developed by Leo Breiman and Adele Cutler, is an ensemble method that builds multiple decision trees and outputs the most popular class (classification) or the average prediction (regression). After comparing the performance of both models, we selected the Random Forest algorithm for predicting PCOS on new data.
Keywords: Oligomenorrhea; Hyperandrogenism; Hirsutism; Polycystic Ovaries; Logistic Regression; Random Forest
AI: Artificial Intelligence; FP: False Positives; FN: False Negatives; GUI: Graphical User Interface; ML: Machine Learning; PCOS: Polycystic Ovary Syndrome; TP: True Positives; TN: True Negatives
Polycystic ovary syndrome (PCOS) is one of the most prevalent reproductive disorders, which is both heterogeneous and endocrine in nature, affecting women during their reproductive years. The prevalence of PCOS is estimated to range from 6% to 20% globally [1,2]. This condition is generally characterized by anovulation, leading to irregular menstrual cycles, and the overproduction of ovarian androgens. Several health implications are associated with PCOS, including obesity and type 2 diabetes, which are often due to insulin resistance. Insulin resistance can stimulate disorders related to hormones and contribute to inflammation and oxidative stress [10,11]. Additionally, women with PCOS are at an increased risk of cardiovascular diseases and other reproductive issues such as infertility, pregnancy complications, miscarriage, and neonatal problems. There is also a heightened risk of developing cancers, particularly of the endometrium and breast. Psychologically, PCOS can lead to anxiety, depression, hypersomnia, and stress [1,12,13]. The diagnostic criteria for PCOS in women typically include the presence of at least two of the following conditions: oligoovulation or anovulation, hyperandrogenism, hirsutism, polycystic ovaries, and elevated levels of androgen [3-5]. While the exact cause of PCOS remains unknown, various genetic and environmental factors are thought to contribute to its development [6]. The prevalence of PCOS varies between countries and is influenced by differences in biochemical and clinical characteristics, which can differ among age groups and ethnicities [7,8]. Notably, around 90% of women with PCOS experience infertility, primarily due to anovulation [9]. Early diagnosis and timely treatment of PCOS are crucial for preventing both short-term and long-term complications associated with the condition [14]. 
It is estimated that about one-third of women with PCOS experience a delay in diagnosis of more than two years, and this may even be an underestimate [15]. Recent advancements in artificial intelligence (AI) have revolutionized the diagnosis and treatment of diseases, including PCOS [16]. AI is defined as the simulation of human intelligence through computer-based systems [17]. Machine learning (ML), a subset of AI, focuses on learning from past data to make future predictions and decisions [18]. ML techniques can be broadly categorized into two types: supervised and unsupervised learning [19]. Over the past decade, significant breakthroughs in AI and ML have enhanced our ability to diagnose and treat PCOS. AI's capacity to analyze vast amounts of heterogeneous data makes it particularly suitable for diagnosing complex conditions like PCOS [20]. According to the European Society of Human Reproduction and Embryology (ESHRE) and the American Society for Reproductive Medicine (ASRM), the widely accepted Rotterdam criteria, established in 2003, require the presence of at least two of the following three features for a PCOS diagnosis: anovulation or oligoovulation, hyperandrogenism (which can be clinical or biochemical), and the presence of polycystic ovaries as observed via ultrasound [21]. Additionally, the Androgen Excess and PCOS Society defines PCOS as the presence of hyperandrogenism along with either polycystic ovaries or ovarian dysfunction [22]. Recently, the NIH Evidence-Based Methodology Workshop on PCOS endorsed the Rotterdam criteria from 2003, emphasizing the need to identify specific phenotypes within the diagnosis [23,56].
To diagnose PCOS correctly, clinicians must rule out other endocrinopathies that mimic it, such as drug-induced androgen excess, nonclassical adrenal hyperplasia, Cushing’s syndrome and androgen-producing tumors [55].
Clinical hyperandrogenism in women is commonly manifested by signs such as androgenic alopecia, hirsutism, and acne, which are indicative of elevated androgen levels. Androgenic alopecia, a form of hair loss, presents in a pattern typically observed in males, and can also occur in women, particularly those with hyperandrogenism. Hirsutism, characterized by the excessive growth of terminal hair in a male pattern distribution, is observed in approximately 60-70% of women with Polycystic Ovary Syndrome (PCOS) [24]. In women with PCOS, hirsutism is considered the most reliable clinical marker of hyperandrogenism, commonly assessed using the modified Ferriman-Gallwey (mFG) score [25]. In addition to hirsutism, patients with PCOS frequently report androgenic alopecia and acne as common dermatological manifestations of increased androgen levels [26,27].
The typical menstrual cycle in adults spans approximately 28 days, with a normal range of 21 to 35 days. Women with PCOS often experience menstrual irregularities, most commonly presenting as oligo-amenorrhea, defined as infrequent menstruation with cycles longer than 35 days or fewer than eight menstrual periods per year. While less common, polymenorrhea (menstrual cycles shorter than 21 days) is also included in the diagnostic criteria for PCOS as outlined in several clinical guidelines [28-30]. The presence of these irregularities indicates underlying ovulatory dysfunction, a hallmark of PCOS.
The identification of polycystic ovarian morphology (PCOM) has been integral to the diagnosis of PCOS since it was first described by Stein and Leventhal in 1935, based on pathological and surgical assessments that revealed bilaterally enlarged ovaries with a polycystic appearance [31]. According to the 2003 Rotterdam criteria, PCOM is defined as an ovarian volume greater than 10 cm³ in either ovary or the presence of 12 or more follicles measuring 2-9 mm in diameter [32]. Recent updates suggest that a minimum of 25 follicles per ovary and/or an ovarian volume of 10 ml or more are more indicative of PCOS. It is now recommended that follicle count, rather than ovarian volume, be used for a more precise assessment of polycystic ovaries [52].
Androgens play a crucial role in maintaining sexual function, muscle mass, and bone density in women. The primary androgens include testosterone, dihydrotestosterone (DHT), androstenedione (ANSD), dehydroepiandrosterone (DHEA), and dehydroepiandrosterone sulfate (DHEA-S). These hormones are produced by various sources in the body: approximately 25% from the ovaries, 25% from the adrenal glands, and 50% from peripheral tissues [33,34,51]. The ovaries are the main source of testosterone production, while the adrenal glands contribute to the production of DHEA-S. Both the ovaries and adrenal glands are involved in the synthesis of androstenedione (ANSD) [33,34]. In patients with polycystic ovary syndrome (PCOS), the ovaries are the primary cause of hyperandrogenism [51]. Consequently, free testosterone, which is not bound to sex hormone-binding globulin (SHBG), serves as a highly sensitive diagnostic marker for hyperandrogenemia in PCOS patients [35]. Women with PCOS often have reduced levels of SHBG due to the influence of obesity and hyperinsulinemia, both of which are associated with decreased SHBG levels [36]. Studies indicate that elevated levels of free testosterone are found in up to 89% of patients with PCOS and hyperandrogenemia, while elevated total testosterone levels are reported in 49% to 80% of these patients [37-39]. Elevated levels of androstenedione (ANSD) have been found in up to 88% of PCOS patients, potentially increasing the diagnosis rate of hyperandrogenemia by approximately 10% [51].
Adults and adolescents with PCOS are more likely to experience obesity [50]. It remains unclear whether obesity exacerbates PCOS or if PCOS contributes to the development of obesity in women [40]. Multiple studies have shown that more than 60% of patients with PCOS are obese [41]. A study conducted by Glueck., et al. reported that 73% of adolescents with PCOS had a body mass index (BMI) above the 95th percentile [42]. Women with PCOS are also more likely to be overweight or obese, particularly with central or abdominal obesity [43-45]. Moreover, women with PCOS are at a higher risk of developing additional public health issues [52]. These women have higher rates of subclinical hypothyroidism, thyroid autoimmunity, obstructive sleep apnea, and other sleep disorders [46-48]. Psychological conditions, such as anxiety and depression, are also more prevalent among women with PCOS [49]. Insulin resistance is a significant concern, affecting approximately 70% of women with PCOS [53]. According to a meta-analysis, women with PCOS have a higher risk of developing type 2 diabetes mellitus and impaired glucose tolerance compared to women without PCOS [54].
The dataset has been preprocessed to ensure optimal performance for model training. Categorical variables have been encoded, and numerical features have been standardized to bring them to the same scale. Given the binary classification nature of the problem—diagnosed with PCOS (Polycystic Ovary Syndrome) or not—we employed commonly used machine learning algorithms, including Logistic Regression and Random Forest, to build predictive models.
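The encoding and standardization steps mentioned above can be illustrated with a minimal sketch. The source does not say which library or encoding scheme was used, so this version uses only the Python standard library, assumes one-hot encoding, and runs on made-up values; the names `one_hot`, `standardize`, `cycle` and `age` are hypothetical.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode categories as 0/1 indicator rows (columns follow sorted category order)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def standardize(values):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical survey answers, not taken from the study's dataset.
cycle = ["regular", "irregular", "regular", "irregular"]
age = [19, 21, 23, 25]

categories, cycle_encoded = one_hot(cycle)
age_scaled = standardize(age)

print(categories)      # column order of the indicator rows
print(cycle_encoded)
print([round(a, 3) for a in age_scaled])
```

In practice the mean and standard deviation would usually be computed on the training split only and then reused for the test split, so that no information from the test data reaches the model.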
Logistic Regression is a statistical model that employs a logistic function to model a binary dependent variable. Despite its name, it is a linear model for binary classification that estimates the probability of a binary response based on one or more predictor variables. The logistic function maps any real-valued number into a value between 0 and 1, which can then be interpreted as a probability. The performance of the Logistic Regression model was evaluated using several key metrics, yielding the following results:
The model correctly predicted 83.33% of the instances, calculated as the ratio of correctly predicted observations to the total number of observations.
A precision of 33.33% indicates that, out of all the positive predictions made by the model, only 33.33% were actually positive. Precision is defined as the ratio of correctly predicted positive observations to the total predicted positive observations.
A recall of 25.00% indicates that out of all the actual positive instances (cases of PCOS), the model correctly identified 25.00%. Recall is the ratio of correctly predicted positive observations to all the observations in the actual positive class.
The F1 Score, the harmonic mean of Precision and Recall, is 28.57%. It provides a single metric that balances both Precision and Recall, which is especially useful when the class distribution is imbalanced.
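As a minimal sketch of how the logistic function in the Logistic Regression description turns a linear score into a probability, consider the following. The weights, bias, feature values and the 0.5 decision threshold are hypothetical illustrations, not parameters from the fitted model.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for large negative z.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(features, weights, bias):
    """Probability of the positive class: sigmoid of the weighted sum of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients and one hypothetical (already scaled) record.
weights = [0.8, 1.2]
bias = -1.0
features = [1.0, 0.5]

p = predict_proba(features, weights, bias)
label = "PCOS" if p >= 0.5 else "No PCOS"  # 0.5 threshold is an assumption
print(f"probability={p:.3f}, label={label}")
```

Lowering the threshold below 0.5 trades precision for recall, which is one simple lever when missed positive cases are the main concern.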
Random Forest is an ensemble learning technique developed by Leo Breiman and Adele Cutler. It operates by constructing multiple decision trees during training and outputs the mode of the classes (in classification tasks) or the mean prediction (in regression tasks) from all the individual trees. This model is particularly robust against overfitting, making it versatile for various classification and regression tasks. The Random Forest model was also evaluated using the same metrics, resulting in the following:
The model correctly classified 90.00% of the instances, indicating high overall predictive performance.
With a precision of 100.00%, the model achieved perfect positive predictive power, meaning all instances predicted as PCOS were correct, with no false positives.
The recall of 25.00% reveals that only 25% of the actual PCOS cases were correctly identified, suggesting a significant number of false negatives.
The F1 Score of 40.00% reflects the harmonic mean of precision and recall, providing a more comprehensive view of the model's performance, especially in scenarios with imbalanced datasets.
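The four metrics reported for both models all follow from confusion-matrix counts (TP, FP, FN, TN). The source does not give those counts. The sketch below uses one set of counts that reproduces every reported percentage, assuming a test set of 30 records containing 4 PCOS cases; this is an inference from the percentages, not a figure from the study, and the helper name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# ASSUMED counts (30 test records, 4 actual PCOS cases), chosen only because
# they are consistent with the percentages reported in the text.
logistic_regression = classification_metrics(tp=1, fp=2, fn=3, tn=24)
random_forest = classification_metrics(tp=1, fp=0, fn=3, tn=26)

for name, metrics in [("Logistic Regression", logistic_regression),
                      ("Random Forest", random_forest)]:
    print(name, {k: f"{v:.2%}" for k, v in metrics.items()})
```

Under these assumed counts, each model finds only one of the four positive cases, which is why recall stays at 25% even though accuracy is high.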
A graphical user interface (GUI) was developed to allow users to input their clinical data and receive a prediction on the likelihood of having Polycystic Ovary Syndrome (PCOS). The GUI was constructed using Tkinter, a standard Python interface to the Tk GUI toolkit. Through this interface, users can input various clinical and personal information, including age, height, weight, marital status, menstrual cycle regularity, period length, and additional symptoms relevant to PCOS. The input data is then processed by the predictive model, which analyzes the provided information and offers a prediction regarding the user's likelihood of having PCOS.
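A Tkinter form of the kind described above might be structured roughly as follows. This is a sketch under assumptions, not the authors' interface: the field names, the units in `height_cm` and `weight_kg`, the yes/no handling, and the `placeholder_predict` stand-in for the trained model are all hypothetical.

```python
def parse_form(raw):
    """Convert raw text-field strings to typed values; raises ValueError on bad input."""
    return {
        "age": int(raw["age"]),
        "height_cm": float(raw["height_cm"]),
        "weight_kg": float(raw["weight_kg"]),
        "regular_cycle": raw["regular_cycle"].strip().lower() in ("yes", "y"),
    }


def placeholder_predict(record):
    """Stand-in for the trained classifier; returns a fixed string, not a real prediction."""
    return "model output goes here"


def launch_gui(predict=placeholder_predict):
    """Open a small form, pass the parsed inputs to `predict`, and show the result."""
    import tkinter as tk  # imported here so the helpers above work without a display

    root = tk.Tk()
    root.title("PCOS prediction (sketch)")
    fields = ["age", "height_cm", "weight_kg", "regular_cycle"]
    entries = {}
    for row, name in enumerate(fields):
        tk.Label(root, text=name).grid(row=row, column=0, sticky="w")
        entries[name] = tk.Entry(root)
        entries[name].grid(row=row, column=1)

    result = tk.Label(root, text="")
    result.grid(row=len(fields) + 1, column=0, columnspan=2)

    def on_submit():
        try:
            record = parse_form({n: e.get() for n, e in entries.items()})
            result.config(text=f"Prediction: {predict(record)}")
        except ValueError:
            result.config(text="Please enter valid values.")

    tk.Button(root, text="Predict", command=on_submit).grid(
        row=len(fields), column=0, columnspan=2)
    root.mainloop()


# launch_gui()  # uncomment to open the window on a machine with a display
```

Keeping input parsing in a separate function from the widgets makes validation testable without opening a window.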
The dataset utilized in this study encompasses a range of clinical and personal features that are pertinent to the diagnosis of PCOS. These features include age, height, menstrual cycle regularity, and other health indicators that are clinically relevant for the prediction of PCOS. Initial preprocessing of the dataset was conducted to address missing values, encode categorical variables, and scale numerical features, ensuring that the data was in a suitable format for analysis and model training. The dataset comprises the following features:
To prepare the dataset for model training and ensure its quality, several preprocessing steps were undertaken. These steps were crucial for enhancing the model's predictive performance and accuracy. The following preprocessing steps were performed:
Missing values were addressed either by removing records with incomplete data or by imputing missing values based on statistical methods, such as using the mean, median, or mode, depending on the distribution and nature of the feature.
Categorical variables were transformed into numerical format using techniques such as label encoding or one-hot encoding. This transformation was necessary to convert the non-numerical data into a format suitable for machine learning algorithms.
Numerical features were scaled to a standardized range to ensure uniformity in the dataset. This step was particularly important for machine learning models sensitive to the magnitude of feature values, such as Logistic Regression and Support Vector Machines.
The preprocessed dataset was split into training and testing subsets to evaluate the model's performance. The training set was used to train the machine learning models, while the testing set was reserved for evaluating their predictive accuracy and generalizability.
Due to the potential class imbalance in the dataset, techniques such as oversampling the minority class or undersampling the majority class were considered. Balancing the dataset was essential to prevent bias in model predictions and to ensure that the models performed well across different classes.
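The missing-value, splitting and balancing steps above can be sketched with the standard library alone. The source does not specify the imputation statistic, the split ratio, whether the split was stratified, or which balancing technique was finally applied. Median imputation, a stratified 75/25 split and random oversampling are therefore assumptions here, and all data values and function names are hypothetical.

```python
import random
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx) with a similar class ratio in both parts."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def oversample_minority(indices, labels, rng):
    """Duplicate randomly chosen minority rows until all classes are equally frequent."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced += idx + [rng.choice(idx) for _ in range(target - len(idx))]
    return balanced


rng = random.Random(0)  # fixed seed so the sketch is repeatable

heights = [158.0, None, 162.0, 170.0]  # hypothetical column with one gap
heights_filled = impute_median(heights)

labels = [1] * 4 + [0] * 16  # hypothetical: 4 PCOS, 16 non-PCOS records
train_idx, test_idx = stratified_split(labels, test_fraction=0.25, rng=rng)

# Oversample the training part only, so the test set keeps its natural ratio.
balanced_train = oversample_minority(train_idx, labels, rng)
print(len(train_idx), len(test_idx), len(balanced_train))
```

Only oversampling is shown; undersampling would instead drop randomly chosen majority-class rows from the training indices.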
The Random Forest model demonstrated superior performance, achieving an accuracy of 90%, compared to the Logistic Regression model, which had an accuracy of 83.33%. Notably, the precision of the Random Forest model was 100%, indicating that all predicted positive cases were true positives, with no false positives identified. This suggests a high level of specificity in identifying cases of PCOS. However, both models exhibited a low recall rate of 25%, highlighting a significant limitation in their ability to correctly identify all actual cases of PCOS. This low recall rate suggests that a substantial number of actual PCOS cases were missed by both models. When the model classifies the input values as positive, it outputs a PCOS result; otherwise, it outputs NO PCOS.
Although the Random Forest model outperformed the Logistic Regression model in terms of accuracy and precision, both models exhibited a low recall rate, underscoring a need for further optimization to enhance sensitivity. This finding indicates that while the Random Forest model predicts positive cases with high precision, both models struggle to capture all true positive cases, which is critical for clinical applications. Future work should focus on improving model sensitivity through feature engineering to identify the most relevant predictors, hyperparameter tuning to optimize model performance, and the incorporation of additional data to provide a more comprehensive understanding of the predictors of PCOS. Additionally, other ensemble methods or deep learning approaches could further enhance model performance and provide more robust predictions.
The authors express their gratitude to the Department of Microbiology and CURIE-AI for support in data processing and student fellowships. We also thank the university students for providing valuable data on Polycystic Ovary Syndrome (PCOS).
No conflict of interest.
Copyright: © 2024 Potireddy Suvarnalatha Devi., et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.