CIS 691 – Medical and Bioinformatics Capstone Presentations

April 20, 2018, 1-4 pm Eberhard Center 414

Identifying Features That Impact Diabetes Mellitus Readmission Rates

Omotayo E. Ajileye

PURPOSE: Diabetes mellitus (DM) is a growing burden in the United States, and readmission of patients with diabetes is a growing concern in healthcare. Common predictors of readmission among the inpatient diabetes cohort include racial and socioeconomic factors, non-diabetes-related co-morbidities, and failure to acknowledge diabetes at discharge. The goal of this paper is to identify possible predictors of early and late readmission of patients with diabetes.

PROCEDURE: The dataset used for this paper was downloaded from the UCI Machine Learning Repository. Several preprocessing steps reduced the features from 55 to 45 and the observations from 101,766 to 71,048 unique observations. The machine learning models used were logistic regression and random forest.

OUTCOME: The predictors identified for early readmission are admission type, discharge disposition, admission source, medical specialty, number of emergency visits, number of inpatient visits, primary, secondary, and tertiary diagnoses, number of diagnoses, insulin intake, and diabetic medication. The predictors identified for late readmission are the same as those for early readmission, plus race, age, time in hospital, number of procedures, number of outpatient visits, A1c results, and the drug acarbose. For the predictors common to early and late readmission, there was no significant difference in predictor strength.

For the logistic regression model, the prediction accuracies for early and late readmission were 87% and 67%, respectively. For the random forest model, the prediction accuracies for early and late readmission were 72% and 61%, respectively. It was concluded that the logistic regression model performed better in predicting both early and late readmission.

IMPACT: This project identifies useful predictors of readmission, which may prove valuable in developing strategies to reduce readmission rates and the cost of care for individuals with diabetes mellitus.

◄►◄►◄►◄►◄►

Identifying and Predicting Areas of Increasing Heart Disease Mortality in the United States

Jacob D. Bourgeois

PURPOSE: Heart disease is currently the leading cause of mortality in the United States, with over 600,000 deaths per year. However, the age-adjusted mortality rate per 100,000 persons for heart disease has been steadily decreasing since the 1950s. It is expected that nationwide mortality rates for the disease should continue to decrease across all age groups, genders, and races. Yet little attention has been given to changes at the US county level, where heart disease mortality may remain constant or even increase in specific regions.

PROCEDURES: Data were collected from the Centers for Disease Control and Prevention’s Heart Disease Maps and Data Sources website for 2006 through 2014. Heart disease deaths were defined according to the codes for diseases of the heart in the tenth revision of the International Classification of Diseases. A multiple linear regression model was developed using mortality rate as the response variable and the observed year and US county as the explanatory variables to determine areas where heart disease mortality continues to rise.
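To illustrate the idea of a yearly trend in county mortality rates, here is a minimal sketch. It is not the project's model: instead of one multiple regression with year and county as explanatory variables, it fits a separate least-squares line per county, which is a simpler stand-in. The county names and rates are hypothetical values invented for the example, and the function name `yearly_trend` is likewise an assumption.

```python
from statistics import mean


def yearly_trend(years, rates):
    """Return the least-squares slope of rate on year (change in rate per year)."""
    year_mean = mean(years)
    rate_mean = mean(rates)
    sxx = sum((y - year_mean) ** 2 for y in years)
    sxy = sum((y - year_mean) * (r - rate_mean) for y, r in zip(years, rates))
    return sxy / sxx


# Hypothetical data: deaths per 100,000 persons for two made-up counties.
years = list(range(2006, 2015))  # 2006 through 2014
county_rates = {
    "County A": [300 + 4 * (y - 2006) for y in years],  # rises 4 per year
    "County B": [350 - 5 * (y - 2006) for y in years],  # falls 5 per year
}

# Slope per county, then keep the counties whose rate is increasing.
slopes = {name: yearly_trend(years, rates) for name, rates in county_rates.items()}
increasing = [name for name, slope in slopes.items() if slope > 0]
```

A positive slope marks a county whose rate is rising over the observation window. A real analysis would also need to test whether each slope differs significantly from zero.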

OUTCOME: 704 counties were identified as having mortality rates that depended heavily on the year of observation. Of those, 322 counties are expected to see an average yearly increase of 4.54 deaths per 100,000 persons.

IMPACT: Despite the dramatic declines in heart disease mortality in the United States at the national level, several areas will continue to see stagnant or increasing death rates. This demonstrates the importance of increasing focus on small-area surveillance to reveal trends that are otherwise masked at the national level. It also gives those areas historical context and clues for understanding their current heart disease mortality rates and disparities in relation to other communities.

◄►◄►◄►◄►◄►

Impact of EHR Usability on Patient-Provider Relationships and Health Outcomes: A Literature Review

Jamie Cole

PURPOSE: Healthcare organizations may reap substantial benefits when transitioning to electronic health records (EHRs), such as decreased healthcare costs and better care. However, severe unintended consequences from the implementation and design of these systems have emerged. Poorly implemented EHR systems may endanger the integrity of clinical or administrative data. That, in turn, can lead to errors that may jeopardize patient safety or decrease quality of care. In addition, poor EHR design can significantly increase the mental workload of clinicians, thereby increasing frustration, reducing user satisfaction, and causing unproductive workarounds.

METHODS AND MATERIALS: A literature review of over 300 sources identified how EHR implementation and design can impact the workload of healthcare providers, patient-provider relationships, and health outcomes. Additional research on the impact of EHRs on patient safety, quality of care, and care coordination was conducted to assess factors contributing to these outcomes.

ANALYSIS: Our systematic literature review included the PubMed, ProQuest, and Google Scholar databases. The search terms included "electronic health records," "EHR usability," "EHR alert fatigue," "EHR workarounds," and "EHR patient safety." As a synonym for EHR, electronic medical records (EMRs) was used interchangeably in the above search terms. Our search focused on case studies and experimental results rather than overview papers. After we consolidated duplicate copies and reviewed all articles for relevance to our goals, our collection included over 300 published articles, which formed the basis for our investigation.

IMPACT: This review contributes to efforts to evaluate the impact of EHR usability on patient-provider relationships and health outcomes.

◄►◄►◄►◄►◄►

Prediction Comparative Study on Cervical Cancer Analysis in Women Using Machine Learning Algorithms

Upendra Khimavath

PURPOSE: The purpose of my capstone project is to predict which age group of women is most likely to develop cervical cancer and to compare machine learning algorithms for cervical cancer analysis.

METHODS AND MATERIALS: Cervical cancer arises from the transformation of normal cells into tumor cells in a multistage process that generally progresses from a pre-cancerous lesion to a malignant tumor. The dataset was obtained from the UCI Machine Learning Repository. It consists of demographic information, habits, and historical medical records, with 858 observations and 36 variables. The models used for this analysis are Firth logistic regression for prediction of cervical cancer, with linear discriminant analysis and k-means clustering as comparison models.

ANALYSIS: Data were cleaned, and Boruta analysis was used for variable selection. Firth logistic regression, which fits the model by penalized maximum likelihood estimation, was used for prediction; this helps account for the large number of zeros in the data. For a one-unit (one-year) increase in age, a patient is 0.756 times less likely to have cervical cancer. Linear discriminant analysis, decision tree, and k-means clustering models were also fitted to compare the accuracy of the models.
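As a rough illustration of how a per-year effect from a logistic regression compounds over several years, the sketch below treats the reported 0.756 as an odds ratio per one-year increase in age. That reading is an assumption, since the abstract does not say whether the figure is an odds ratio; the function names are hypothetical.

```python
import math

# Assumption: the abstract's 0.756 is read here as an odds ratio per year of age.
ODDS_RATIO_PER_YEAR = 0.756


def odds_multiplier(years, odds_ratio=ODDS_RATIO_PER_YEAR):
    """Factor by which the odds change after `years` one-unit increases."""
    return odds_ratio ** years


def log_odds_coefficient(odds_ratio=ODDS_RATIO_PER_YEAR):
    """Logistic regression coefficient (log-odds scale) behind an odds ratio."""
    return math.log(odds_ratio)


two_year_factor = odds_multiplier(2)   # odds after a two-year increase
beta_age = log_odds_coefficient()      # negative, since the odds ratio is below 1
```

Under this reading, odds ratios multiply across years while coefficients add on the log-odds scale, which is why the two-year factor is the square of the one-year figure.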

◄►◄►◄►◄►◄►

Leading Cause of Death in the US - Prediction and Visual Analysis

Pavan K. Komma

PURPOSE: The primary purpose of the project is to provide a visual analysis of deaths related to diseases and other human conditions among adults in the continental United States, and to provide accurate predictions for the future using predictive analytics and machine learning. These goals are accomplished by developing a Shiny-based interactive application that provides visualizations and predictions together for nearly 13 diseases/conditions across all 50 states.

PROCEDURE: A Shiny-based application was developed, and machine learning methods were used to train a model to predict the outcome (as a percentage rate per 100,000 individuals). Visual analysis was done in the form of simple line graphs using the grammar of graphics library (ggplot). This library was chosen for its broad usability and relatively simple visualization methods.

OUTCOME: The project gives people insight into trends in the leading causes of death, so that they can learn about the risks and take the necessary steps toward healthier living. The project also provides trend analysis along with predictions for the near future based on previous trends in each state or region. State health departments can therefore take the necessary actions to improve the well-being of their populations and provide initiatives (e.g., Healthy People 2020) to combat the risk of mortality from these diseases.

IMPACT: This project is closely associated with health informatics, with an emphasis on public health informatics. It helps people learn about the incidence and severity of various diseases. Using predictive analytics and machine learning, it also gives practitioners, leadership personnel, and state health departments an idea of the severity of diseases in the near future, helping them make appropriate decisions and prepare to tackle the issue.

◄►◄►◄►◄►◄►

Comparison of Machine Learning Algorithms on Mental Health Survey Data

Vyshnavi P. Kotla

OBJECTIVE: The objective of the study is to compare machine learning algorithms using mental health survey data and analyze the mental health condition of employees at the workplace (technology companies vs. non-technology companies).

BACKGROUND: In today's rapidly developing, high-technology world, mental illness is a major global problem. Its prevalence is critical and leads to major health outcomes. This study investigated whether the workplace (technology company vs. non-technology company) affects the mental health conditions of employees. Models were built to predict mental health conditions using machine learning techniques, accompanied by an interactive visualization of the data.

METHODS: R software was used to develop models and predict mental health issues in the workplace. Data mining and machine learning techniques were applied, including tree classifiers, recursive partitioning, random forest, bagging, artificial neural networks, naive Bayes, and support vector machines. Tableau was used to develop an interactive, graphical view of the data.

RESULTS: No difference in mental health conditions was found between employees of technology companies and those of non-technology companies using statistical tests, with a P value <0.05 considered significant. Model accuracy ranged from about 75% to 78% across the models.

CONCLUSION: This analysis found that employees' mental health conditions are not affected by their workplace (technology vs. non-technology companies). The machine learning techniques used in this study indicate how accurately mental health issues can be predicted.

◄►◄►◄►◄►◄►

Comparison of Supervised Machine Learning Algorithms by Classifying a Cardiotocography Dataset

Shreya S. Paithankar

PURPOSE: To compare the performance and visualize the results of five different supervised machine learning algorithms by classifying a cardiotocography dataset.

SUBJECTS: Cardiotocography is a technique for recording the fetal heart rate and uterine contractions during pregnancy to examine maternal and fetal health status. The UCI Machine Learning Repository Cardiotocography dataset contains 2126 automatically processed cardiotocograms with 21 attributes. Three expert obstetricians classified the dataset in two ways: by 10-class morphological pattern and by 3-class fetal status. The 10-class classification was attempted in this project.

METHODS AND MATERIALS: Five different classification models based on recursive partitioning, random forest, conditional inference trees, linear discriminant analysis, and naive Bayes were built. A 70-30% data split was used for training and testing. Model performance was compared in terms of accuracy and kappa value. Confusion matrices were converted to heat maps for visual assessment of individual model performance. Visual comparison of the models was done by plotting class mismatch percentages across every model. R statistical programming and Tableau software were used for model building and visualization, respectively.
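For readers unfamiliar with the two comparison metrics, the following minimal sketch shows how accuracy and Cohen's kappa can be computed from a confusion matrix. The 2x2 matrix is a made-up toy example, not the project's 10-class results, and the function name `accuracy_and_kappa` is hypothetical.

```python
def accuracy_and_kappa(confusion):
    """Compute accuracy and Cohen's kappa from a square confusion matrix.

    confusion[i][j] is the number of cases of true class i predicted as class j.
    Kappa is undefined when expected agreement equals 1 (not handled here).
    """
    total = sum(sum(row) for row in confusion)
    # Observed agreement: share of cases on the diagonal (this is the accuracy).
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    # Expected agreement by chance, from the row and column marginals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Toy two-class confusion matrix (hypothetical counts).
toy_matrix = [
    [40, 10],
    [5, 45],
]
toy_accuracy, toy_kappa = accuracy_and_kappa(toy_matrix)
```

Kappa discounts the agreement a classifier would reach by chance given the class frequencies, so it is usually lower than raw accuracy and is more informative when classes are imbalanced.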

RESULTS: The random forest model showed the highest accuracy (86%) and kappa (0.84), whereas the naive Bayes model showed the lowest accuracy (55%) and kappa (0.49). Heat map visualization of individual algorithms and class-wise mismatch percentages of every model aided in the analysis.

CONCLUSION: The random forest algorithm has the potential to classify future cardiotocography datasets. Visualization techniques such as heat maps and mismatch plots should be considered when assessing the performance of a multi-class classifier.
