Areas of Expertise (5)
Bioinformatics
Data Visualization
Machine Learning
Data Mining
Health Informatics
Biography
Kristin P. Bennett is the Associate Director of the Institute for Data Exploration and Application and a Professor in the Mathematical Sciences and Computer Science Departments at Rensselaer Polytechnic Institute. Her research focuses on extracting information from data using novel predictive or descriptive mathematical models and data visualizations, and on applying these methods to support decision making and to accelerate discovery in science, engineering, public health, and business. She has 30 years of experience and over 100 publications. As an active member of the machine learning, data mining, and operations research communities, she has served as a study section member for the NIH National Library of Medicine and as an associate or guest editor for ACM Transactions on Knowledge Discovery from Data, SIAM Journal on Optimization, Naval Research Logistics, Machine Learning Journal, IEEE Transactions on Neural Networks, and Journal of Machine Learning Research. She served as program chair of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. She has led many data-driven health-related projects based on machine learning and data mining. These include the award-winning “MortalityMinder” app for understanding the social determinants of mortality; the NIH-sponsored “TB-Insight” project, which uses bioinformatics to track and control tuberculosis; a project on emergency department revisits for Albany Medical Center; a project on high-needs patient management for Capital District Physicians Health Plan; the privacy-preserving synthetic health data project for United Health Foundation; and a subpopulation risk analysis method for IBM. She also led projects on sensor-based anomaly detection for industrial partners including GLOBALFOUNDRIES and GE Renewables. She founded and directs the Data Interdisciplinary Challenges Intelligent Technology Exploration Laboratory (Data INCITE Lab), which is pioneering highly effective new approaches for data analytics in undergraduate education with sponsorship from NIH, United Health Foundation, and industrial partners. In the Data INCITE Lab, undergraduate and graduate students tackle open applied data analytics problems contributed by industry, foundations, and researchers.
Education (3)
University of Wisconsin, Madison: Ph.D., Computer Sciences 1993
University of Wisconsin, Madison: M.S. 1989
University of Puget Sound: B.S. 1985
Media Appearances (3)
How this Albany health insurer will start using artificial intelligence
Albany Business Review print
2019-07-02
CDPHP is working with Kristin Bennett at Rensselaer Polytechnic Institute to use artificial intelligence to figure out which patients could benefit from more personalized care.
CDPHP researching better health care through AI
The Daily Gazette print
2019-07-05
Collaboration with Kristin Bennett at Rensselaer Polytechnic Institute seeks to help policy holders with greatest needs.
We got an IDEA, actually we got lots of ideas – Part II @RPI
Data Science Imposters
In this episode, Dr. Bennett takes us back to school and teaches us a few things about machine learning, artificial intelligence, data analytics, and visualization. Along the way, we discuss how to incorporate teaching of these topics in colleges and high schools and some of the moral issues that may arise with artificial intelligence.
Articles (3)
Identifying Windows of Susceptibility by Temporal Gene Analysis
Scientific Reports
Kristin P Bennett, Elisabeth M Brown, Hannah De los Santos, Matthew Poegel, Thomas R Kiehl, Evan W Patton, Spencer Norris, Sally Temple, John Erickson, Deborah L McGuinness, Nathan C Boles
2019 Increased understanding of developmental disorders of the brain has shown that genetic mutations, environmental toxins and biological insults typically act during developmental windows of susceptibility. Identifying these vulnerable periods is a necessary and vital step for safeguarding women and their fetuses against disease causing agents during pregnancy and for developing timely interventions and treatments for neurodevelopmental disorders. We analyzed developmental time-course gene expression data derived from human pluripotent stem cells, with disease association, pathway, and protein interaction databases to identify windows of disease susceptibility during development and the time periods for productive interventions. The results are displayed as interactive Susceptibility Windows Ontological Transcriptome (SWOT) Clocks illustrating disease susceptibility over developmental time. Using this method, we determine the likely windows of susceptibility for multiple neurological disorders using known disease associated genes and genes derived from RNA-sequencing studies including autism spectrum disorder, schizophrenia, and Zika virus induced microcephaly. SWOT clocks provide a valuable tool for integrating data from multiple databases in a developmental context with data generated from next-generation sequencing to help identify windows of susceptibility.
Biases in Feature Selection With Missing Data
Neurocomputing
Borja Seijo-Pardo, Amparo Alonso-Betanzos, Kristin P Bennett, Verónica Bolón-Canedo, Julie Josse, Mehreen Saeed, Isabelle Guyon
2019 Feature selection is of great importance for two possible scenarios: (1) prediction, i.e., improving (or minimally degrading) the predictions of a target variable while discarding redundant or uninformative features and (2) discovery, i.e., identifying features that are truly dependent on the target and may be genuine causes to be determined in experimental verifications (for example for the task of drug target discovery in genomics). In both cases, if variables have a large number of missing values, imputing them may lead to false positives; features that are not associated with the target become dependent as a result of imputation. In the first scenario, this may not harm prediction, but in the second one, it will erroneously select irrelevant features. In this paper, we study the risk/benefit trade-off of missing value imputation in the context of feature selection, using causal graphs to characterize when structural bias arises. Our aim is also to investigate situations in which imputing missing values may be beneficial to reduce false negatives, a situation that might arise when there is a dependency between feature and target, but the dependency is below the significance level when only complete cases are considered. However, the benefits of reducing false negatives must be balanced against the increased number of false positives. In the case of binary target variable and continuous features, the t-test is often used for univariate feature selection. In this paper, we also introduce a de-biased version of the t-test allowing us to reap the benefits of imputation, while not incurring the penalty of increasing the number of false positives.
A Precision Environment-Wide Association Study of Hypertension via Supervised Cadre Models
IEEE Journal of Biomedical and Health Informatics
Alexander New, Kristin P. Bennett
We consider the problem in precision health of grouping people into subpopulations based on their degree of vulnerability to a risk factor. These subpopulations cannot be discovered with traditional clustering techniques because their quality is evaluated with a supervised metric: the ease of modeling a response variable for observations within them. Instead, we apply the more appropriate supervised cadre model (SCM). We extend the SCM formalism so that it may be applied to multivariate regression and binary classification problems and develop a way to use conditional entropy to assess the confidence in the process by which a subject is assigned their cadre. Using the SCM, we generalize the environment-wide association study (EWAS) to be able to model heterogeneity in population risk. In our EWAS, we consider more than two hundred environmental exposure factors and find their association with diastolic blood pressure, systolic blood pressure, and hypertension. This requires adapting the SCM to be applicable to data generated by a complex survey design. After correcting for false positives, we found 25 exposure variables that had a significant association with at least one of our response variables. Eight of these were significant for a discovered subpopulation but not for the overall population. Some of these associations have been identified by previous researchers, while others appear to be novel. We examine discovered subpopulations in detail, finding that they are interpretable and suggestive of further research questions.