When: May 20, 12:00

Speaker: Uri Shalit (Technion)
Join Zoom Meeting: link
Recording of the talk: link
Title: Individualized treatment recommendations from observational health data: challenges and proposed best practices
Abstract: One of the most inspiring promises of using machine learning in healthcare is learning how to optimally treat individual patients based on data from past patients. I will discuss the challenges that come up when addressing this task, and why standard machine learning methods can catastrophically fail. I will then propose best-practices based on ideas from causal inference, along with the necessary identification assumptions for learning treatment recommendations. I will present two case studies: one dealing with treatment of chronic disease using data from a large health provider, and one dealing with acute care using data from a university hospital.
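The failure mode the abstract alludes to, and the causal-inference fix, can be illustrated with a toy sketch. Below is a minimal T-learner: fit separate outcome models on the treated and control arms and recommend treatment where the predicted benefit is positive. This is an illustrative example under an assumed ignorability condition, not the method from the talk; all names and the simulated data are hypothetical.

```python
import numpy as np

# Hypothetical sketch of a T-learner treatment policy. It relies on the
# (strong) identification assumption of ignorability: treatment assignment
# is independent of potential outcomes given covariates X.

rng = np.random.default_rng(0)

n, d = 500, 3
X = rng.normal(size=(n, d))
T = rng.integers(0, 2, size=n)              # observed treatment (0/1)
tau = X[:, 0]                               # true individual treatment effect
Y = X @ np.array([1.0, -0.5, 0.2]) + T * tau + 0.1 * rng.normal(size=n)

def fit_ols(X, y):
    """Least-squares fit with an intercept column."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return beta

def predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

# Fit separate outcome models on the treated and control arms (T-learner).
beta1 = fit_ols(X[T == 1], Y[T == 1])
beta0 = fit_ols(X[T == 0], Y[T == 0])

# Recommend treatment where the estimated individual effect is positive.
cate_hat = predict(beta1, X) - predict(beta0, X)
recommend = cate_hat > 0

# Sanity check: recommendations should largely agree with the true effect sign.
agreement = np.mean(recommend == (tau > 0))
print(f"agreement with true effect sign: {agreement:.2f}")
```

If treatment assignment instead depended on unobserved confounders, the two arm-specific models would be fit on systematically different populations and the recommendations could fail badly, which is the kind of catastrophe the abstract warns about.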

When: May 27, 15:00

Speaker: Haim Bar (University of Connecticut)
Join Zoom Meeting: link
Title: Large-P Variable Selection in Two-Stage Models (link to the relevant paper)
Abstract: Model selection in the large-P, small-N scenario is discussed in the framework of two-stage models. Two specific models are considered: two-stage least squares (TSLS) involving instrumental variables (IVs), and mediation models. In both cases the number of putative variables (either instruments or mediators) is large, but only a small subset should be included in the two-stage model. We use two variable selection methods designed for high-dimensional settings and compare their performance in terms of their ability to find the true IVs or mediators. Our approach is demonstrated via simulations and case studies.
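To fix ideas, here is a toy sketch of the large-P TSLS setting: many candidate instruments, a naive screening step to select a few, then the two stages. The screening rule here (marginal correlation with the endogenous regressor) is a deliberately simple stand-in, not one of the paper's variable selection methods; the data and names are hypothetical.

```python
import numpy as np

# Illustrative large-P TSLS sketch. Only 5 of 200 candidate instruments are
# real; a naive correlation screen selects instruments before the two stages.

rng = np.random.default_rng(1)
n, p = 1000, 200                       # large P: many candidate instruments
Z = rng.normal(size=(n, p))
u = rng.normal(size=n)                 # unobserved confounder
x = Z[:, :5] @ np.full(5, 1.0) + u + rng.normal(size=n)   # endogenous regressor
y = 2.0 * x + u + rng.normal(size=n)   # true causal effect = 2

# Naive OLS is biased because u affects both x and y.
beta_ols = (x @ y) / (x @ x)

# Screening: keep the instruments most correlated with x (hypothetical rule).
corr = np.abs([np.corrcoef(Z[:, j], x)[0, 1] for j in range(p)])
selected = np.argsort(corr)[-5:]

# Stage 1: project x onto the selected instruments.
Zs = Z[:, selected]
x_hat = Zs @ np.linalg.lstsq(Zs, x, rcond=None)[0]

# Stage 2: regress y on the first-stage fitted values.
beta_tsls = (x_hat @ y) / (x_hat @ x_hat)

print(f"OLS estimate:  {beta_ols:.2f}")
print(f"TSLS estimate: {beta_tsls:.2f}")
```

The point of the sketch is the comparison: OLS is pulled away from the true effect by the confounder, while TSLS on well-chosen instruments recovers it, which is why finding the true IVs among many putative ones matters.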

When: June 3, 16:00

Speaker: Ryan Sun (MD Anderson Cancer Center)
Join Zoom Meeting: link
Title: Set-based inference for analysis of genetic compendiums (link to the relevant paper)
Abstract: The increasing popularity of biobanks and other genetic compendiums has introduced exciting opportunities to extract knowledge from datasets combining information from a variety of genetic, genomic, environmental, and clinical sources. To manage the large number of hypothesis tests that are often performed with such data, set-based inference strategies have emerged as a popular alternative to testing individual features. However, existing tests may struggle to provide adequate power for detecting sparse and weak signals, especially in the presence of highly correlated features. In this talk, we discuss new, powerful set-based tests, and we additionally cover certain statistical considerations of increasing consequence in the biobank era. In particular, we examine the differences between testing sets of explanatory features and sets of outcome features, and we consider strategies for situations where the global null is not the null hypothesis of interest.
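For readers new to set-based inference, the sketch below contrasts two classic ways of combining per-feature association z-scores into a single test of the global null: a burden-style sum (powerful when effects share a sign) and a quadratic, SKAT-like sum of squares (robust to mixed signs, and often better for sparse signals). These are textbook baselines, not the new tests from the talk, and the sketch assumes independent features; the correlated-feature setting the abstract highlights is exactly where such baselines lose power.

```python
import numpy as np
from scipy import stats

# Toy set-based tests on simulated per-feature z-scores.
# Assumes independent features (unrealistic for biobank data).

rng = np.random.default_rng(2)
m = 50                          # number of features in the set
z = rng.normal(size=m)          # null z-scores
z[:3] += 2.5                    # sparse, weak signals in 3 of 50 features

# Burden-style test: standardized sum of z-scores.
burden = z.sum() / np.sqrt(m)
p_burden = 2 * stats.norm.sf(abs(burden))

# Quadratic (SKAT-like) test: sum of squared z-scores, chi-square under null.
quad = (z ** 2).sum()
p_quad = stats.chi2.sf(quad, df=m)

print(f"burden p-value:    {p_burden:.3f}")
print(f"quadratic p-value: {p_quad:.3f}")
```

With only 3 weak signals among 50 features, the burden sum is diluted by the 47 null features, while the quadratic statistic is less affected; this sparse-and-weak regime is the one the abstract singles out as challenging.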

When: June 10, 16:00

Speaker: Omer Weissbrod (Harvard T.H. Chan School of Public Health)
Join Zoom Meeting: link
Title: Predicting genetic disease risk across diverse human populations (part of this work is available online as a preprint)
Abstract: Polygenic risk is a measure of our genetic predisposition to complex diseases like diabetes or schizophrenia. Predicting polygenic risk is a key goal of genetics research, as it would enable identifying individuals at risk years in advance. However, complex genetic diseases are determined by thousands of genes and millions of genetic variants, leading to computational and statistical challenges. These problems are compounded when predicting risk for non-European individuals, for whom relatively little data is available, owing to differences in genotype distributions and in genetic disease architectures. Hence, existing use of polygenic risk predictions could exacerbate healthcare disparities.
I will present a Bayesian framework to predict polygenic risk for arbitrary ancestries by combining three ideas: (1) identifying the most clinically important variants among millions of tightly correlated variants; (2) prioritizing biologically important variants using hundreds of biological annotations from external datasets; and (3) pooling data across multiple genetic ancestries to find shared but heterogeneous genetic patterns.
We evaluated our approach using millions of genetic variants across over 400,000 individuals from multiple genetic ancestries. Our approach improved polygenic risk prediction across 15 genetically uncorrelated traits for African-ancestry individuals by over 70% on average relative to the state of the art.
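Once per-variant effect estimates are in hand, applying a polygenic risk score is mechanically simple: a weighted sum of allele counts per individual. The sketch below shows that final step with simulated stand-in weights; the Bayesian machinery described in the abstract is what produces good weights, and nothing here reflects that method.

```python
import numpy as np

# Minimal polygenic risk score (PRS) computation with simulated data.
# Effect sizes (weights) are hypothetical stand-ins for posterior estimates.

rng = np.random.default_rng(3)
n_individuals, n_variants = 1000, 5000

# Genotypes: 0, 1, or 2 copies of the effect allele per variant.
G = rng.integers(0, 3, size=(n_individuals, n_variants))

# Sparse effect sizes, mimicking a small "clinically important" subset.
beta = np.zeros(n_variants)
causal = rng.choice(n_variants, size=50, replace=False)
beta[causal] = rng.normal(scale=0.1, size=50)

# The score itself is a matrix-vector product: one weighted sum per person.
prs = G @ beta

# Standardize so scores are comparable within the cohort.
prs_z = (prs - prs.mean()) / prs.std()
print(f"top-decile cutoff (z): {np.quantile(prs_z, 0.9):.2f}")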

When: June 17, 16:00

Speaker: Chuan Hong (Harvard Medical School)

Join Zoom Meeting: link

Title: Learning with Noisy Labels in Electronic Health Records (link to the relevant paper)

Abstract: With the increasing availability of rich electronic health records (EHR) data for research, a critical step in realizing its translational potential is to accurately and efficiently classify disease outcomes for individual patients. A wide range of classification algorithms have been developed and validated for EHR disease phenotypes. For a given disease of interest, the algorithms are typically trained and/or validated using a small set of gold standard labels manually annotated via medical chart review by domain experts. These annotations are unfortunately subject to misclassification, particularly for conditions that are either episodic or not clinically distinctive, due to insufficient documentation or human error. Ignoring labeling error may lead to biases in both the estimated classification model and its accuracy for predicting the true underlying phenotype status. These biases can also lead to bias and loss of power in downstream analyses such as genomic association studies based on the classified phenotype status. In this paper, we propose a set of robust composite likelihood approaches to several related inference problems for both classification algorithms and downstream association analyses under a semi-supervised learning setting, where a small number of noisy labels and a large set of unlabeled observations on predictive features and possibly genetic data are available. We demonstrate that under mild regularity conditions the proposed estimators are consistent and asymptotically normal, and that the asymptotic variance of the proposed estimators is always smaller than that of the supervised counterparts under correct model specification. The proposed method is evaluated through extensive simulation studies and illustrated with a real EHR study for bipolar disorder.
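The bias from ignoring label noise, and the benefit of modeling it, can be seen in a small simulation. The sketch below fits a logistic model two ways: naively on noisy labels, and with a likelihood that accounts for known label-flip rates. This is a simplified illustration, not the paper's composite-likelihood estimator; the flip rates are assumed known here, whereas a real method must handle them far more carefully.

```python
import numpy as np
from scipy.optimize import minimize

# Logistic regression under label noise: naive fit vs. a noise-adjusted
# likelihood. Flip rates a, b are assumed known for this illustration.

rng = np.random.default_rng(4)
n, d = 2000, 2
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -1.0])
p_true = 1 / (1 + np.exp(-(X @ w_true)))
y_clean = rng.binomial(1, p_true)

a, b = 0.1, 0.2       # P(0 -> 1) and P(1 -> 0) label-flip rates
flip = np.where(y_clean == 1, rng.random(n) < b, rng.random(n) < a)
y_noisy = np.where(flip, 1 - y_clean, y_clean)

def nll(w, y, adjust):
    """Negative log-likelihood; if adjust, model the observed noisy label."""
    p = 1 / (1 + np.exp(-(X @ w)))
    if adjust:
        p = (1 - b) * p + a * (1 - p)   # P(observed label = 1 | x)
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w_naive = minimize(nll, np.zeros(d), args=(y_noisy, False)).x
w_adj = minimize(nll, np.zeros(d), args=(y_noisy, True)).x

print("naive:   ", np.round(w_naive, 2))
print("adjusted:", np.round(w_adj, 2))
```

The naive fit is attenuated toward zero because the flipped labels look like irreducible noise, while the adjusted likelihood recovers coefficients close to the truth; the biases in downstream association analyses that the abstract describes arise from exactly this kind of attenuation.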