Fall 2020 Colloquia
- 11/13/20: Xihong Lin (Biostatistics, Harvard T.H. Chan School of Public Health); Time: 12:00 PM - 1:00 PM via Zoom
- 11/20/20: Daniel Schaid (Biostatistics, Mayo Clinic); Time: 10:00 AM - 11:30 AM via Zoom
Title: Predicting Disease Risk from Genomics Data
Abstract: Accurate disease risk prediction based on genetic and other factors can lead to more effective disease screening, prevention, and treatment strategies. Despite the identification of thousands of disease-associated genetic variants through genome-wide association studies in the past 15 years, the performance of genetic risk prediction remains moderate or poor for most diseases, largely because of the challenges in both identifying all the functionally relevant variants and accurately estimating their effect sizes. Moreover, as most genetic studies have been conducted in individuals of European ancestry, it is even more challenging to develop accurate prediction models in other populations. Furthermore, many studies only provide summary statistics instead of individual-level genotype and phenotype data. In this presentation, we will discuss a number of statistical methods that have been developed to address these issues through jointly estimating effect sizes (both across genetic markers and across populations), modeling marker dependency, incorporating functional annotations, and leveraging genetic correlations among different diseases. We will demonstrate the utility of these methods through their applications to a number of complex diseases/traits in large population cohorts, e.g., the UK Biobank data. This is joint work with Wei Jiang, Yiming Hu, Yixuan Ye, Geyu Zhou, Qiongshi Lu, and others.
Title: Three Rs — Reliability, Replicability, Reproducibility: the interplay between statistical science and data science
Abstract: The current pandemic has brought into sharp relief the essential role of data in nearly all aspects of science, government, and public health. But data is useless without explanation and interpretation, and statistical science has a long history and rich traditions of providing explanation and interpretation. However, statistical reasoning is often not well-understood, and misuse of statistical arguments has contributed to confusion over the three R’s in the title. In this talk, Reid describes how data science and statistical science together can provide a robust framework for extracting insights from data reliably, and thus contribute to both replicability and reproducibility. This is illustrated with a selection of examples from recent news articles, along with some discussion on the role of the theory of inference in this framework.
Title: Conditional Calibration for False Discovery Rate Control Under Dependence
Abstract: We introduce a new class of methods for finite-sample false discovery rate (FDR) control in multiple testing problems with dependent test statistics where the dependence is fully or partially known. Our approach separately calibrates a data-dependent p-value rejection threshold for each hypothesis, relaxing or tightening the threshold as appropriate to target exact FDR control. In addition to our general framework, we propose a concrete algorithm, the dependence-adjusted Benjamini–Hochberg (dBH) procedure, which adaptively thresholds the q-value for each hypothesis. Under positive regression dependence, the dBH procedure uniformly dominates the standard BH procedure, and in general it uniformly dominates the Benjamini–Yekutieli (BY) procedure (also known as BH with log correction). Simulations and real data examples illustrate power gains over competing approaches to FDR control under dependence. This is joint work with Lihua Lei.
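For readers unfamiliar with the two baseline procedures named in the abstract above, the following is a minimal sketch of the standard BH step-up rule and its BY log-corrected variant. It is not the dBH procedure from the talk, which additionally calibrates a threshold per hypothesis. The function name `bh_reject` and its parameters are illustrative assumptions, not part of any published software.

```python
import numpy as np

def bh_reject(pvals, alpha=0.05, by_correction=False):
    """Return a boolean mask of rejected hypotheses (hypothetical helper).

    Standard Benjamini-Hochberg step-up rule: sort the p-values, find the
    largest k with p_(k) <= alpha * k / m, and reject the k smallest.
    With by_correction=True, alpha is first divided by the harmonic sum
    1 + 1/2 + ... + 1/m (the Benjamini-Yekutieli log correction).
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    if by_correction:
        alpha = alpha / np.sum(1.0 / np.arange(1, m + 1))
    order = np.argsort(p)                       # indices of sorted p-values
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])        # last sorted position passing
        reject[order[:k + 1]] = True            # step-up: reject all up to k
    return reject

# Illustrative p-values (made up for this sketch, not from the talk).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bh = int(bh_reject(example_p).sum())
n_by = int(bh_reject(example_p, by_correction=True).sum())
```

On these made-up p-values BY rejects no more hypotheses than BH, which reflects the conservativeness that the abstract says dBH is designed to improve on.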
Title: Nonparametric Estimation of Distributions and Diagnostic Accuracy Based on Group-Tested Results with Differential Misclassification
Abstract: This talk concerns the problem of estimating a continuous distribution in a diseased or nondiseased population when only group-based test results on the disease status are available. The problem is challenging in that individual disease statuses are not observed and testing results are often subject to misclassification, with the further complication that the misclassification may be differential as the group size and the number of diseased individuals in the group vary. We propose a method to construct a nonparametric estimator of the distribution and derive its asymptotic properties.
The performance of the distribution estimator is evaluated under various design considerations concerning group sizes and classification errors. The method is exemplified with data from the National Health and Nutrition Examination Survey (NHANES) study to estimate the distribution and diagnostic accuracy of C-reactive protein in blood samples in predicting chlamydia incidence.
Title: Integrative Methods for Biobank-Scale Studies
Abstract: Recent breakthroughs in cost-effective genotyping have allowed the creation of ultra-large biobanks that link genetic data of millions of patients with a multitude of phenotypic measurements (usually curated from electronic health records). The drastic increase in the number of individuals routinely analyzed in genomic studies has enabled novel statistical methods that employ fewer assumptions in estimating key parameters such as heritability explained by genomic variants. I will present methods showcasing how SNP-heritability can be estimated accurately and efficiently, both at genome-wide scale and at particular regions in the genome.
Title: Efficient Integration of EHR and Other Healthcare Datasets
Abstract: The growing availability and variety of healthcare data sources have provided unique opportunities for data integration and evidence synthesis, which can potentially accelerate knowledge discovery and enable better clinical decision making. However, many practical and technical challenges, such as data privacy, high dimensionality, and heterogeneity across different datasets, remain to be addressed. In this talk, I will introduce several methods for effective and efficient integration of electronic health records (EHR) and other healthcare datasets. Specifically, we develop communication-efficient distributed algorithms for jointly analyzing multiple datasets without the need to share patient-level data. Our algorithms do not require iterative communication across sites, and are able to account for heterogeneity across different datasets. We provide theoretical guarantees for the performance of our algorithms, and examples of applying the algorithms in real-world clinical research networks.
Title: PPA: Principal Parcellation Analysis for Human Brain Connectomes of Multiple Human Traits
Abstract: Human brain parcellation plays a fundamental role in neuroimaging. Standard practice parcellates the brain into Regions Of Interest (ROIs) based roughly on anatomical function. However, many different schemes are available involving different numbers and locations of ROIs, and choosing which scheme to use in practice is challenging. We propose a novel tractography-based Principal Parcellation Analysis (PPA), which applies clustering analysis to the fibers' ending points to redefine the parcellation and ultimately predict human traits. Specifically, our PPA eliminates the need to choose ROIs manually, reduces subjectivity, and leads to a substantially different representation of the connectome. We illustrate the proposed approach through applications to HCP data and show that PPA connectomes are able to improve power in predicting a variety of human traits, while dramatically improving parsimony, compared to connectomes based on anatomical parcellation.
Title: Scalable and Consistent Estimation of Random Graph Models With Dependent Edge Variables and Parameter Vectors of Increasing Dimension Using the Pseudolikelihood
Abstract: An important question in statistical network analysis is how to construct models of dependent network data without sacrificing computational scalability and statistical guarantees. In this talk, we demonstrate that scalable estimation of random graph models with dependent edges and parameter vectors of increasing dimension is possible, using maximum pseudolikelihood estimators. On the statistical side, we establish the first consistency results and convergence rates for maximum pseudolikelihood estimators in scenarios where a single observation of dependent random variables is available and the number of parameters increases without bound. The main results make weak assumptions and may be of independent interest. These results help establish the first consistency results and convergence rates for maximum pseudolikelihood estimators of random graph models with dependent edges and parameter vectors of increasing dimension, under weak dependence and smoothness conditions. We showcase consistency results and convergence rates by using generalized β-models with dependent edges and parameter vectors of increasing dimension, in dense- and sparse-graph settings. The talk concludes with a discussion of potential future work and extensions. The primary results presented in this talk assume that the random graph is completely observed. We will discuss how the theoretical developments presented in this talk offer avenues to advance the challenging topic of subgraph-to-graph estimation and inference, which considers estimating a random graph model based only on an observed subgraph.
Title: Brain Connectivity Alteration Detection via Matrix-variate Differential Network Model
Abstract: Brain functional connectivity reveals the synchronization of brain systems through correlations in neurophysiological measures of brain activities. Growing evidence now suggests that the brain connectivity network experiences alterations in the presence of numerous neurological disorders; thus, differential brain network analysis may provide new insights into disease pathologies. The data from neurophysiological measurements are often multi-dimensional and in matrix form, posing a challenge in brain connectivity analysis. Existing graphical model estimation methods either assume a vector normal distribution that in essence requires the columns of the matrix data to be independent, or fail to address the estimation of differential networks across different populations. To tackle these issues, we propose an innovative Matrix-Variate Differential Network (MVDN) model. We exploit the D-trace loss function and a Lasso-type penalty to directly estimate the spatial differential partial correlation matrix, and use an ADMM algorithm to solve the optimization problem. Theoretical and simulation studies demonstrate that MVDN significantly outperforms other state-of-the-art methods in dynamic differential network analysis. We illustrate the method with a functional connectivity analysis of an Attention Deficit Hyperactivity Disorder (ADHD) dataset. The hub nodes and differential interaction patterns identified are consistent with existing experimental studies.