October 7th - Jessica Li, University of California Los Angeles
Dissecting Double Dipping in Statistical Tests After Clustering
Abstract: Motivated by the widespread use of clustering followed by statistical testing in single-cell and spatial omics data analysis, this talk will address the issue of double dipping. We aim to explore whether double dipping is a significant concern and to investigate how various data-splitting and data-simulation strategies can mitigate the resulting inflation of the false discovery rate (FDR). We will also discuss different perspectives on whether the inference should be conditional on the clustering step. In particular, we will highlight the influence of feature correlations on FDR inflation. Through simulation and real-data examples, we will demonstrate how our simulation-based strategy for correcting double dipping can lead to more reliable and insightful discoveries.
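The core problem can be seen in a toy example (not the speaker's method): when data with no true cluster structure are clustered and the same data are then tested for cluster differences, test statistics are badly inflated. The sketch below assumes a minimal 2-means clustering and per-feature Welch t-tests, both written from scratch for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Null data: 200 samples x 50 features, with NO true cluster structure.
X = rng.normal(size=(200, 50))

def two_means(data, n_iter=25):
    """Minimal 2-means clustering (illustration only)."""
    centers = data[:2].copy()
    labels = np.zeros(len(data), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = data[labels == k].mean(axis=0)
    return labels

def welch_t(data, labels):
    """Per-feature two-sample Welch t statistics between the two clusters."""
    a, b = data[labels == 0], data[labels == 1]
    se = np.sqrt(a.var(ddof=1, axis=0) / len(a) + b.var(ddof=1, axis=0) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se

# Double dipping: the SAME data define the clusters and are then tested.
# Although no clusters exist, the average |t| far exceeds the null
# expectation E|t| of roughly 0.8, which is what inflates the FDR.
t_dd = welch_t(X, two_means(X))
print(np.mean(np.abs(t_dd)))
```

Naively comparing these t statistics to a standard t reference would declare many features "differentially expressed" under a global null; the data-splitting and data-simulation strategies discussed in the talk are ways of restoring a valid reference distribution.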
October 14th - Hui Zou, University of Minnesota
Box-Cox Regression Revisited
Abstract: Box-Cox regression (1964, JRSSB) is a must-teach topic in any applied regression course. However, the Box-Cox model has received mixed reviews in the statistical community. While some enthusiastically embraced the idea, others raised serious concerns. For example, Bickel and Doksum (1981, JASA) analyzed the MLE for the Box-Cox model and showed that the unknown power transformation causes high instability. Despite these criticisms, we believe the core concept of the Box-Cox model is valid and powerful for developing new statistical tools. In this work we consider using a nonparametric transformation in place of the parametric Box-Cox transformation, yielding practically more useful models. Moreover, we focus on the high-dimensional application of the new Box-Cox model, aiming to improve the standard practice of using penalized least squares. We develop a composite likelihood inference framework, which avoids the need to estimate the transformation function, for estimating the regression coefficients and for testing linear hypotheses. We will use a supermarket data example to illustrate the new method and theory.
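For reference, the parametric transformation family from Box and Cox (1964) that the talk proposes to replace with a nonparametric transformation is

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
  \log y, & \lambda = 0,
\end{cases}
\qquad y > 0,
```

with the working model $y^{(\lambda)} = x^\top \beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$. The instability analyzed by Bickel and Doksum arises because $\lambda$ must be estimated jointly with $\beta$, and the interpretation and scale of $\beta$ change with the estimated $\lambda$.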
October 21st - Yimin Xiao, Michigan State University
Multivariate Gaussian Random Fields and their Statistical Analysis
Abstract: In recent years, a number of new classes of multivariate random fields have been constructed using covariance matrices, variogram matrices, the convolution method, spectral representations, or systems of stochastic partial differential equations (SPDEs), and have been applied to modeling multivariate spatial data. However, the theory of parameter estimation, prediction, and extreme values for multivariate random fields is still under-developed, while the range of their applications is growing constantly. In this talk, we provide an overview of several classes of multivariate Gaussian random fields, including multivariate Matérn Gaussian fields, operator fractional Brownian motion, and matrix-valued Gaussian random fields, together with some recent results on estimation and prediction for bivariate Gaussian random fields. These results explicitly illustrate the effects of the dependence structures among the coordinate processes on the statistical analysis of multivariate Gaussian random fields.
October 28th - Dongseok Choi, Oregon Health & Science University
Deep Learning Applications for Two Medical Imaging Datasets
Abstract: The convolutional neural network (CNN) is a deep learning architecture specialized for imaging data. Training a CNN generally requires big data, yet in typical medical studies sample sizes are at most in the hundreds. Would a CNN work for such small sample sizes? This talk presents the results of applying CNNs to two small-to-medium medical studies using R and related packages. In the first study, a CNN was trained to classify X-ray images for diffuse idiopathic skeletal hyperostosis (DISH). The second study used hybrid deep learning to predict glaucoma from optical coherence tomography (OCT) images and clinical data.
November 4th - Jun Zhu, University of California Los Angeles
Spatial Cluster Detection and Change-set Analysis
Abstract: For the purpose of grouping spatial units on a lattice with similar characteristics within a group but with distinction among groups, we consider spatial cluster detection and change-set analysis. While the existing methods for spatial cluster detection are largely based on hypothesis testing or Bayesian models, we consider an alternative frequentist approach using regularization. In addition, we develop a change-set method for two-dimensional spatial data that permits more abrupt changes in space and irregular change sets. A quasi-likelihood approach is taken for statistical inference that accounts for covariates and spatial correlation. Finite-sample properties are investigated in a simulation study, and the methods are applied to analyze county-based poverty rates in the Upper Midwest.
November 18th - Steve Portnoy, Portland State University
Canonical Regression Quantiles: A Regression Quantile Approach to Canonical Correlations
Abstract: Canonical correlations are often used to relate two sets of variables modelled as multivariate vectors X and Y. The regression approach seeks linear combinations of the X's that predict linear combinations of the Y's in some optimal sense. The classical approach uses the covariance matrix of (X, Y) and so relies heavily on normality assumptions and implicit least squares methods. Thus, canonical correlations lack robustness, are unable to address heterogeneity, and are unable to disaggregate responses by quantile effects. An alternative canonical regression quantile (CanRQ) approach seeks the linear combination of explanatory variables that best predicts the best linear combination of response variables under a quantile loss function. To apply this approach more generally, subsequent linear combinations are chosen to explain what earlier CanRQ components failed to explain. While numerous technical issues need to be addressed, the major methodological issue concerns directionality: a quantile analysis requires that the notion of a large or small response be well-defined. To address this issue, at least one response coefficient will be assumed to be non-negative. CanRQ results can be quite different from those of classical canonical correlation analysis, and can offer the kinds of improvements offered by regression quantiles in linear models. In theory, inference can be based on the n-choose-m bootstrap. Some new simulations and examples are quite promising.
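The quantile loss function underlying regression quantiles (and hence CanRQ) is the standard Koenker-Bassett check loss; the details of how CanRQ combines it with canonical directions are the speaker's. A minimal sketch of the loss and the property it buys, assuming a simulated normal sample and a grid search written purely for illustration:

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check (pinball) loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    u = np.asarray(u, dtype=float)
    return np.where(u < 0, (tau - 1.0) * u, tau * u)

rng = np.random.default_rng(1)
y = rng.normal(size=2001)

# The constant c minimizing the mean check loss is the empirical
# tau-quantile of y -- the property regression quantiles build on:
# minimizing rho_tau over linear predictors targets conditional quantiles.
tau = 0.75
grid = np.linspace(-3.0, 3.0, 1201)
risk = np.array([check_loss(y - c, tau).mean() for c in grid])
c_hat = grid[risk.argmin()]
print(c_hat, np.quantile(y, tau))
```

Replacing squared error with this asymmetric loss is what lets a quantile analysis describe the upper or lower part of the response distribution; it is also why directionality matters, since "large response" must be well-defined before a tau-quantile of it makes sense.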
November 25th - Eric Van Dusen, University of California Berkeley
Abstract: Jupyter Notebooks have revolutionized how we teach data science, providing an interactive and accessible platform for learning computational tools and analytical reasoning. In this talk, Eric Van Dusen will explore how Jupyter Notebooks have been integral to the development of UC Berkeley’s Data Science Undergraduate Studies programs. Key initiatives include the interdisciplinary Data 8 course, which anchors the Data Science major and minor, the Data Science Modules Program that integrates data literacy into diverse disciplines, and applications in teaching Data Science and Economics. The talk will showcase strategies for engaging students across different levels, fostering equity in technical education, and preparing them to apply data science to real-world problems. Practical examples and lessons learned will highlight how Jupyter Notebooks empower educators and students alike.