# Statistics seminar

To be added to the mailing list, please contact Linh Nghiem. All seminars are 2-3pm on Fridays unless otherwise noted.

## Seminars in 2024

Friday, 22 March

**Title:** Adjusted predictions in generalized estimating equations

**Speaker:** Francis Hui, Australian National University

**Abstract:** Generalized estimating equations (GEEs) are a popular regression approach that requires specification of only the first two marginal moments of the data, along with a working correlation matrix to capture the covariation between responses, e.g., temporal correlations within clusters in longitudinal data. The majority of research on, and application of, GEEs has focused on estimation and inference for the regression coefficients in the marginal mean. When it comes to prediction using GEEs, practitioners often simply, and quite understandably, base it on the regression model characterizing the marginal mean.

In this talk, we propose a simple adjustment to predictions in GEEs that utilizes the information in the assumed working correlation matrix. Focusing on longitudinal data, and by viewing the GEE from the perspective of solving a working linear model, we borrow ideas from universal kriging to construct a “conditional” predictor that leverages temporal correlations between the new and current observations within the same cluster. We establish theoretical conditions under which the proposed adjusted GEE predictor outperforms the standard (unadjusted) predictor. Simulations show that even when the working correlation structure is misspecified, adjusted GEE predictors (combined with an information criterion for choosing the working correlation structure) can still improve on the predictive performance of standard GEE predictors, as well as of the so-called “oracle GEE predictor” that uses all observations.

This is joint work with Samuel Mueller (Macquarie University) and A. H. Welsh (ANU).
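The flavour of the adjustment can be seen in a small numpy sketch (an illustration under an assumed AR(1) working correlation, with hypothetical variable names; not the authors' implementation): the new observation is predicted by its marginal mean plus a kriging-style correction built from the same cluster's residuals.

```python
import numpy as np

def adjusted_prediction(mu_obs, y_obs, mu_new, rho):
    """Kriging-style adjusted prediction of the next response in a cluster,
    under an AR(1) working correlation with parameter rho.

    mu_obs : marginal-mean fits for the T observed responses in the cluster
    y_obs  : the T observed responses
    mu_new : marginal-mean fit for the new (time T + 1) response
    """
    T = len(y_obs)
    # AR(1) working correlation among the observed responses
    R = rho ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    # working correlations between the new and the observed responses
    c = rho ** (T + 1 - np.arange(1, T + 1))
    # marginal mean plus a correction built from the cluster's residuals
    return mu_new + c @ np.linalg.solve(R, y_obs - mu_obs)

# With rho = 0 the correction vanishes and the standard (marginal-mean)
# GEE predictor is recovered; with an AR(1) structure the correction
# reduces to rho times the most recent residual.
print(adjusted_prediction(np.zeros(3), np.array([1.0, -1.0, 2.0]), 0.5, 0.0))
```

With `rho = 0` this prints the marginal mean `0.5` unchanged; raising `rho` pulls the prediction toward the cluster's recent residuals.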

Friday, 15 March

**Title:** On aspects of localized learning

**Speaker:** Andreas Christmann, Universität Bayreuth

**Abstract:** Many machine learning methods do not scale well in computation time and computer memory as the sample size increases. Divide-and-conquer methods can be helpful in this respect and often allow for parallel computing; this is true, e.g., for distributed learning. This talk will focus on localized learning, a related but slightly different approach. We will mainly consider aspects of statistical robustness and questions of stability. The first part of the talk will investigate the question of qualitative robustness without specifying a particular learning method. The second part will deal with the total stability of kernel-based methods and is based on Koehler and Christmann (2022, JMLR, 23, 1-41).

Friday, 8 March

**Title:** Learning Deep Representations with Optimal Transport

**Speaker:** He Zhao, Data61, CSIRO

**Abstract:** Originating in the work of mathematicians, statisticians, and economists, Optimal Transport (OT) is a powerful tool for resource allocation. Recently, OT has gained significant attention in machine learning and deep learning, particularly in areas where the comparison of probability measures is essential. In this talk, I will introduce two recent works of mine on applying OT to deep representation learning that captures essential structural information in the data, leading to improved generalisation and robustness. One is on the task of image data augmentation for imbalanced problems and the other is on missing value imputation.

Friday, 1 March

**Title:** A BLAST from the past: revisiting BLAST's E-value

**Speaker:** Uri Keich, University of Sydney

**Abstract:** The Basic Local Alignment Search Tool, BLAST, is an indispensable tool for genomic research. BLAST established itself as the canonical tool for sequence similarity search in large part thanks to its meaningful statistical analysis. Specifically, BLAST reports the E-value of each reported alignment, defined as the expected number of optimal local alignments that will score at least as high as the observed alignment score, assuming that the query and the database sequences are randomly generated. We critically re-evaluate BLAST's E-values, showing that they can at times be significantly conservative and at other times too liberal.

We offer an alternative approach based on generating a small sample from the null distribution of random optimal alignments, and testing whether the observed alignment score is consistent with it. In contrast with BLAST, our significance analysis seems valid, in the sense that it did not deliver inflated significance estimates in any of our extensive experiments. Moreover, although our method is slightly conservative, it is often significantly less so than the BLAST E-value. Indeed, in cases where BLAST's analysis is valid (i.e., not too liberal), our approach seems to deliver a greater number of correct alignments. One advantage of our approach is that it works with any reasonable choice of substitution matrix and gap penalties, avoiding BLAST's limited options of matrices and penalties. In addition, we can formulate the problem using a canonical family-wise error rate control setup, thereby dispensing with E-values, which can at times be difficult to interpret.

Joint work with Yang Young Lu (Cheriton School of Computer Science, University of Waterloo) and William Stafford Noble (Department of Genome Sciences and Paul G. Allen School of Computer Science and Engineering, University of Washington).

## Seminars in 2023

Friday, 3 November

**Title:** Post-Processing MCMC with Control Variates

**Speaker:** Leah South, Queensland University of Technology

**Abstract:** Control variates are valuable tools for improving the precision of Monte Carlo estimators but they have historically been challenging to apply in the context of Markov chain Monte Carlo (MCMC). This talk will describe several new and broadly applicable control variates that are suitable for post-processing of MCMC. The methods I'll speak about are in the class of estimators that use Langevin Stein operators to generate control variates or control functionals, so they are applicable when the gradients of the log target pdf are available (as is the case for gradient-based MCMC). I will first give an overview of existing methods in this field, as per [1], before introducing two new methods. The first method [2], referred to as semi-exact control functionals (SECF), is based on control functionals and Sard’s approach to numerical integration. The use of Sard’s approach ensures that our control functionals are exact on all polynomials up to a fixed degree in the Bernstein-von-Mises limit. SECF is also bias-correcting, in the sense that it is capable of removing asymptotic bias in biased MCMC samplers under some conditions. The second method [3] applies regularisation to improve the performance of popular Stein-based control variates for high-dimensional Monte Carlo integration. Several Bayesian inference examples will be used to illustrate the potential for reduction in mean square error.

[1] South, L. F., Riabiz, M., Teymur, O., & Oates, C. J. (2022). Postprocessing of MCMC. Annual Review of Statistics and Its Application, 9, 529-555.

[2] South, L. F., Karvonen, T., Nemeth, C., Girolami, M., & Oates, C. (2022). Semi-Exact Control Functionals From Sard's Method. Biometrika, 109(2), 351-367.

[3] South, L. F., Oates, C. J., Mira, A., & Drovandi, C. (2023). Regularized zero-variance control variates. Bayesian Analysis, 18(3), 865-888.
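As a toy illustration of the Stein-operator idea behind these methods (i.i.d. draws standing in for an MCMC chain; not the estimators of [1]-[3]): for a standard normal target, applying the Langevin Stein operator to φ(x) = x yields 1 − x², a function with known zero mean that can be regressed out of the integrand.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)   # i.i.d. draws standing in for MCMC output

# Langevin Stein operator applied to phi(x) = x under the N(0, 1) target:
# (A phi)(x) = phi'(x) + phi(x) * d/dx log pi(x) = 1 - x**2,
# which has mean zero under the target, whatever phi we started from.
f = x**2                 # integrand: we want E[X^2] = 1
g = 1.0 - x**2           # Stein control variate with known zero mean

# regress the control variate out of the integrand
beta = np.cov(f, g, ddof=0)[0, 1] / np.var(g)
plain = f.mean()
adjusted = (f - beta * g).mean()
print(plain, adjusted)
# Here beta is -1 (up to rounding), so f - beta*g is constant and the
# adjusted estimator is exact; in general one only gets variance reduction.
```

The toy example is degenerate on purpose: the optimal regression makes the variance collapse to (nearly) zero, which is the "zero-variance" behaviour the control-variate literature aims at.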

Friday, 27 October

**Title:** A tripartite decomposition of vector autoregressive processes into temporal flows

**Speaker:** Brendan Beare, University of Sydney School of Economics

**Abstract:** Every autoregressive process taking values in a finite dimensional complex vector space is shown to be equal to the sum of three processes which we call the forward, backward and outward temporal flows. Each of the three temporal flows may be decomposed further into the sum of a stochastic component and a deterministic component. The forward temporal flow consists of a stationary infinite weighted sum of past innovations (the stochastic component) and a term which decays/grows exponentially as we move forward/backward in time and is determined by the behaviour of the autoregressive process in the arbitrarily distant past (the deterministic component). The backward temporal flow consists of a stationary infinite weighted sum of future innovations (the stochastic component) and a term which grows/decays exponentially as we move forward/backward in time and is determined by the behaviour of the autoregressive process in the arbitrarily distant future (the deterministic component). The outward temporal flow consists of a nonstationary finite weighted sum of innovations going outward from time zero (the stochastic component) and a term which grows at a polynomial rate as we move away from time zero and is determined by the value taken by the autoregressive process at time zero (the deterministic component). Each of the three temporal flows is obtained by applying one of three complementary spectral projections to the autoregressive process. The three spectral projections correspond to a separation of the eigenvalues of the autoregressive coefficient into three regions: the open unit disk, the complement of the closed unit disk, and the unit circle.

Friday, 15 September

**Title:** Forecasting intraday financial time series with sieve bootstrapping and dynamic updating

**Speaker:** Hanlin Shang, Macquarie University

**Abstract:** Intraday financial data often take the form of a collection of curves that can be observed sequentially over time, such as intraday stock price curves. These curves can be viewed as a time series of functions observed on equally spaced and dense grids. Due to the curse of dimensionality, such high-dimensional data pose challenges from a statistical perspective; however, they also provide opportunities to analyze a rich source of information, so that dynamic changes within short time intervals can be better understood. We consider a sieve bootstrap method to construct one-day-ahead point and interval forecasts in a model-free way. As we sequentially observe new data, we also implement two dynamic updating methods to update point and interval forecasts to achieve improved accuracy. The forecasting methods are validated through an empirical study of 5-min cumulative intraday returns of the S&P/ASX All Ordinaries Index.
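In one dimension, the AR-sieve bootstrap underlying this kind of construction can be sketched as follows (a toy AR(1) version for a scalar series, not the functional implementation used in the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate an AR(1) series as a stand-in for a (transformed) score series
# extracted from the functional time series.
n, phi_true = 500, 0.6
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + rng.standard_normal()

# Step 1: fit the AR(1) "sieve" by least squares
phi_hat = np.dot(y[1:], y[:-1]) / np.dot(y[:-1], y[:-1])
resid = y[1:] - phi_hat * y[:-1]
resid -= resid.mean()                      # centre the residuals

# Step 2: bootstrap one-step-ahead forecasts by resampling residuals,
# avoiding any distributional assumption on the innovations
B = 2000
forecasts = phi_hat * y[-1] + rng.choice(resid, size=B, replace=True)

# Step 3: point forecast and an 80% percentile interval
point = forecasts.mean()
lo, hi = np.percentile(forecasts, [10, 90])
print(point, (lo, hi))
```

The model-free character comes from step 2: the forecast distribution is driven by the empirical residuals rather than by an assumed innovation law.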

Friday, 1 September

**Title:** Directions old and new: Palaeomagnetism and Fisher meet modern statistics

**Speaker:** Janice Scealy, Australian National University

**Abstract:** Most modern articles in the palaeomagnetism literature are based on statistics developed in Fisher's 1953 paper Dispersion on a sphere, which assumes independent and identically distributed (iid) spherical data. However, palaeomagnetic sample designs are usually hierarchical, where specimens are collected within sites and the data are then combined across sites to calculate an overall mean direction for a geological formation. The specimens within sites are typically more similar than specimens between different sites, and so the iid assumptions fail. We will first review, contrast and compare both the statistics and geophysics literature on analysis methods for clustered data on spheres. We will then present a new hierarchical parametric model, which avoids the unrealistic assumption of rotational symmetry in Fisher's 1953 paper and may be broadly useful in the analysis of many palaeomagnetic datasets. To help develop the model, we use publicly available data collected from the Golan Heights volcanic plateau as a case study. Next, we will explore different methods for constructing confidence regions for the overall mean direction based on clustered data. Two bootstrap confidence regions that we propose perform well and will be especially useful to geophysics practitioners.

Wednesday, 30 August

**Title:** Unsupervised Spatial-Temporal Decomposition for Feature Extraction and Anomaly Detection

**Speaker:** Jian Liu, University of Arizona

**Abstract:** The advancement of sensing and information technology has made it reliable and affordable to collect data continuously from many sensors that are spatially distributed, generating readily available Spatial-Temporal (ST) data. While the abundant ST information embedded in such high-dimensional ST data provides engineers with unprecedented opportunities to understand, monitor, and control the engineering processes, the complex ST correlation makes conventional statistical data analysis methods ineffective and inefficient. This is especially true for ST feature extraction and ST anomaly detection, where the features of interest or the anomalies possess ST characteristics subtly different from the normal routine background. In this seminar, I will introduce a generic unsupervised learning method based on ST decomposition. The high-dimensional ST data are modeled as a tensor, which is then decomposed into different tensor components represented as a combination of a series of lower-dimensional factors. Without relying on pre-annotated training data, these tensor components will be estimated to indicate the latent features and/or anomalies of interest. A regularization approach is adopted to incorporate the knowledge of the tensor components’ intrinsic ST characteristics into the algorithm to improve the accuracy and robustness of the model estimation. Multiple case studies were conducted to demonstrate the effectiveness of the proposed methods, including water burst detection in water distribution systems and video segmentation.

Friday, 18 August

**Title:** Conceptualizing experimental controls using the potential outcomes framework

**Speaker:** Kristen Hunter, University of New South Wales

**Abstract:** The goal of a well-controlled study is to remove unwanted variation when estimating the causal effect of the intervention of interest. Experiments conducted in the basic sciences frequently achieve this goal using experimental controls, such as “negative” and “positive” controls, which are measurements designed to detect systematic sources of such unwanted variation. Here, we introduce clear, mathematically precise definitions of experimental controls using potential outcomes. Our definitions provide a unifying statistical framework for fundamental concepts of experimental design from the biological and other basic sciences. These controls are defined in terms of whether assumptions are being made about a specific treatment level, outcome, or contrast between outcomes. We discuss experimental controls as tools for researchers to wield in designing experiments and detecting potential design flaws, including using controls to diagnose unintended factors that influence the outcome of interest, assess measurement error, and identify important subpopulations. We believe that experimental controls are powerful tools for reproducible research that are possibly underutilized by statisticians, epidemiologists, and social science researchers.

Thursday, 3 August

**Title:** On arbitrarily underdispersed discrete distributions

**Speaker:** Alan Huang, University of Queensland

**Abstract:** We survey a range of popular count distributions, investigating which (if any) can be arbitrarily underdispersed, i.e., have a variance that can be made arbitrarily small relative to the mean. A philosophical implication is that certain models failing this criterion should perhaps not be considered “statistical models” according to the extendibility criterion of McCullagh (2002). Four practical implications will be discussed. We suggest that all generalizations of the Poisson distribution be tested against this property.
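As a back-of-the-envelope illustration of the property in question (my own example, not from the talk): the binomial family is arbitrarily underdispersed, since its variance-to-mean ratio 1 − p can be pushed as close to zero as we like, while the Poisson ratio is pinned at 1.

```python
# Dispersion index (variance / mean) of a count distribution; a family is
# arbitrarily underdispersed if this ratio can be made arbitrarily small.
def dispersion_index(mean, var):
    return var / mean

n = 10
for p in (0.5, 0.9, 0.99):
    mean, var = n * p, n * p * (1 - p)   # Binomial(n, p) moments
    print(f"Binomial({n}, {p}): index = {dispersion_index(mean, var):.2f}")
# A Poisson(mu) has index mu / mu = 1 for every mu, so it can never be
# underdispersed, let alone arbitrarily so.
```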

Friday, June 2

**Speaker:** Dennis Leung, University of Melbourne

Location: Room 2020 Abercrombie Bldg H70

Time: 11 am - 12pm

**Title:** ZAP: Z-Value Adaptive Procedures for False Discovery Rate Control With Side Information

**Abstract:**
Adaptive multiple testing with covariates is an important research direction that has gained major attention in recent years, as it is widely recognized that leveraging side information from auxiliary covariates can improve the power of procedures for controlling the false discovery rate (FDR). For example, in the differential expression analysis of RNA-sequencing data, the average read depths across samples provide useful side information alongside the individual p-values, and incorporating such information promises to improve the power of existing methods. However, for two-sided hypotheses, the usual data-processing step that transforms the primary statistics, generally known as z-values, into p-values not only loses information carried by the primary statistics but can also undermine the ability of the covariates to assist with the FDR inference. Motivated by this, and building upon recent advances in FDR research, we develop ZAP, a z-value-based covariate-adaptive methodology. It operates on the intact structural information encoded jointly by the z-values and covariates to mimic an optimal oracle testing procedure that is unattainable in practice; the power gain of ZAP can be substantial in comparison with p-value-based methods.

Friday, May 19

**Speaker:** Andrew Zammit-Mangion, University of Wollongong

**Title:** Neural Point Estimation for Fast Optimal Likelihood-Free Inference

**Abstract:**
Neural point estimators are neural networks that map data to parameter point estimates. They are fast, likelihood-free and, due to their amortised nature, amenable to fast bootstrap-based uncertainty quantification. In this talk I give an overview of this relatively new inferential tool, giving particular attention to the ubiquitous problem of making inference from replicated data, which we address in the neural setting using permutation-invariant neural networks. Through extensive simulation studies we show that these neural point estimators can quickly and optimally (in a Bayes sense) estimate parameters in weakly identified and highly parameterised models, such as models of spatial extremes, with relative ease. We demonstrate their applicability through an analysis of extreme sea-surface temperature in the Red Sea where, after training, we obtain parameter estimates and bootstrap-based confidence intervals from hundreds of spatial fields in a fraction of a second. This is joint work with Matthew Sainsbury-Dale and Raphaël Huser.

Friday, April 21

**Speaker:** Bradley Rava, University of Sydney Business School

**Title:** A Burden Shared is a Burden Halved: A Fairness-Adjusted Approach to Classification

**Abstract:**
We study fairness in classification, where one wishes to make automated decisions for people from different protected groups. When individuals are classified, the decision errors can be unfairly concentrated in certain protected groups. We develop a fairness-adjusted selective inference (FASI) framework and data-driven algorithms that achieve statistical parity in the sense that the false selection rate (FSR) is controlled and equalized among protected groups. The FASI algorithm operates by converting the outputs from black-box classifiers to R-values, which are intuitively appealing and easy to compute. Selection rules based on R-values are provably valid for FSR control, and avoid disparate impacts on protected groups. The effectiveness of FASI is demonstrated through both simulated and real data.

Friday, April 14

**Speaker:** Ryan Elmore, University of Colorado

**Title:** NBA Action, It’s FANtastic (and great for data analytics too!)

**Abstract:**
In this talk, I will describe my two most recent statistical problems and solutions related to the National Basketball Association (NBA). In particular, I will discuss (1) the usefulness of a coach calling a timeout to thwart an opposition’s momentum and (2) a novel metric for rating the overall shooting effectiveness of players in the NBA. I will describe the motivation for each problem, how to find data for NBA analyses, modeling considerations, and our results. Lastly, I will describe why I think the analysis of sport, in general, provides an ideal venue for teaching/learning statistical or analytical concepts and techniques.

Friday, March 17

**Speaker:** Sarat Moka, UNSW

Location: Carslaw 175, or Zoom at https://uni-sydney.zoom.us/j/89991403193

**Title:** Best Subset Selection for Linear Dimension Reduction Models using Continuous Optimization

**Abstract:**
Selecting the optimal variables in high-dimensional contexts is a challenging task in both supervised and unsupervised learning. This talk focuses on two popular linear dimension-reduction methods, principal components analysis (PCA) and partial least squares (PLS), with diverse applications in genomics, biology, environmental science, and engineering. PCA and PLS construct principal components that are combinations of the original variables. However, interpreting principal components becomes challenging when the number of variables is large. To address this issue, we discuss a new approach that combines best subset selection with the PCA and PLS frameworks. We use a continuous optimization algorithm to identify the most relevant variables for constructing principal components. Our approach is evaluated using two real datasets, one analysed using PCA and the other using PLS. Empirical results demonstrate the effectiveness of our method in identifying the optimal subset of variables. This is joint work with Prof. Benoit Liquet and Prof. Samuel Muller.
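What a subset-selected component buys can be illustrated with a crude stand-in for best-subset PCA (a truncated power iteration that hard-thresholds to k loadings; the continuous-optimization algorithm discussed in the talk is a different technique):

```python
import numpy as np

rng = np.random.default_rng(3)

def sparse_pc1(X, k, iters=100):
    """Leading principal component restricted to k variables, via a
    truncated power iteration: hard-threshold to the k largest-magnitude
    loadings at every step, then renormalise."""
    S = X.T @ X
    v = np.ones(X.shape[1]) / np.sqrt(X.shape[1])
    for _ in range(iters):
        v = S @ v
        keep = np.argsort(np.abs(v))[-k:]   # best k variables this step
        mask = np.zeros_like(v)
        mask[keep] = 1.0
        v = v * mask
        v /= np.linalg.norm(v)
    return v

# Toy data: the common signal lives in the first 3 of 10 variables.
n = 500
z = rng.standard_normal(n)
X = 0.3 * rng.standard_normal((n, 10))
X[:, :3] += z[:, None]

v = sparse_pc1(X, k=3)
print(np.nonzero(v)[0])   # indices of the selected variables
```

The resulting component involves only three variables, which is the interpretability gain that motivates combining subset selection with PCA and PLS.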

Friday, March 10

**Speaker:** Suojin Wang, Texas A&M University

Location: Carslaw 175, or Zoom at https://uni-sydney.zoom.us/j/84730883100

**Title:** Robust regression using probabilistically linked data

**Abstract:**
There is growing interest in a data integration approach to survey sampling,
particularly where population registers are linked for sampling and subsequent
analysis. The reason for doing this is simple: it is only by linking the same
individuals in the different sources that it becomes possible to create a data set
suitable for analysis. But data linkage is not error-free. Many linkages are
non-deterministic, based on how likely it is that a linking decision corresponds to a
correct match, i.e., that it brings together the same individual in all sources.
High-quality linking will ensure that the probability of this happening is high. When it
is not, analysis of the linked data should account for this additional source of error.
This is especially true for secondary analysis carried out without access to the
linking information, i.e., the often confidential data that agencies use in their record
matching. We describe an inferential framework that allows for linkage errors when
sampling from linked registers. After first reviewing current research activity in this
area, we focus on secondary analysis and linear regression modelling, including the
important special case of estimation of subpopulation and small area means. In doing so
we consider both robustness and efficiency of the resulting linked data inferences.

Friday, March 3

**Speaker:** Akanksha Negi, Monash University

Location: Carslaw 175, or Zoom at https://uni-sydney.zoom.us/j/84730883100

**Title:** Difference-in-differences with a misclassified treatment

**Abstract:**
This paper studies identification and estimation of the
average treatment effect on the treated (ATT) in difference-in-differences designs when
the variable that classifies individuals into treatment and control groups (treatment
status, D) is differentially (or endogenously) mis-measured. We show that
misclassification in D hampers consistent estimation of the ATT because 1) it prevents
us from distinguishing the truly treated from those misclassified as treated, and 2)
differential misclassification in counterfactual trends may result in parallel trends
being violated with D even when they hold with the true but unobserved D*. We propose a
two-step estimator to correct for this problem using a flexible parametric specification
which allows for considerable heterogeneity in treatment effects. The solution uses a
single exclusion restriction embedded in a partial observability probit to
point-identify the true ATT. Subsequently, we derive the asymptotic properties of this
estimator in panel and repeated cross-section settings. Finally, we apply this method
to a large-scale in-kind transfer program in India that is known to suffer from
targeting errors.

Friday, February 24

**Speaker:** Lucy Gao, University of British Columbia

Location: Zoom at https://uni-sydney.zoom.us/j/86954517372

**Title:** Valid inference after clustering with application to single-cell RNA-sequencing data

**Abstract:**
In single-cell RNA-sequencing studies, researchers often model the variation
between cells with a latent variable, such as cell type or pseudotime, and investigate
associations between the genes and the latent variable. As the latent variable is
unobserved, a two-step procedure seems natural: first estimate the latent variable, then
test the genes for association with the estimated latent variable. However, if the same
data are used for both of these steps, then standard methods for computing p-values in
the second step will fail to control the type I error rate.

In my talk, I will introduce two different approaches to this problem. First, I will
apply ideas from selective inference to develop a valid test for a difference in means
between clusters obtained from the hierarchical clustering algorithm. Then, I will
introduce count splitting: a flexible framework that enables valid inference after
latent variable estimation in count-valued data, for virtually any latent variable
estimation technique and inference approach.

This talk is based on joint work with Jacob Bien (University of Southern California),
Daniela Witten and Anna Neufeld (University of Washington), as well as Alexis Battle and
Joshua Popp (Johns Hopkins University).

## Seminars in 2022

Friday, October 21

**Speaker:** Howard Bondell, University of Melbourne

**Title:** Do you have a moment? Bayesian inference using estimating equations via
empirical likelihood

**Abstract:**
Bayesian inference typically relies on specification of a likelihood as a key
ingredient. Recently, likelihood-free approaches have become popular to avoid
specification of potentially intractable likelihoods. Alternatively, in the Frequentist
context, estimating equations are a popular choice for inference corresponding to an
assumption on a set of moments (or expectations) of the underlying distribution, rather
than its exact form. Common examples are in the use of generalised estimating equations
with correlated responses, or in the use of M-estimators for robust regression avoiding
the distributional assumptions on the errors. In this talk, I will discuss some of the
motivation behind empirical likelihood, and how it can be used to incorporate a fully
Bayesian analysis into these settings where only specification of moments is desired.
This allows one to then take advantage of prior distributions that have been developed
to accomplish various shrinkage tasks, both theoretically and in practice. I will
further discuss computational issues that arise due to non-convexity of the support of
this likelihood and the corresponding posterior, and show how this can be rectified to
allow for MCMC and variational approaches to perform posterior inference.

Friday 16 Sep

**Speaker:** Mohammad Davoudabadi, University of Sydney

**Title:** Advanced Bayesian approaches for state-space models with a case study on soil
carbon sequestration

**Abstract:**
Sequestering carbon into the soil can mitigate the atmospheric concentration of greenhouse gases, improve crop productivity, and yield financial gains for farmers
through the sale of carbon credits. In this work, we develop and evaluate advanced
Bayesian methods for modelling soil carbon sequestration and quantifying uncertainty
around predictions that are needed to fit more complex soil carbon models, such as
multiple-pool soil carbon dynamic models. This study demonstrates efficient
computational methods using a one-pool model of the soil carbon dynamics previously used
to predict soil carbon stock change under different agricultural practices applied at
Tarlee, South Australia. We focus on methods that can improve the speed of computation
when estimating parameters and model state variables in a statistically defensible way.
This study also serves as a tutorial on advanced Bayesian methods for fitting complex
state-space models.

Friday 26 Aug

**Speaker:** Pauline O'Shaughnessy, University of Wollongong

**Title:** Multiverse of Madness: Multivariate Moment-Based Density Estimation and Its
(Possible) Applications

**Abstract:**
Density approximation is a well-studied topic in statistics, probability, and
applied mathematics. One recently popular method is to construct a unique
polynomial-based series expansion for a density function, an approach that previously
lacked practical use due to its computational complexity. Recently, we developed a new
approach to approximate the joint density function of multivariate distributions using
a moment-based orthogonal polynomial expansion on a bounded space, based on a
carefully defined hypergeometric differential equation with a generalized form.
Exploring the applications of this moment-based density estimation is still underway; so
far, we have tried to implement the density estimation method in data privacy, time
series analysis, and missing data, which will be demonstrated in this talk. This is
joint work with Bradley Wakefield, Prof Yan-Xia Lin, and Wei Mi from UOW.

**Bio:**
Pauline is a lecturer in Statistics at the University of Wollongong. Before she
joined UOW, Pauline was a statistical consultant at ANU. Her current research interests
are in data privacy, statistical disclosure control, and mixed modelling.
Pauline spends (too much of) her spare time practising saxophone and trumpet for
orchestra and concert band.

Friday 12 Aug

**Speaker:** Nhat Ho, University of Texas at Austin

**Title:** Instability, Computational Efficiency and Statistical Accuracy

**Abstract:**
Many statistical estimators are defined as the fixed point of a data-dependent operator,
with estimators based on minimizing a cost function being an important special case.
The limiting performance of such estimators depends on the properties of the
population-level operator in the idealized limit of infinitely many samples. We develop
a general framework that yields bounds on statistical accuracy based on the interplay
between the deterministic convergence rate of the algorithm at the population level, and
its degree of (in)stability when applied to an empirical object based on n samples.
Using this framework, we analyze both stable forms of gradient descent and some
higher-order and unstable algorithms, including Newton's method and its
cubic-regularized variant, as well as the EM algorithm. We provide applications of our
general results to several concrete classes of singular statistical models, including
Gaussian mixture estimation, single-index models, and informative non-response models.
We exhibit cases in which an unstable algorithm can achieve the same statistical
accuracy as a stable algorithm in exponentially fewer steps; namely, with the number of
iterations being reduced from polynomial to logarithmic in sample size n.

**Bio:**
Nhat Ho is currently an Assistant Professor of Statistics and Data Sciences at the
University of Texas at Austin. He is also a core member of the Machine Learning
Laboratory. His current research focuses on the interplay of four principles of
statistics and data science: heterogeneity of data, interpretability of models,
stability, and scalability of optimization and sampling algorithms.

Friday 5 Aug

**Speaker:** Ole Maneesoonthorn, Melbourne Business School, University
of Melbourne

**Title:** Inference of Volatility Models Using Triple Information Sources: An Approximate Bayesian Computation Approach

**Abstract:**
This paper utilizes three sources of information (daily returns, high-frequency
data, and market option prices) to conduct inference about stochastic
volatility models. The inferential method of choice is approximate Bayesian
computation (ABC), which allows us to construct posterior distributions of the
model unknowns from data summaries without assuming a large-dimensional measurement
model for the three information sources. We employ ABC cut posteriors to dissect the
information sources in posterior inference, and show that doing so significantly
reduces the computational burden compared to conventional posterior sampling. The benefit of
utilizing multiple information sources in inference is explored in the context of
predictive performance of financial returns.
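
A minimal, textbook ABC rejection sampler (a sketch only; the talk's cut-posterior machinery and volatility models are far richer) illustrates the core idea of building a posterior from data summaries:

```python
import numpy as np

rng = np.random.default_rng(1)
obs = rng.normal(3.0, 1.0, size=200)       # observed data, true mean 3
s_obs = obs.mean()                         # data summary

# ABC rejection: draw a parameter from the prior, simulate data from
# the model, and keep the draw whenever the simulated summary lands
# within a tolerance of the observed summary.
eps = 0.05
kept = []
for _ in range(20_000):
    mu = rng.uniform(-10.0, 10.0)          # prior draw
    sim = rng.normal(mu, 1.0, size=200)    # simulated dataset
    if abs(sim.mean() - s_obs) < eps:
        kept.append(mu)

posterior = np.array(kept)
# The approximate posterior concentrates near the true mean.
```

The likelihood is never evaluated; closeness of summaries stands in for it, which is what makes ABC attractive when a full measurement model across several data sources would be intractable.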

Friday 27 May

**Speaker:** Tao Zou, Australian National University

**Title:** Quasi-score matching estimation for spatial autoregressive models with random weights matrix and regressors

**Abstract:**
Due to the rapid development of social networking sites, the spatial
autoregressive (SAR) model has played an important role in social network studies.
However, the commonly used quasi-maximum likelihood estimation (QMLE) for the SAR model
is not computationally scalable when the network size is large. In addition, when
establishing the asymptotic distribution of the parameter estimators of the SAR model,
both the weights matrix and the regressors are assumed to be nonstochastic in classical
spatial econometrics, which is perhaps unrealistic in applications. Motivated by the
machine learning literature, quasi-score matching estimation for the SAR model is
proposed. This new estimation approach is still likelihood-based, but significantly
reduces the computational complexity of the QMLE. The asymptotic properties of
parameter estimators under the random weights matrix and regressors are established,
which provides a new theoretical framework for the asymptotic inference of the SAR type
models. The usefulness of the quasi-score matching estimation and its asymptotic
inference are illustrated via extensive simulation studies. This is a joint work with
Dr Xuan Liang at ANU.
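
To convey the flavour of score matching in the simplest possible setting (a toy zero-mean Gaussian, not the SAR quasi-score matching of the talk): the method fits an unnormalised density by matching its score function d/dx log p to the data, so the normalising constant never has to be computed.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 2.0, size=5000)    # zero-mean Gaussian, sigma^2 = 4

# Score matching objective J(s2) = E[psi'(x) + psi(x)^2 / 2], where
# psi(x) = d/dx log p(x; s2) = -x / s2 for the unnormalised density
# exp(-x^2 / (2 * s2)).  No normalising constant appears.
def J(s2):
    psi = -x / s2
    return np.mean(-1.0 / s2 + 0.5 * psi**2)

# The empirical objective is minimised exactly at s2 = mean(x^2);
# a grid search over J recovers the same value.
s2_hat = np.mean(x**2)
grid = np.linspace(1.0, 9.0, 161)
s2_grid = grid[np.argmin([J(s2) for s2 in grid])]
```

Avoiding the normalising constant is precisely what makes score-matching-type estimators cheaper than QMLE, whose SAR log-likelihood contains a log-determinant that is expensive for large networks.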

Friday 20 May

**Speaker:** Jing Cao, Southern Methodist University

**Title:**

**Abstract:**
As a branch of machine learning, multiple instance learning (MIL) learns from
a collection of labeled bags, each containing a set of instances. Each instance is
described by a feature vector. Since its emergence, MIL has been applied to solve
various problems including content-based image retrieval, object tracking/detection, and
computer-aided diagnosis. In this study, we apply MIL to text sentiment analysis. The
current neural-network-based approaches in text analysis enjoy high classification
accuracies but usually lack interpretability. The proposed Bayesian MIL model treats
each text document as a bag, where the words are the instances. The model has a
two-layered structure. The first layer identifies whether a word is essential or not
(i.e., primary instance), and the second layer assigns a sentiment score over the
individual words of a document. The motivation of our approach is that, by combining
the attention mechanism from neural networks with a relatively simple statistical
model, we can hope to get the best of both worlds: the interpretability of a
statistical model and the high predictive performance of neural-network models.
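
A minimal sketch of the standard MIL assumption (a bag is positive iff at least one of its instances is), using a fixed linear instance score and a max aggregator as a crude stand-in for the model's attention-style weighting; all numbers here are made up:

```python
import numpy as np

rng = np.random.default_rng(2)

# Standard MIL setup: each bag is a set of instances (feature
# vectors); the bag is positive iff it contains at least one
# positive instance.  Instances are scored by a fixed linear rule
# and the bag score is the max over instance scores.
w = np.array([1.0, -1.0])              # instance scoring weights

def bag_score(bag):
    return float(np.max(bag @ w))      # max over instance scores

pos_bag = rng.normal(0, 1, (5, 2)) + np.array([3.0, 0.0])
neg_bag = rng.normal(0, 1, (5, 2)) - np.array([3.0, 0.0])

# The positive bag outscores the negative bag because one high-scoring
# instance suffices under the max rule.
```

In the talk's sentiment setting the document is the bag, the words are the instances, and the "primary instance" layer plays the role of picking out the high-scoring words.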

**Bio:**

Friday 13 May

**Speaker:** Ziyang Lyu, University of New South Wales

**Title:**

**Abstract:**
We derive asymptotic results for the maximum likelihood and restricted maximum
likelihood (REML) estimators of the parameters in the nested error regression model when
both the number of independent clusters and the cluster sizes (the number of
observations in each cluster) go to infinity. A set of conditions is given under which
the estimators are shown to be asymptotically normal. There are no restrictions on the
rate at which the cluster size tends to infinity. We also show that the asymptotic
distributions of the empirical best linear unbiased predictors (EBLUPs) of the random
effects, computed with ML/REML estimates of the variance components, converge to the
true distributions of the corresponding random effects under the same asymptotic
regime.

**Bio:**

Friday 29 April

**Speaker:** Benoit Liquet-Weiland, Macquarie University

**Title:** Variable Selection and Dimension Reduction Methods for High-Dimensional and Big Data Sets

**Abstract:**
It is well established that incorporating prior knowledge of the structure existing
in the data, such as potential groupings of the covariates, is key to more accurate
prediction and improved interpretability.

In this talk, I will present new multivariate methods incorporating grouping structure,
in both frequentist and Bayesian frameworks, for variable selection and dimension
reduction to tackle the analysis of high-dimensional and big data sets. We develop
methods using both penalised likelihood methods and Bayesian spike-and-slab priors to
induce structured sparsity. Illustrations on genomics datasets will be presented.

Friday 22 April

**Speaker:** Yingxin Li, University of Sydney

**Title:** Statistical modelling and machine learning for single-cell data harmonisation and analysis

**Abstract:**
Technological advances such as large-scale single-cell profiling have exploded
in recent years and enabled unprecedented understanding of the behaviour of individual
cells. Effectively harmonising multiple collections and different modalities of
single-cell data, and accurately annotating cell types using reference data (which we
consider the intermediate data analysis step in this thesis), serve as a foundation
for the downstream analysis to uncover biological insights from single-cell data. This
thesis proposes several statistical modelling and machine learning methods to address
challenges in intermediate data analysis in the single-cell omics era,
including: (1) scMerge to effectively integrate multiple collections of single-cell
RNA-sequencing (scRNA-seq) datasets from a single modality; (2) scClassify to annotate
cell types for scRNA-seq data by capitalising on the large collection of well-annotated
scRNA-seq datasets; and (3) scJoint to integrate unpaired atlas-scale single-cell
multi-omics data and transfer labels from scRNA-seq datasets to scATAC-seq data. We
illustrate that the proposed methods enable a novel and scalable workflow for
integratively analysing large-cohort single-cell data, demonstrated using a collection
of single-cell multi-omics COVID-19 datasets.

**Bio:**
Yingxin Lin is a final year PhD student in Statistics at The University of Sydney
under the supervision of Prof. Jean Yang, Dr. Rachel Wang and A/Prof. John Ormerod.
Since the beginning of this year, she has been working as a postdoctoral researcher at
the University of Sydney. She is a member of the School of Mathematics and Statistics
and Sydney Precision Bioinformatics Alliance. Her research interests lie broadly in
statistical modelling and machine learning for various omics, biomedical and clinical
data, specifically focusing on methodological development and data analysis for
single-cell omics data.

Friday 8 April

**Speaker:** Michael Stewart, University of Sydney

**Title:** Detection boundaries for sparse gamma scale mixture models

**Abstract:**
Mixtures of distributions from a parametric family are useful for various
statistical problems, including nonparametric density estimation, as well as model-based
clustering. In clustering, an enduringly difficult problem is choosing the number of
clusters; when using mixture models for model-based clustering, this corresponds
(roughly) to choosing the number of components in the mixture. The simplest version of
this model selection problem is choosing between a known single-component mixture and a
"contaminated" version where a second unknown component is added. Due to certain
structural irregularities, many standard asymptotic results from hypothesis testing do
not apply in these "mixture detection" problems, including those relating to power under
local alternatives. Detection boundaries have arisen over the past few decades as
useful ways to describe what kinds of local alternatives are and are not detectable
(asymptotically) in these problems, in particular in the "sparse" case where the mixing
proportion of the contaminant is very small. We review early work on simple normal
location mixtures, some interesting generalisations and also recent results for a gamma
scale mixture model.

Friday 1 April

**Speaker:** Song Zhang, University of Texas Southwestern Medical Center

**Title:** Power Analysis for Cluster Randomized Trials with Multiple Primary Endpoints

**Abstract:**
Cluster randomized trials (CRTs) are widely used in different areas of
medicine and public health. Recently, with increasing complexity of medical therapies
and technological advances in monitoring multiple outcomes, many clinical trials attempt
to evaluate multiple primary endpoints. In this study, we present a power analysis
method for CRTs with K > 2 binary co-primary endpoints. It is developed based on the
GEE (generalized estimating equation) approach, and three types of correlations are
considered: inter-subject correlation within each endpoint, intra-subject correlation
across endpoints, and inter-subject correlation across endpoints. A closed-form joint
distribution of the K test statistics is derived, which facilitates the evaluation of
power and type I error for arbitrarily constructed hypotheses. We further present a
theorem that characterizes the relationship between various correlations and testing
power. We assess the performance of the proposed power analysis method based on
extensive simulation studies. An application example to a real clinical trial is
presented.
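
As a toy sketch of the kind of calculation involved (illustrative numbers only, and Monte Carlo rather than the paper's closed-form joint distribution), the power to reject on both of K = 2 co-primary endpoints can be read off the joint normal distribution of the test statistics:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical K = 2 co-primary endpoints: overall success requires
# both one-sided tests to reject.  The two z-statistics are modelled
# as jointly normal with noncentrality mu and correlation rho.
rho = 0.3
mu = np.array([2.8, 2.8])                   # effect sizes in z-units
cov = np.array([[1.0, rho], [rho, 1.0]])
z_alpha = 1.959964                          # two-sided 5% critical value

Z = rng.multivariate_normal(mu, cov, size=200_000)
power = np.mean((Z > z_alpha).all(axis=1))  # P(both tests reject)
# Marginal power per endpoint is about 0.80; requiring both rejections
# lowers the joint power, by an amount that depends on rho.
```

This is the simplest version of why the correlations among endpoints (and, in a CRT, among subjects) enter the power calculation.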

**Bio:**
Dr. Song Zhang is a professor of biostatistics from the Department
of Population and Data Sciences, University of Texas Southwestern Medical Center. He
received his Ph.D. in statistics from the University of Missouri-Columbia in 2005. His
research interest includes Bayesian hierarchical models with application to disease
mapping, missing data imputation, joint modeling of longitudinal and survival outcomes,
and genomic pathway analysis, as well as experimental design methods for clinical trials
with clustered/longitudinal outcomes, different types of outcome measures, missing data
patterns, correlation structures, and financial constraints. He has co-authored a book
titled "Sample Size Calculations for Clustered and Longitudinal Outcomes in Clinical
Research" (Chapman & Hall/CRC). As the principal investigator, Dr. Zhang has received
funding from PCORI, NIH, and NSF to support his research.

Friday 11 March

**Speaker:** Nick Fisher, University of Sydney

**Title:** World University Ranking systems, Texas Target Practice, and a Gedankenexperiment

**Abstract:**
World University Ranking (WUR) systems play a significant role in how universities are funded and whom they can attract as faculty and students. Yet, for the purpose of comparing universities as institutions of higher education, current systems are readily gamed, provide little guidance about what needs to be improved, and fail to allow for the diversity of stakeholder needs in making comparisons.
We suggest a list of criteria that a WUR system should meet, and which none of the currently popular systems appears to satisfy. Using as a starting point the goal of creating value for the diverse, and sometimes competing, stakeholder requirements for a university, we suggest via a thought experiment a rating process that is consistent with all the criteria, and a way in which it might be trialled. The resulting system also adds value for individual users by allowing them to tune it to their own particular circumstances.
However, the answer to the simple question "Which is the best university?" may well be: there is no simple answer.

## Maths & Stats website:

**Last updated:** Friday 15 March 2024 at 05:53 pm. For questions or comments please contact webmaster@maths.usyd.edu.au.

© 2002-2024 The University of Sydney.
