University of Sydney Statistics Seminar Series

Unless otherwise specified, seminars will be held on Fridays at 2pm in Carslaw 173.

To be added to or removed from the mailing list, or for any other information, please contact Garth Tarr.

2018 Semester 1

Friday March 23

Mikaela Jorgensen
Australian Institute of Health Innovation, Macquarie University, Sydney, Australia

Using routinely collected data in aged care research: a grey area

When the Department of Health launched the My Aged Care website in 2013 they "severely under-estimated the proportion of enquiries and referrals they would receive by fax". Yes, that's fax machines in *2013*. However, electronic data systems are increasingly starting to be used in aged care.

This presentation will discuss the joys of using messy routinely collected datasets to examine the care and outcomes of people using aged care services.

Does pressure injury incidence differ between residential aged care facilities? Is home care service use associated with time to entry into residential aged care? These questions, and more, will be discussed.

We'll take a dive into some multilevel mixed effects models, and resurface with some risk-adjusted funnel plots. People from all backgrounds with an interest in data analysis are welcome.
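
For readers unfamiliar with funnel plots, the following is a minimal sketch of the basic idea, not the speaker's method: each unit's observed proportion is compared with control limits that narrow as the unit's size grows. The function names, the target rate and the facility data are all hypothetical, and the risk adjustment mentioned in the abstract is not shown.

```python
import math


def funnel_limits(target, n, z=1.96):
    """Approximate binomial control limits around a target proportion.

    The limits narrow as n grows, which gives the plot its funnel shape.
    """
    se = math.sqrt(target * (1 - target) / n)
    return max(0.0, target - z * se), min(1.0, target + z * se)


def flag_units(units, target, z=1.96):
    """Return the names of units whose observed proportion lies outside the limits."""
    flagged = []
    for name, events, n in units:
        lower, upper = funnel_limits(target, n, z)
        rate = events / n
        if rate < lower or rate > upper:
            flagged.append(name)
    return flagged


# Hypothetical facilities: (name, residents with an event, residents observed).
facilities = [("A", 5, 50), ("B", 30, 100), ("C", 12, 120)]

if __name__ == "__main__":
    print(flag_units(facilities, target=0.10))
```

In a risk-adjusted version, the observed proportion would typically be replaced by a quantity that accounts for each unit's case mix, such as an observed-to-expected ratio from a fitted model.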

Dr Mikaela Jorgensen is a health services researcher at the Australian Institute of Health Innovation, Macquarie University. She has followed the traditional career pathway from speech pathologist to analyst of linked routinely collected health datasets for the last five years.

Friday March 9

Tim Swartz
Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada

A buffet of problems in sports analytics

This talk explores some work that I have done and some topics that you may find interesting with respect to statistics in sport. I plan on discussing a number of problems with almost no discussion of technical details. Some of the sports include hockey, cricket, highland dance, soccer and golf.

Friday February 9

Maria-Pia Victoria-Feser
Research Center for Statistics, Geneva School of Economics and Management, University of Geneva

A prediction divergence criterion for model selection and classification in high dimensional settings

A new class of model selection criteria is proposed which is suited for stepwise approaches or can be used as selection criteria in penalized estimation based methods. This new class, called the d-class of error measure, generalizes Efron's q-class. This class not only contains classical criteria such as Mallows' Cp or the AIC, but also enables one to define new criteria that are more general. Within this new class, we propose a model selection criterion based on a prediction divergence between two nested models' predictions that we call the Prediction Divergence Criterion (PDC). The PDC provides a different measure of prediction error from criteria that are attached to each potential model within a sequence, where the selection decision is based on the sign of differences between the criteria. The PDC directly measures the prediction error divergence between two nested models. As examples, we consider linear regression models and (supervised) classification. We show that a selection procedure based on the PDC, compared to the Cp (in the linear case), has a smaller probability of overfitting, hence leading to parsimonious models for the same out-of-sample prediction error. The PDC is particularly well suited to high dimensional and sparse situations and also to (small) model misspecifications. Examples on a malnutrition study and on acute leukemia classification will be presented.
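
For reference, the Cp criterion that the abstract uses as a comparison is commonly written as below. The notation is ours rather than the speaker's: \(\mathrm{RSS}_p\) is the residual sum of squares of a candidate linear model with \(p\) coefficients, \(n\) is the sample size, and \(\hat\sigma^2\) is an estimate of the error variance, usually taken from the largest model considered.

\[
C_p = \frac{\mathrm{RSS}_p}{\hat\sigma^2} - n + 2p
\]

A criterion of this kind is computed separately for each candidate model and the values are then compared, which is the setting the abstract contrasts with the PDC.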

2017 Semester 2

Date: 8th of December, 2017
Speaker: Richard Hunt
University of Sydney
Location: Carslaw 173
Title: A New Look at Gegenbauer Long Memory Processes
In this presentation we will look at Long Memory and Gegenbauer Long Memory processes, and methods for estimation of the parameters of these models. After a review of the history of the development of these processes, and some of the personalities involved, we will introduce a new method for the estimation of almost all the parameters of a k-factor Gegenbauer/GARMA process. The method essentially attempts to find parameters for the spectral density to ensure it most closely matches the (smoothed) periodogram. Simulations indicate that the new method has a similar level of accuracy to existing methods (Whittle, Conditional Sum-of-squares), but can be evaluated considerably faster, whilst making few distributional assumptions on the data.
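
As background to the spectral matching idea, the standard form of the spectral density for a pure k-factor Gegenbauer process (no ARMA components) is sketched below. The notation is an assumption on our part and may differ from the speaker's: \(B\) is the backshift operator, \(\varepsilon_t\) is white noise with variance \(\sigma^2\), and \(u_j\) and \(d_j\) are the Gegenbauer and memory parameters of factor \(j\).

\[
\prod_{j=1}^{k} \left(1 - 2u_j B + B^2\right)^{d_j} X_t = \varepsilon_t,
\qquad
f(\omega) = \frac{\sigma^2}{2\pi} \prod_{j=1}^{k} \left| 2\left(\cos\omega - u_j\right) \right|^{-2d_j}
\]

For \(d_j > 0\) the density is unbounded at the Gegenbauer frequencies \(\omega_j = \arccos u_j\), and these are the peaks that a fit to the (smoothed) periodogram would need to reproduce.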

Date: 24th of November, 2017
Speaker: Prof. Sally Cripps
University of Sydney
Location: Carslaw 173
Title: A spatio-temporal mixture model for Australian daily rainfall, 1876–2015: Modeling daily rainfall over the Australian continent
Daily precipitation has an enormous impact on human activity, and the study of how it varies over time and space, and what global indicators influence it, is of paramount importance to Australian agriculture. The topic is complex and would benefit from a common and publicly available statistical framework that scales to large data sets. We propose a general Bayesian spatio-temporal mixture model accommodating mixed discrete-continuous data. Our analysis uses over 294 million daily rainfall measurements since 1876, spanning 17,606 rainfall measurement sites. The size of the data calls for a parsimonious yet flexible model as well as computationally efficient methods for performing the statistical inference. Parsimony is achieved by encoding spatial, temporal and climatic variation entirely within a mixture model whose mixing weights depend on covariates. Computational efficiency is achieved by constructing a Markov chain Monte Carlo sampler that runs in parallel in a distributed computing framework. We present examples of posterior inference on short-term daily component classification, monthly intensity levels, offsite prediction of the effects of climate drivers and long-term rainfall trends across the entire continent. Computer code implementing the methods proposed in this paper is available as an R package.

Date: 22nd of November, 2017
Speaker: Charles Gray
La Trobe University
Location: Carslaw 173
Title: The Curious Case of the Disappearing Coverage: a detective story in visualisation
Do you identify as a member of the ggplot cohort of statisticians? Did you or your students learn statistics in the era of visualisation tools such as R's ggplot package? Would it have made a difference to how you engaged with statistical theory? In this talk, I'll reflect on learning statistics at the same time as visualisation, at the half-way point in my doctoral studies. I'll share how we solved some counterintuitive coverage probability simulation results through visualisation. I see this as an opportunity to generate discussion and learn from you: questions, comments, and a generally rowdy atmosphere are most welcome.
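
As an illustration of what a coverage probability simulation looks like, the sketch below estimates how often a nominal 95% Wald interval for a binomial proportion contains the true value. It is a generic example and not the simulation from the talk; the function name, sample size and parameter values are assumptions chosen for illustration.

```python
import math
import random


def wald_coverage(p, n, reps=5000, z=1.959964, seed=1):
    """Estimate by simulation how often the Wald interval covers the true p."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(reps):
        # Draw one Binomial(n, p) count as a sum of Bernoulli trials.
        x = sum(rng.random() < p for _ in range(n))
        p_hat = x / n
        half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - half_width <= p <= p_hat + half_width:
            covered += 1
    return covered / reps


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.02):
        print(p, wald_coverage(p, n=30))
```

With a small sample and a true proportion near zero, the estimated coverage falls well below the nominal 95%, which is the kind of result that is easier to spot in a plot of coverage against the parameter than in a table.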

Date: 20th of October, 2017
Time: 1.15-3pm
Location: Access Grid Room
Interview seminar

Date: 13th of October, 2017
Time: 10am-12pm
Location: Carslaw 535
Interview seminar

Date: 13th of October, 2017
No seminar (Honours presentations date)

Date: 6th of October, 2017
Speaker: Kim-Anh Lê Cao
University of Melbourne
Location: Carslaw 173
Time: 2-3pm
Title: Challenges in microbiome data analysis (also known as "poop analyses")

Recent breakthroughs and advances in culture independent techniques, such as shotgun metagenomics and 16S rRNA amplicon sequencing, have dramatically changed the way we can examine microbial communities. But does the hype around the microbiome outweigh the potential of our understanding of this 'second genome'? There are many hurdles to tackle before we are able to identify and compare bacteria driving changes in their ecosystem. In addition to the bioinformatics challenges, current statistical methods are limited in their ability to make sense of these complex data, which are inherently sparse, compositional and multivariate.

I will discuss some of the topical challenges in 16S data analysis, including the presence of confounding variables and batch effects and some experimental design considerations, and share my own personal story of how a team of rogue statisticians conducted their own mice microbiome experiment, leading to somewhat surprising results! I will also present our latest analyses to identify multivariate microbial signatures in immune-mediated diseases and discuss the next analytical challenges I envision.

This presentation will combine the results of exciting and highly collaborative work between a team of eager data analysts, immunologists and microbiologists. For once, the speaker will abstain from talking about data integration, or mixOmics (oops! but if you are interested keep an eye out in PLOS Comp Biol).

Dr Kim-Anh Lê Cao (NHMRC career development fellow, Senior Lecturer) recently joined the University of Melbourne (Centre for Systems Genomics and School of Mathematics and Statistics). She was awarded her PhD from the Université de Toulouse, France, and moved to Australia as a postdoctoral research fellow at the Institute for Molecular Bioscience, University of Queensland. She was then hired as a researcher and consultant at QFAB Bioinformatics, where she developed a multidisciplinary approach to her research. Between 2014 and 2017 she led a computational biostatistics group at the biomedical research UQ Diamantina Institute. Dr Kim-Anh Lê Cao is an expert in multivariate statistical methods and novel developments. Since 2009, her team has been working on implementing the R toolkit mixOmics, dedicated to the integrative analysis of 'omics' data, to help researchers mine and make sense of biological data.

Date: 22nd of September, 2017
Speaker: Sharon Lee
University of Queensland
Location: Carslaw 173
Title: Clustering and classification of batch data
Motivated by the analysis of batch cytometric data, we consider the problem of jointly modelling and clustering multiple heterogeneous data samples. Traditional mixture models cannot be applied directly to these data. Intuitive approaches such as pooling and post-hoc cluster matching fail to account for the variations between the samples. In this talk, we consider a hierarchical mixture model approach to handle inter-sample variations. The adoption of a skew mixture model with random effects terms for the location parameter allows for the simultaneous clustering and matching of clusters across the samples. In the case where data from multiple classes of objects are available, this approach can be further extended to perform classification of new samples into one of the predefined classes. Examples with real cytometry data will be given to illustrate this approach.

Date: 15th of September, 2017
Speaker: Emi Tanaka
University of Sydney
Location: Carslaw 173
Title: Outlier detection for a complex linear mixed model: an application to plant breeding trials
Outlier detection is an important preliminary step in data analysis, often conducted through a form of residual analysis. Complex data, such as those analysed by linear mixed models, give rise to distinct levels of residuals and thus pose additional challenges for the development of an outlier detection method. Plant breeding trials are routinely conducted over years and multiple locations with the aim of selecting the best genotypes as parents or for commercial release. These so-called multi-environmental trials (MET) are commonly analysed using linear mixed models, which may include cubic splines and autoregressive processes to account for spatial trends. We consider some statistics derived from the mean and variance shift outlier models (MSOM/VSOM) and the generalised Cook's distance (GCD) for outlier detection. We present a simulation study based on a set of real wheat yield trials.

Date: 11th of August, 2017
Speaker: Ming Yuan
University of Wisconsin-Madison
Location: Carslaw 173
Title: Quantitation in Colocalization Analysis: Beyond "Red + Green = Yellow"
"I see yellow; therefore, there is colocalization." Is it really so simple when it comes to colocalization studies? Unfortunately, and fortunately, no. Colocalization is in fact a supremely powerful technique for scientists who want to take full advantage of what optical microscopy has to offer: quantitative, correlative information together with spatial resolution. Yet, methods for colocalization have been put into doubt now that images are no longer considered simple visual representations. Colocalization studies have notoriously been subject to misinterpretation due to difficulties in robust quantification and, more importantly, reproducibility, which results in a constant source of confusion, frustration, and error. In this talk, I will share some of our effort and progress to ease such challenges using novel statistical and computational tools.

Bio: Ming Yuan is Senior Investigator at Morgridge Institute for Research and Professor of Statistics at Columbia University and University of Wisconsin-Madison. He was previously Coca-Cola Junior Professor in the H. Milton Stewart School of Industrial and Systems Engineering at Georgia Institute of Technology. He received his Ph.D. in Statistics and M.S. in Computer Science from University of Wisconsin-Madison. His main research interests lie in theory, methods and applications of data mining and statistical learning. Dr. Yuan has been serving on editorial boards of various top journals including The Annals of Statistics, Bernoulli, Biometrics, Electronic Journal of Statistics, Journal of the American Statistical Association, Journal of the Royal Statistical Society Series B, and Statistical Science. Dr. Yuan was awarded the John van Ryzin Award in 2004 by ENAR, CAREER Award in 2009 by NSF, and Guy Medal in Bronze from the Royal Statistical Society in 2014. He was also named a Fellow of IMS in 2015, and a Medallion Lecturer of IMS.

Date: 13th of July, 2017
Speaker: Irene Gijbels
University of Leuven (KU Leuven)
Location: AGR Carslaw 829
Title: Robust estimation and variable selection in linear regression
In this talk the interest is in robust procedures to select variables in a multiple linear regression modeling context. Throughout the talk the focus is on how to adapt the nonnegative garrote selection method to get to a robust variable selection method. We establish estimation and variable selection consistency properties of the developed method, and discuss robustness properties such as breakdown point and influence function. In a second part of the talk the focus is on heteroscedastic linear regression models, in which one also wants to select the variables that influence the variance part. Methods for robust estimation and variable selection are discussed, and illustrations of their influence functions are provided. Throughout the talk examples are given to illustrate the practical use of the methods.

Date: 30th of June, 2017
Speaker: Ines Wilms
University of Leuven (KU Leuven)
Location: AGR Carslaw 829
Title: Sparse cointegration
Cointegration analysis is used to estimate the long-run equilibrium relations between several time series. The coefficients of these long-run equilibrium relations are the cointegrating vectors. We provide a sparse estimator of the cointegrating vectors. Sparsity means that some elements of the cointegrating vectors are estimated as exactly zero. The sparse estimator is applicable in high-dimensional settings, where the time series length is short relative to the number of time series. Our method achieves better estimation accuracy than the traditional Johansen method in sparse and/or high-dimensional settings. We use the sparse method for interest rate growth forecasting and consumption growth forecasting. We show that forecast performance can be improved by sparsely estimating the cointegrating vectors.

Joint work with Christophe Croux.

Date: 19th of May, 2017
Speaker: Dianne Cook
Monash University
Title: The glue that binds statistical inference, tidy data, grammar of graphics, data visualisation and visual inference

Buja et al (2009) and Majumder et al (2012) established and validated protocols that place data plots into the statistical inference framework. This combines with the conceptual grammar of graphics, initiated by Wilkinson (1999) and refined and made popular in the R package ggplot2 (Wickham, 2016), which builds plots using a functional language. The tidy data concepts made popular with the R packages tidyr (Wickham, 2017) and dplyr (Wickham and Francois, 2016) complete the mapping from random variables to plot elements.

Visualisation plays a large role in data science today. It is important for exploring data and detecting unanticipated structure. Visual inference provides the opportunity to assess discovered structure rigorously, using p-values computed by crowd-sourcing lineups of plots. Visualisation is also important for communicating results, and we often agonise over different choices in plot design to arrive at a final display. Treating plots as statistics, we can make power calculations to objectively determine the best design.

This talk will be interactive. Email your favourite plot ahead of time. We will work in groups to break the plot down in terms of the grammar, relate this to random variables using tidy data concepts, determine the intended null hypothesis underlying the visualisation, and hence structure it as a hypothesis test. Bring your laptop, so we can collaboratively do this exercise.

Joint work with Heike Hofmann, Mahbubul Majumder and Hadley Wickham

Date: 5th of May, 2017
Speaker: Peter Straka
University of New South Wales
Title: Extremes of events with heavy-tailed inter-arrival times

Heavy-tailed inter-arrival times are a signature of "bursty" dynamics, and have been observed in financial time series, earthquakes, solar flares and neuron spike trains. We propose to model extremes of such time series via a "Max-Renewal process" (aka "Continuous Time Random Maxima process"). Due to geometric sum-stability, the inter-arrival times between extremes are attracted to a Mittag-Leffler distribution: As the threshold height increases, the Mittag-Leffler shape parameter stays constant, while the scale parameter grows like a power-law. Although the renewal assumption is debatable, this theoretical result is observed for many datasets. We discuss approaches to fit model parameters and assess uncertainty due to threshold selection.
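
For readers who have not met the Mittag-Leffler distribution, its usual definition through the survival function is given below. The symbols are our own choice and are not taken from the talk: \(\beta \in (0, 1]\) is the shape parameter, \(\sigma > 0\) is the scale parameter, and \(E_\beta\) is the one-parameter Mittag-Leffler function.

\[
\Pr(T > t) = E_\beta\!\left(-\left(t/\sigma\right)^{\beta}\right),
\qquad
E_\beta(z) = \sum_{k=0}^{\infty} \frac{z^k}{\Gamma(1 + \beta k)}
\]

Setting \(\beta = 1\) recovers the exponential distribution, while for \(\beta < 1\) the survival function decays like a power law in \(t\), which is what makes the distribution a natural limit for heavy-tailed waiting times.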

Date: 28th of April, 2017
Speaker: Botond Szabo
Leiden University
Title: An asymptotic analysis of nonparametric distributed methods

In recent years, datasets in certain applications have become so large that it is infeasible, or computationally undesirable, to carry out the analysis on a single machine. This gave rise to divide-and-conquer algorithms where the data is distributed over several 'local' machines and the computations are done on these machines in parallel. The outcomes of the local computations are then aggregated into a global result on a central machine. Over the years various divide-and-conquer algorithms have been proposed, many of them with limited theoretical underpinning. First we compare the theoretical properties of a (not complete) list of proposed methods on the benchmark nonparametric signal-in-white-noise model. Most of the investigated algorithms use information on aspects of the underlying true signal (for instance regularity), which is usually not available in practice. A central question is whether one can tune the algorithms in a data-driven way, without using any additional knowledge about the signal. We show that (a list of) standard data-driven techniques (both Bayesian and frequentist) cannot recover the underlying signal at the minimax rate. This, however, does not imply the non-existence of an adaptive distributed method. To address the theoretical limitations of data-driven divide-and-conquer algorithms we consider a setting where the amount of information sent between the local and central machines is expensive and limited. We show that it is not possible to construct data-driven methods which adapt to the unknown regularity of the underlying signal and at the same time communicate the optimal amount of information between the machines. This is joint work with Harry van Zanten.

About the speaker:

Botond Szabo is an Assistant Professor at the University of Leiden, The Netherlands. Botond received his PhD in Mathematical Statistics from the Eindhoven University of Technology, the Netherlands, in 2014 under the supervision of Prof.dr. Harry van Zanten and Prof.dr. Aad van der Vaart. His research interests cover Nonparametric Bayesian Statistics, Adaptation, Asymptotic Statistics, Operations Research and Graph Theory. He was runner-up for the Savage Award in Theory and Methods, given for the best PhD dissertation in the field of Bayesian statistics and econometrics, and received the "Van Zwet Award" for the best PhD dissertation in the Netherlands in Statistics and Operations Research 2015. He is an Associate Editor of Bayesian Analysis.

Date: 7th of April, 2017
Speaker: John Ormerod
University of Sydney
Title: Bayesian hypothesis tests with diffuse priors: Can we have our cake and eat it too?

We introduce a new class of priors for Bayesian hypothesis testing, which we name "cake priors". These priors circumvent Bartlett's paradox (also called the Jeffreys-Lindley paradox), the problem whereby the use of diffuse priors leads to nonsensical statistical inferences. Cake priors allow the use of diffuse priors (having one's cake) while achieving theoretically justified inferences (eating it too). Lindley's paradox will also be discussed. A novel construct involving a hypothetical data-model pair will be used to extend cake priors to handle the case where there are zero free parameters under the null hypothesis. The resulting test statistics take the form of a penalized likelihood ratio test statistic. By considering the sampling distribution under the null and alternative hypotheses we show (under certain assumptions) that these Bayesian hypothesis tests are strongly Chernoff-consistent, i.e., achieve zero type I and II errors asymptotically. This sharply contrasts with classical tests, where the level of the test is held constant and which are therefore not Chernoff-consistent.

Joint work with: Michael Stewart, Weichang Yu, and Sarah Romanes.

Date: 7th of April, 2017
Speaker: Shige Peng
Shandong University
Title: Data-based Quantitative Analysis under Nonlinear Expectations

Traditionally, a real-life random sample is often treated as measurements resulting from an i.i.d. sequence of random variables or, more generally, as an outcome of either linear or nonlinear regression models driven by an i.i.d. sequence. In many situations, however, this standard modeling approach fails to address the complexity of real-life random data. We argue that it is necessary to take into account the uncertainty hidden inside random sequences that are observed in practice.

To deal with this issue, we introduce a robust nonlinear expectation to quantitatively measure and calculate this type of uncertainty. The corresponding fundamental concept of a 'nonlinear i.i.d. sequence' is used to model a large variety of real-world random phenomena. We give a robust and simple algorithm, called 'phi-max-mean', which can be used to measure this type of uncertainty, and we show that it provides an asymptotically optimal unbiased estimator of the corresponding nonlinear distribution.

Date: 17th of March, 2017
Speaker: Joe Neeman
University of Texas at Austin
Title: Gaussian vectors, half-spaces, and convexity

Let \(A\) be a subset of \(\mathbb{R}^n\) and let \(B\) be a half-space with the same Gaussian measure as \(A\). For a pair of correlated Gaussian vectors \(X\) and \(Y\), \(\mathrm{Pr}(X \in A, Y \in A)\) is smaller than \(\mathrm{Pr}(X \in B, Y \in B)\); this was originally proved by Borell, who also showed various other extremal properties of half-spaces. For example, the exit time of an Ornstein-Uhlenbeck process from \(A\) is stochastically dominated by its exit time from \(B\).

We will discuss these (and other) inequalities using a kind of modified convexity.

Date: 3rd of March, 2017
Speaker: Ron Shamir
Tel Aviv University
Title: Modularity, classification and networks in analysis of big biomedical data

Supervised and unsupervised methods have been used extensively to analyze genomics data, with mixed results. On one hand, new insights have led to new biological findings. On the other hand, analysis results were often not robust. Here we take a look at several such challenges from the perspectives of networks and big data. Specifically, we ask if and how the added information from a biological network helps in these challenges. We show both examples where the network added information is invaluable, and others where it is questionable. We also show that by collectively analyzing omic data across multiple studies of many diseases, robustness greatly improves.

Date: 31st of January, 2017
Speaker: Genevera Allen
Rice University
Title: Networks for Big Biomedical data

Cancer and neurological diseases are among the top five causes of death in Australia. The good news is new Big Data technologies may hold the key to understanding causes and possible cures for cancer as well as understanding the complexities of the human brain.

Join us on a voyage of discovery as we highlight how data science is transforming medical research: networks can be used to visualise and mine big biomedical data, disrupting neurological diseases. Using real case studies, see how cutting-edge data science is bringing us closer than ever before to major medical breakthroughs.

Information for visitors

Enquiries about the Statistics Seminar should be directed to the organiser Garth Tarr.