# University of Sydney Statistics Seminar Series 2017

*Unless otherwise specified, seminars will be held on Fridays at 2pm in Carslaw 173.*

Date: 24th of November, 2017

Prof. Sally Cripps

University of Sydney

Location: Carslaw 173

*Title:*

*Abstract:*

Date: 13th of October, 2017

No seminar (Honours presentations date)

Date: 6th of October, 2017

Kim-Anh Lê Cao

University of Melbourne

Location: Carslaw 173

*Title:*

*Abstract:*

Date: 25th of September, 2017

Speaker: Genevera Allen (to be confirmed)

Rice University

Date: 22nd of September, 2017

Speaker: Sharon Lee

University of Queensland

Location: Carslaw 173

*Title: Clustering and classification of batch data*

*Abstract:* Motivated by the analysis of batch cytometric data, we consider the problem of jointly modelling and clustering multiple heterogeneous data samples. Traditional mixture models cannot be applied directly to these data. Intuitive approaches such as pooling and post-hoc cluster matching fail to account for the variations between the samples. In this talk, we consider a hierarchical mixture model approach to handle inter-sample variations. The adoption of a skew mixture model with random effects terms for the location parameter allows for the simultaneous clustering and matching of clusters across the samples. In the case where data from multiple classes of objects are available, this approach can be further extended to perform classification of new samples into one of the predefined classes. Examples with real cytometry data will be given to illustrate this approach.

Date: 15th of September, 2017

Speaker: Emi Tanaka

University of Sydney

Location: Carslaw 173

*Title: Outlier detection for a complex linear mixed model: an application to plant breeding trials*

*Abstract:* Outlier detection is an important preliminary step in data analysis, often conducted through a form of residual analysis. Complex data, such as those analysed by linear mixed models, give rise to distinct levels of residuals and thus pose additional challenges for the development of an outlier detection method. Plant breeding trials are routinely conducted over years and multiple locations with the aim of selecting the best genotypes as parents or for commercial release. These so-called multi-environmental trials (MET) are commonly analysed using linear mixed models, which may include cubic splines and autoregressive processes to account for spatial trends. We consider some statistics derived from the mean and variance shift outlier models (MSOM/VSOM) and the generalised Cook's distance (GCD) for outlier detection. We present a simulation study based on a set of real wheat yield trials.

Date: 11th of August, 2017

Speaker: Ming Yuan

University of Wisconsin-Madison

Location: Carslaw 173

*Title: Quantitation in Colocalization Analysis: Beyond "Red + Yellow = Green"*

*Abstract:* "I see yellow; therefore, there is colocalization." Is it really so simple when it comes to colocalization studies? Unfortunately, and fortunately, no. Colocalization is in fact a supremely powerful technique for scientists who want to take full advantage of what optical microscopy has to offer: quantitative, correlative information together with spatial resolution. Yet, methods for colocalization have been put into doubt now that images are no longer considered simple visual representations. Colocalization studies have notoriously been subject to misinterpretation due to difficulties in robust quantification and, more importantly, reproducibility, which results in a constant source of confusion, frustration, and error. In this talk, I will share some of our effort and progress to ease such challenges using novel statistical and computational tools.

Bio: Ming Yuan is Senior Investigator at Morgridge Institute for Research and Professor of Statistics at Columbia University and University of Wisconsin-Madison. He was previously Coca-Cola Junior Professor in the H. Milton Stewart School of Industrial and Systems Engineering at Georgia Institute of Technology. He received his Ph.D. in Statistics and M.S. in Computer Science from University of Wisconsin-Madison. His main research interests lie in theory, methods and applications of data mining and statistical learning. Dr. Yuan has served on the editorial boards of various top journals including The Annals of Statistics, Bernoulli, Biometrics, Electronic Journal of Statistics, Journal of the American Statistical Association, Journal of the Royal Statistical Society Series B, and Statistical Science. Dr. Yuan was awarded the John van Ryzin Award in 2004 by ENAR, CAREER Award in 2009 by NSF, and Guy Medal in Bronze from the Royal Statistical Society in 2014. He was also named a Fellow of IMS in 2015, and a Medallion Lecturer of IMS.

Date: 13th of July, 2017

Speaker: Irene Gijbels

University of Leuven (KU Leuven)

Location: AGR Carslaw 829

*Title: Robust estimation and variable selection in linear regression*

*Abstract:* In this talk the interest is in robust procedures to select variables in a multiple linear regression modeling context. Throughout the talk the focus is on how to adapt the nonnegative garrote selection method to get to a robust variable selection method. We establish estimation and variable selection consistency properties of the developed method, and discuss robustness properties such as breakdown point and influence function. In a second part of the talk the focus is on heteroscedastic linear regression models, in which one also wants to select the variables that influence the variance part. Methods for robust estimation and variable selection are discussed, and illustrations of their influence functions are provided. Throughout the talk examples are given to illustrate the practical use of the methods.

Date: 30th of June, 2017

Speaker: Ines Wilms

University of Leuven (KU Leuven)

Location: AGR Carslaw 829

*Title: Sparse cointegration*

*Abstract:* Cointegration analysis is used to estimate the long-run equilibrium relations between several time series. The coefficients of these long-run equilibrium relations are the cointegrating vectors. We provide a sparse estimator of the cointegrating vectors. Sparsity means that some elements of the cointegrating vectors are estimated as exactly zero. The sparse estimator is applicable in high-dimensional settings, where the time series length is short relative to the number of time series. Our method achieves better estimation accuracy than the traditional Johansen method in sparse and/or high-dimensional settings. We use the sparse method for interest rate growth forecasting and consumption growth forecasting. We show that forecast performance can be improved by sparsely estimating the cointegrating vectors.

Joint work with Christophe Croux.

Date: 19th of May, 2017

Speaker: Dianne Cook

Monash University

*Title: The glue that binds statistical inference, tidy data, grammar of graphics, data visualisation and visual inference*

*Abstract:* Buja et al (2009) and Majumder et al (2012) established and validated protocols that place data plots into the statistical inference framework. This, combined with the conceptual grammar of graphics initiated by Wilkinson (1999) and refined and made popular in the R package ggplot2 (Wickham, 2016), builds plots using a functional language. The tidy data concepts made popular with the R packages tidyr (Wickham, 2017) and dplyr (Wickham and Francois, 2016) complete the mapping from random variables to plot elements.

Visualisation plays a large role in data science today. It is important for exploring data and detecting unanticipated structure. Visual inference provides the opportunity to assess discovered structure rigorously, using p-values computed by crowd-sourcing lineups of plots. Visualisation is also important for communicating results, and we often agonise over different choices in plot design to arrive at a final display. Treating plots as statistics, we can make power calculations to objectively determine the best design.

This talk will be interactive. Email your favourite plot to dicook@monash.edu ahead of time. We will work in groups to break the plot down in terms of the grammar, relate this to random variables using tidy data concepts, determine the intended null hypothesis underlying the visualisation, and hence structure it as a hypothesis test. Bring your laptop, so we can collaboratively do this exercise.

Joint work with Heike Hofmann, Mahbubul Majumder and Hadley Wickham

Date: 5th of May, 2017

Speaker: Peter Straka

University of New South Wales

*Title: Extremes of events with heavy-tailed inter-arrival times*

*Abstract:* Heavy-tailed inter-arrival times are a signature of "bursty" dynamics, and have been observed in financial time series, earthquakes, solar flares and neuron spike trains. We propose to model extremes of such time series via a "Max-Renewal process" (aka "Continuous Time Random Maxima process"). Due to geometric sum-stability, the inter-arrival times between extremes are attracted to a Mittag-Leffler distribution: As the threshold height increases, the Mittag-Leffler shape parameter stays constant, while the scale parameter grows like a power-law. Although the renewal assumption is debatable, this theoretical result is observed for many datasets. We discuss approaches to fit model parameters and assess uncertainty due to threshold selection.

Date: 28th of April, 2017

Speaker: Botond Szabo

Leiden University

*Title: An asymptotic analysis of nonparametric distributed methods*

*Abstract:* In recent years, in certain applications, datasets have become so large that it is infeasible, or computationally undesirable, to carry out the analysis on a single machine. This gave rise to divide-and-conquer algorithms where the data is distributed over several 'local' machines and the computations are done on these machines in parallel. The outcomes of the local computations are then aggregated to a global result in a central machine. Over the years various divide-and-conquer algorithms were proposed, many of them with limited theoretical underpinning. First we compare the theoretical properties of a (not complete) list of proposed methods on the benchmark nonparametric signal-in-white-noise model. Most of the investigated algorithms use information on aspects of the underlying true signal (for instance regularity), which is usually not available in practice. A central question is whether one can tune the algorithms in a data-driven way, without using any additional knowledge about the signal. We show that (a list of) standard data-driven techniques (both Bayesian and frequentist) cannot recover the underlying signal with the minimax rate. This, however, does not imply the non-existence of an adaptive distributed method. To address the theoretical limitations of data-driven divide-and-conquer algorithms we consider a setting where the amount of information sent between the local and central machines is expensive and limited. We show that it is not possible to construct data-driven methods which adapt to the unknown regularity of the underlying signal and at the same time communicate the optimal amount of information between the machines. This is joint work with Harry van Zanten.
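As a rough illustration of the divide-and-conquer idea described in the abstract, the following toy sketch simulates naive averaging of local estimators in a sequence-space version of the signal-in-white-noise model. It is not the speaker's method. The signal `theta`, the per-machine noise level `sqrt(m / n)`, the truncation rule `cutoff` and the function name `distributed_average_demo` are all assumptions made for this example. In particular, the cut-off uses the true regularity `beta`, which is exactly the kind of information the abstract notes is usually unavailable in practice.

```python
import math
import random


def distributed_average_demo(n=10_000, m=10, beta=1.0, n_coef=200, seed=1):
    """Toy divide-and-conquer estimate in a sequence-space white-noise model."""
    rng = random.Random(seed)
    # Hypothetical "true" signal with polynomially decaying coefficients.
    theta = [i ** -(beta + 1.0) for i in range(1, n_coef + 1)]
    # Assumption: each local machine sees the signal at noise level sqrt(m / n).
    noise_sd = math.sqrt(m / n)
    # Non-adaptive tuning: the cut-off is chosen using the true regularity beta.
    cutoff = int(n ** (1.0 / (1.0 + 2.0 * beta)))

    def squared_error(estimate):
        return sum((e - t) ** 2 for e, t in zip(estimate, theta))

    # Local step: each machine keeps its first `cutoff` noisy coefficients.
    local_estimates = []
    for _ in range(m):
        local = [
            theta[i] + noise_sd * rng.gauss(0.0, 1.0) if i < cutoff else 0.0
            for i in range(n_coef)
        ]
        local_estimates.append(local)

    # Central step: average the local estimates coordinate by coordinate.
    global_estimate = [sum(column) / m for column in zip(*local_estimates)]

    mean_local_error = sum(squared_error(e) for e in local_estimates) / m
    return squared_error(global_estimate), mean_local_error


if __name__ == "__main__":
    global_error, mean_local_error = distributed_average_demo()
    print("squared error of averaged estimate:", global_error)
    print("mean squared error of local estimates:", mean_local_error)
```

By convexity of the squared norm, the averaged estimate can never have a larger squared error than the average of the local errors. The harder question raised in the talk is how to choose the tuning parameter without knowing `beta`.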

About the speaker:

Botond Szabo is an Assistant Professor at the University of Leiden, The Netherlands. Botond received his PhD in Mathematical Statistics from the Eindhoven University of Technology, the Netherlands, in 2014 under the supervision of Prof.dr. Harry van Zanten and Prof.dr. Aad van der Vaart. His research interests cover Nonparametric Bayesian Statistics, Adaptation, Asymptotic Statistics, Operations Research and Graph Theory. He was runner-up for the Savage Award in Theory and Methods, given for the best PhD dissertation in the field of Bayesian statistics and econometrics, and received the "Van Zwet Award" for the best PhD dissertation in the Netherlands in Statistics and Operations Research 2015. He is an Associate Editor of Bayesian Analysis. You can find more about him here: http://math.bme.hu/~bszabo/index_en.html

Date: 7th of April, 2017

Speaker: John Ormerod

Sydney University

*Title: Bayesian hypothesis tests with diffuse priors: Can we have our cake and eat it too?*

*Abstract:* We introduce a new class of priors for Bayesian hypothesis testing, which we name "cake priors". These priors circumvent Bartlett's paradox (also called the Jeffreys-Lindley paradox), the problem associated with the use of diffuse priors leading to nonsensical statistical inferences. Cake priors allow the use of diffuse priors (having one's cake) while achieving theoretically justified inferences (eating it too). Lindley's paradox will also be discussed. A novel construct involving a hypothetical data-model pair will be used to extend cake priors to handle the case where there are zero free parameters under the null hypothesis. The resulting test statistics take the form of a penalized likelihood ratio test statistic. By considering the sampling distribution under the null and alternative hypotheses we show (under certain assumptions) that these Bayesian hypothesis tests are strongly Chernoff-consistent, i.e., achieve zero type I and II errors asymptotically. This sharply contrasts with classical tests, where the level of the test is held constant and which are therefore not Chernoff-consistent.

Joint work with: Michael Stewart, Weichang Yu, and Sarah Romanes.

Date: 7th of April, 2017

Speaker: Shige Peng

Shandong University

*Title: Data-based Quantitative Analysis under Nonlinear Expectations*

*Abstract:* Traditionally, a real-life random sample is often treated as measurements resulting from an i.i.d. sequence of random variables or, more generally, as an outcome of either linear or nonlinear regression models driven by an i.i.d. sequence. In many situations, however, this standard modeling approach fails to address the complexity of real-life random data. We argue that it is necessary to take into account the uncertainty hidden inside random sequences that are observed in practice.

To deal with this issue, we introduce a robust nonlinear expectation to quantitatively measure and calculate this type of uncertainty. The corresponding fundamental concept of a 'nonlinear i.i.d. sequence' is used to model a large variety of real-world random phenomena. We give a robust and simple algorithm, called 'phi-max-mean', which can be used to measure this type of uncertainty, and we show that it provides an asymptotically optimal unbiased estimator of the corresponding nonlinear distribution.

Date: 17th of March, 2017

Speaker: Joe Neeman

University of Texas Austin

*Title: Gaussian vectors, half-spaces, and convexity*

*Abstract:* Let \(A\) be a subset of \(\mathbb{R}^n\) and let \(B\) be a half-space with the same Gaussian measure as \(A\). For a pair of correlated Gaussian vectors \(X\) and \(Y\), \(\mathrm{Pr}(X \in A, Y \in A)\) is smaller than \(\mathrm{Pr}(X \in B, Y \in B)\); this was originally proved by Borell, who also showed various other extremal properties of half-spaces. For example, the exit time of an Ornstein-Uhlenbeck process from \(A\) is stochastically dominated by its exit time from \(B\).

We will discuss these (and other) inequalities using a kind of modified convexity.
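The inequality attributed to Borell can be checked numerically in a simple case. The sketch below is only an illustration under stated assumptions, not material from the talk: it takes \(n = 2\), lets \(A\) be a centred disc of Gaussian measure 1/2 and \(B\) the half-space \(\{x_1 \le 0\}\), and assumes each coordinate pair of \(X\) and \(Y\) has positive correlation `rho`. The function name `estimate_joint_probabilities` and the Monte Carlo settings are hypothetical choices.

```python
import math
import random


def estimate_joint_probabilities(rho, n_samples=100_000, seed=0):
    """Monte Carlo estimates of Pr(X in A, Y in A) and Pr(X in B, Y in B)."""
    rng = random.Random(seed)
    # Disc A of radius r has Gaussian measure 1 - exp(-r^2 / 2); set it to 1/2.
    radius_sq = 2.0 * math.log(2.0)
    scale = math.sqrt(1.0 - rho * rho)
    hits_disc = 0
    hits_half_space = 0
    for _ in range(n_samples):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Y is standard Gaussian with coordinate-wise correlation rho to X.
        y1 = rho * x1 + scale * rng.gauss(0.0, 1.0)
        y2 = rho * x2 + scale * rng.gauss(0.0, 1.0)
        if x1 * x1 + x2 * x2 <= radius_sq and y1 * y1 + y2 * y2 <= radius_sq:
            hits_disc += 1
        # Half-space B = {x1 <= 0} also has Gaussian measure 1/2.
        if x1 <= 0.0 and y1 <= 0.0:
            hits_half_space += 1
    return hits_disc / n_samples, hits_half_space / n_samples


if __name__ == "__main__":
    p_disc, p_half_space = estimate_joint_probabilities(rho=0.5)
    print("disc:", p_disc, "half-space:", p_half_space)
```

For the half-space the exact value is \(1/4 + \arcsin(\rho)/(2\pi)\), which gives a convenient check on the simulation. The estimate for the disc should come out smaller, in line with the stated inequality.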

Date: 3rd of March, 2017

Speaker: Ron Shamir

Tel Aviv University

*Title: Modularity, classification and networks in analysis of big biomedical data*

*Abstract:* Supervised and unsupervised methods have been used extensively to analyze genomics data, with mixed results. On one hand, new insights have led to new biological findings. On the other hand, analysis results were often not robust. Here we take a look at several such challenges from the perspectives of networks and big data. Specifically, we ask if and how the added information from a biological network helps in these challenges. We show both examples where the network added information is invaluable, and others where it is questionable. We also show that by collectively analyzing omic data across multiple studies of many diseases, robustness greatly improves.

Date: 31st of January, 2017

Speaker: Genevera Allen

Rice University

*Title: Networks for Big Biomedical data*

*Abstract:* Cancer and neurological diseases are among the top five causes of death in Australia. The good news is new Big Data technologies may hold the key to understanding causes and possible cures for cancer as well as understanding the complexities of the human brain.

Join us on a voyage of discovery as we highlight how data science is transforming medical research: networks can be used to visualise and mine big biomedical data disrupting neurological diseases. Using real case studies, see how cutting-edge data science is bringing us closer than ever before to major medical breakthroughs.

### Information for visitors

Enquiries about the Statistics Seminar should be directed to the organizer John Ormerod.

Last updated on 22 May 2017.