2024-2025

STATQAM seminars take place at 3:30 p.m. (Eastern Time); some will be held in person in room PK-5115 and others online via Zoom.

Please contact Michaël Lalancette (lalancette.michael@uqam.ca) if you would like to be added to the seminar mailing list.

Fall 2024 Session

September 12: Ryan Campbell (Lancaster University)

Title: New developments for a geometric approach to multivariate extremal inference

Abstract: Multivariate extreme value inference focuses on modelling several simultaneous processes while taking into account their extremal dependence. That is, it considers the behaviour of all combinations of processes as their values grow large. Until recently, different types of extremal dependence required different modelling procedures, resulting in a lack of a unifying approach to tackle multivariate extremes. A recent development in multivariate extremes remedies this by using the geometry of the dataset to perform inference on the multivariate tail. A key quantity in this inference is the gauge function, whose values define this geometry. Inference for the geometric approach relies on a pseudo radial-angular decomposition of random vectors in light-tailed margins: modelling radii conditioned upon angles, the gauge function appears as a rate parameter in a truncated gamma model. In this talk, I’ll present two methods to estimate the gauge function given data. The first relies on parametric assumptions on the form of the gauge function. The second is semi-parametric, interpolating the domain of the gauge function in a piecewise-linear fashion. This results in a simple construction that is flexible on data with extremal dependence behaviour that is difficult to parameterise, and works better in high dimensions. The piecewise-linear gauge function can be used to define both a radial and an angular model, allowing for the joint fitting of extremal pseudo-polar coordinates. This new methodology is applied to environmental datasets, a setting where classical multivariate extremes methods often struggle due to the potential combination of dependence and independence in the joint tails.

This is joint work with my PhD supervisor, Jennifer Wadsworth.

September 19: Mufan Li (Princeton University)

Title: The Proportional Scaling Limit of Neural Networks

Abstract: Recent advances in deep learning performance have all relied on scaling up the number of parameters within neural networks, consequently making asymptotic scaling limits a compelling approach to theoretical analysis. However, current research predominantly focuses on infinite-width limits and is thus unable to adequately analyze the role of depth in deep networks. In this talk, we explore a unique approach by studying the proportional infinite-depth-and-width limit.

First, we show that large depth networks necessarily require a shaping of the non-linearities to achieve a well-behaved limit. We then characterize the limiting distribution of the shaped network at initialization via a stochastic differential equation (SDE) for the feature covariance matrix. Furthermore, in the linear network setting, we can characterize the spectrum of the covariance matrix in the large data limit via a geometric variant of Dyson Brownian motions.

October 3: Marouane Il Idrissi (Université du Québec à Montréal)

Title: Interpretability of black-box models with dependent variables

Abstract: How can one interpret a black-box model without knowing its form? This question is central to many problems related, in particular, to the use of machine learning algorithms in sensitive domains. In this talk, we focus on Sobol’ indices, which aim to quantify the importance of a model’s random variables. However, these indices, which stem from Hoeffding’s functional decomposition, lose their meaning as soon as the variables are not mutually independent. Shapley values, and cooperative game theory in particular, promise a solution to this shortcoming. However, we will see, through concrete examples from industry, that these solutions can be misleading. Finally, we will turn to recent developments that generalize Hoeffding’s decomposition and open up new ways of approaching the interpretation of black-box models.

October 10: Arthur Chatton (Université Laval)

Title: Checking the causal positivity assumption

Abstract: Causal inference is a two-step process. First comes identification, which links an association that can be estimated from the data to a conceptual causal effect. Then comes estimation. The causal positivity assumption (every individual must be able to receive each of the treatment levels under study) is required for both steps. Unfortunately, it is often set aside in observational studies, presumably because it is difficult to check. A positivity violation occurs when some individuals in the sample have too extreme a probability of receiving a given treatment level. We developed an algorithm based on a sequence of regression trees of increasing complexity that model treatment allocation as a function of sample characteristics in order to identify these individuals. We reanalyzed four studies published by our team in which positivity violations were suspected, confirming two and refuting one. The algorithm was recently extended to the longitudinal setting, where treatment allocation varies over time. A study on the initiation of antiretroviral treatment in HIV-positive children was reanalyzed in turn. This algorithm is a quick and easy way to check positivity in causal studies and can be adapted to more complex settings.

October 17: Sophie Dabo-Niang (Université de Lille)

Title: Functional Data Analysis in Complex Dependencies: A PCA Approach for Learning Models

Abstract: Functional data, representing observations from complex processes, present significant challenges in modeling non-stationary time or spatially dependent phenomena such as curves, shapes, images, and other intricate structures. This talk focuses on Principal Component Analysis (PCA) tailored for complex functional datasets, including case-control studies, time series, and spatial data. We will explore the interplay between the functional characteristics and the inherent dependencies in the data, revealing underlying structures and patterns in stratified, spatial, or space-time datasets.

We will provide an overview of complex functional data, emphasizing their prevalence across diverse domains like environmental monitoring, geostatistics, and biomedical research. A key focus will be on the theoretical foundations of Functional Principal Component Analysis, highlighting its flexibility in analyzing dependent data.

We will present practical applications of functional PCA, particularly in identifying temporal or spatial dependencies, capturing variability, and reducing dimensionality. Real-world case studies will demonstrate the effectiveness of these techniques in various contexts.

Finally, we will address challenges associated with applying PCA to learning from complex functional data, such as managing infinite sample properties, data dependency, large datasets, and computational demands.

October 24: Alex Stringer (University of Waterloo)

Title: Two New Methods for Nonlinear Regression in Epidemiology and Environmental Toxicology

Abstract: I discuss two new methods involving additive models that are relevant to environmental epidemiology and toxicology. The first is a new cumulative exposure additive model for overdispersed count data in which the covariate being smoothed is the integrated weighted exposure to a pollutant. The weight function and the regression function are both unknown and modelled using penalized splines. The method is used to analyze several years of daily health outcome counts and their association with cumulative exposure to three air pollutants in various regions across Canada, as part of an active collaboration with Health Canada in support of the Air Health Trend Indicator project. The second is a new approach to the determination of allowable doses in environmental toxicology. The dose-response curve is fit using monotone splines and the benchmark dose and lower limit are obtained using fast implementations of Newton’s method that make use of de Boor’s algorithm for spline curve evaluation. The method is applied to the study of prenatal alcohol exposure and child cognition using data from six NIH-funded longitudinal cohort studies. The common theme of efficient computation with splines unites these two seemingly unrelated methodologies. If time permits, I will also discuss ongoing efforts to develop general hypothesis tests for linearity in multiple-component additive models and of zero variance components in random effects models more generally. Based on joint work with Tianyi Pan, Glen McGee, Tugba Akkaya Hocagil, Richard Cook, Louise Ryan, Sandra and Joseph Jacobson, and Jeffrey Negrea.

November 7: Andrew McCormack (Technical University of Munich)

Title: The Unbiasedness Threshold

Abstract: Applications of linear algebra in statistics abound, such as those in linear regression and principal components analysis. Moving beyond linearity, the field of algebraic statistics leverages tools from computational algebra and algebraic geometry to solve statistical problems that involve polynomial functions. In this work I examine statistical hypothesis testing for discrete data from an algebraic perspective, with a focus on questions of the existence of unbiased tests. The sample size needed for the existence of a strictly unbiased test, termed the unbiasedness threshold, is shown to be the minimum degree of a polynomial that separates the null and alternative hypothesis sets. In particular, this result implies that null hypothesis sets must be semialgebraic for there to exist a strictly unbiased test. Explicit sample size requirements for various hypotheses in a multinomial model, such as hypotheses of independence in contingency tables, are given. It is demonstrated that upper bounds for the unbiasedness threshold can be found by computing Gröbner bases, and that such upper bounds are tight when all polynomial power functions can be written as sums of squares.

November 21: Lawrence McCandless (Simon Fraser University)

Title: A comparison of Bayesian and conventional quantile regression for modelling the effect of chronic medical conditions on depression symptoms in Canadian adolescents

Abstract: Bayesian quantile regression is an emerging alternative to conventional quantile regression with important computational advantages. However, it has rarely been used in epidemiology research because of the difficulties of doing Bayesian posterior simulation and, additionally, because the method involves an unusual form of model misspecification. In this paper, I investigate Bayesian quantile regression using the Stan programming environment and compare the results with conventional quantile regression. I apply the method in a data example that estimates the effect of chronic medical conditions on depression symptoms in Canadian adolescents. These data are well suited to demonstrating the properties of quantile regression because the outcome variable is unusual: it is continuous on an interval scale but takes only the integer values 0, 1, 2, …, 27. This work makes new methodological contributions to our understanding of Bayesian quantile regression. First, I develop a novel Bayesian method for assessing the presence of heteroscedastic errors in the outcome variable. Second, I show the surprising result that Bayesian quantile regression may give dramatically different results compared to the conventional quantile regression estimator, even in large samples. This occurs because the point estimator from conventional quantile regression is calculated using the simplex algorithm of Barrodale and Roberts, which is heavily affected by discreteness of the outcome variable. In contrast, Bayesian quantile regression explores a continuous range of values for the unknown model parameters. I illustrate the advantages of inference using the full posterior distribution rather than the conventional quantile regression point estimator by using a logarithmic scoring rule for probabilistic prediction. I demonstrate that inference based on the full posterior distribution for unknown parameters will often yield a better overall fit for the data compared to conventional quantile regression.

November 28: Julien Trufin (Université Libre de Bruxelles)

Title: Predictive Modeling and Balance Property through Autocalibration

Abstract: Machine learning techniques provide actuaries with predictors exhibiting high correlation with claim frequencies and severities. However, these predictors generally fail to achieve financial equilibrium and thus do not qualify as pure premiums. Autocalibration effectively addresses this issue since it ensures that every group of policyholders paying the same premium is on average self-financing. This talk looks at recent results concerning autocalibration. In particular, we present a new characterization of autocalibration that makes it possible to identify whether a predictor is autocalibrated or not, we study a method (called balance correction) for obtaining an autocalibrated predictor from any regression model, we highlight the effect of balance correction on the resulting pure premiums, and finally we go through some performance criteria that are particularly relevant for autocalibrated predictors.

December 5: Luke Anderson-Trocmé (University of Chicago)

Title: From genomes to geographies: spatial studies in population genetics

Abstract: This talk explores how spatial context shapes the distribution of genetic variation across a range of biological systems. Taking the French-Canadian population as an example, we will see how rivers and mountains have influenced migration routes and present-day genetic dispersal. These approaches also extend to non-human species, such as the jaguars of South America, where spatial analyses inform conservation strategies in the face of habitat fragmentation. By combining genomics and spatial modelling, this research opens new perspectives on the evolutionary forces underlying genetic diversity.

Winter 2025 Session

January 30: Samuel Valiquette (Université McGill)

Title: The Tree Pólya Splitting discrete multivariate model

Abstract: The analysis of multivariate count data is fundamental in many fields. A suitable model must be flexible enough to induce correlation, yet simple enough for inference and interpretation. One such model is Pólya Splitting, which randomly splits the sum of a discrete vector into its components. This simple approach has several attractive properties. However, its dependence structure must be similar for every variable. To overcome this limitation, a generalization of this model called Tree Pólya Splitting is proposed. In this new model, the splitting process is represented by a tree structure, which offers greater flexibility. In this seminar, we will define the Tree Pólya Splitting model and explore its various properties, including the marginal distributions, the factorial moments, and the dependence structure.

Presentation slides (pdf)

February 6: Mélina Mailhot (Université Concordia)

Title: Risk allocation based on tail dependence

Abstract: This talk discusses the use of tail dependence to attribute risk to each component of a portfolio of dependent insurable and financial risks. First, we will look at the automated identification of extreme values, with an application to actuarial reserves. Next, we will see how tail dependence can be used for dimension reduction in multivariate modelling, with an application to agricultural insurance. We will conclude with the use of tail dependence coefficients to allocate risk capital for a cryptocurrency portfolio.

February 13: (CANCELLED) Josée Dupuis (Université McGill)

Title: Novel Statistical Approaches to Exploit Family History Information to Improve Power to Detect Rare Genetic Variant Associations

Abstract: The growing availability of sequencing data has enabled the investigation of the role of rare variants in disease etiology. However, detecting associations with rare variants or groups of rare variants requires large sample sizes for adequate power, especially for late-onset diseases, when the number of cases in cohorts of younger participants may be low. Family history (FH) contains information on the disease status of relatives, adding valuable information about the probands’ health problems and risk of diseases. Incorporating data from FH is a cost-effective way to improve statistical evidence in genetic studies and overcome limitations in study designs with insufficient cases. We proposed a family history aggregation unit-based test (FHAT) and optimal FHAT (FHAT-O) to exploit available FH for rare variant association analysis. We also proposed a robust version of FHAT and FHAT-O for unbalanced case-control designs. By applying FHAT and FHAT-O to the analysis of all-cause dementia and hypertension using the exome sequencing data from the UK Biobank, we show that our methods can improve significance for known regions.

February 20: Samuel Perreault (University of Toronto)

Title: Parametric estimation of the distribution of a seasonal process as a function of the time of year

Abstract: We propose a method for parametric estimation of the distribution of an environmental variable, such as river flow, as a function of the time of year. Our approach aims to estimate this distribution simultaneously for every point in the seasonal cycle, without explicitly modelling the temporal dependence present in the data. To do so, we adopt a framework inspired by GAMLSS (Generalized Additive Models for Location, Scale, and Shape), in which the distribution parameters vary over the seasonal cycle as functions of explanatory variables that depend only on the time of year, and not on past values of the process under study. Ignoring temporal dependence greatly simplifies the modelling but raises inferential challenges, which we clarify and for which we propose suitable solutions. Our approach is motivated by the study of river flow, and more specifically by the use of the generalized gamma distribution to model it. Applying our method revealed the need to extend this family to include the log-normal distribution, until now considered only as a limiting case. We present this extension along with some computational aspects to take into account when implementing it. This is joint work with Silvana Pesenti and Nancy Reid.

March 13: Vanessa McNealis (University of Glasgow)

Title: Causal inference in the presence of network interference: challenges and methodological advances for public health data

Abstract: Most of the causal inference literature relies on the SUTVA assumption, which stipulates the absence of interference between individuals. Yet in many public health settings, this assumption is unrealistic. For example, in a prevention intervention, an individual’s vaccination status can indirectly affect the infection risk of their contacts within a social network. Beyond unmeasured confounding, the estimation of causal effects in such a setting can be compromised by several potential pitfalls, including confounding by homophily, contextual confounding, autocorrelation, and uncertainty about the underlying graph. In this talk, I will discuss recent approaches to addressing these issues, with an emphasis on developments from my thesis and their applications to education and public health data. I will also discuss future directions and open challenges in this area.

March 20: Stanislav Volgushev (University of Toronto)

Title: Comparing many functional means (joint work with Colin Decker and Dehan Kong)

Abstract: Many modern medical devices produce data with the structure of a multi-channel functional time series. Examples include medical imaging devices (fMRI, EEG, and ECG), high-throughput time-course gene sequencing devices, and high-throughput devices that measure time-course microbiome composition. The typical number of channels can be of the same order as, or larger than, the available sample size.

In this talk, we will present methodology to simultaneously test the equality of a growing number of functional means; in the example above, each mean corresponds to a channel. The number of channels can grow exponentially in the sample size. The proposed test is fully functional in the sense that we do not conduct any explicit dimension reduction or principal component analysis. The practical implementation is based on a Gaussian multiplier procedure, and we provide explicit bounds on the speed of convergence of the rejection probability of our test to the nominal value under the null and power against local alternatives. Our theoretical analysis leverages recent advances in high-dimensional Gaussian approximation but requires several intricate modifications of those techniques.

April 3: Sophia Yazzourh (Université McGill)

Title: Reinforcement learning for precision medicine: integrating medical knowledge into decision algorithms

Abstract: Precision medicine aims to tailor treatments to patients’ individual characteristics. This approach rests on the framework of Dynamic Treatment Regimes (DTR), which seek to determine an optimal decision rule at each stage of intervention. The construction of these rules draws on a variety of methods, including causal inference, Bayesian models, and machine learning.

This talk will focus on reinforcement learning (RL) in the context of DTR. In particular, we will explain why, among the available algorithms, Q-learning is especially well suited to the challenges posed by medical data. However, the real-world application of these approaches in clinical settings meets with reluctance from both practitioners and patients, because learning methods are perceived as “black boxes” whose decisions are difficult to interpret.

We will therefore explore a key avenue for improvement: integrating medical knowledge into RL models. To this end, we will introduce a probabilistic method for constructing rewards based on the preferences of medical experts. Applied to case studies on diabetes and cancer, this approach generates rewards that exploit the data while avoiding the biases of manual design, thereby ensuring better consistency with medical objectives.

April 10: Philippe Goulet Coulombe (UQAM)

Title: Dual Interpretation of Machine Learning Forecasts

Abstract: Machine learning predictions are typically interpreted as the sum of contributions of predictors. Yet, each out-of-sample prediction can also be expressed as a linear combination of in-sample values of the predicted variable, with weights corresponding to pairwise proximity scores between current and past economic events. While this dual route leads nowhere in some contexts (e.g., large cross-sectional datasets), it provides sparser interpretations in settings with many regressors and little training data, such as macroeconomic forecasting. In this case, the sequence of contributions can be visualized as a time series, allowing analysts to explain predictions as quantifiable combinations of historical analogies. Moreover, the weights can be viewed as those of a data portfolio, inspiring new diagnostic measures such as forecast concentration, short position, and turnover. We show how weights can be retrieved seamlessly for (kernel) ridge regression, random forest, boosted trees, and neural networks. Then, we apply these tools to analyze post-pandemic forecasts of inflation, GDP growth, and recession probabilities. In all cases, the approach opens the black box from a new angle and demonstrates how machine learning models leverage history partly repeating itself.

April 17: Kathleen Miao (University of Toronto)

Title: Robust Elicitable Functionals

Abstract: Elicitable functionals and (strictly) consistent scoring functions are of interest due to their utility in determining (uniquely) optimal forecasts, and thus the ability to effectively backtest predictions. However, in practice, assuming that a distribution is correctly specified is too strong a belief to reliably hold. To remedy this, we incorporate a notion of statistical robustness into the framework of elicitable functionals, meaning that our robust functional accounts for “small” misspecifications of a baseline distribution. Specifically, we propose a robustified version of elicitable functionals by using the Kullback-Leibler divergence to quantify potential misspecifications from a baseline distribution. We show that the robust elicitable functionals admit unique solutions lying at the boundary of the uncertainty region, and provide conditions for existence and uniqueness. Since every elicitable functional possesses infinitely many scoring functions, we propose the class of b-homogeneous strictly consistent scoring functions, for which the robust functionals maintain desirable statistical properties. We show the applicability of the robust elicitable functional in several examples: in a reinsurance setting and in robust regression problems.

April 25: Nancy Reid (University of Toronto) (exceptionally in room PK-R605 and in hybrid format on Zoom; see the link below)

Title: When likelihood goes wrong

Zoom link: https://uqam.zoom.us/j/8528459916

Abstract: Inference based on the likelihood function is the workhorse of statistics, and constructing the likelihood function is often the first step in any detailed analysis, even for very complex data. At the same time, statistical theory tells us that ‘black-box’ use of likelihood inference can be very sensitive to the dimension of the parameter space, the structure of the parameter space, and measurement error in the data. This has been recognized for a long time, and many alternative approaches have been suggested with a view to preserving some of the virtues of likelihood inference while ameliorating some of the difficulties. In this talk I will discuss some of the ways that likelihood inference can go wrong, and some of the potential remedies, with particular emphasis on model misspecification.