Slides, lecture videos, etc.

This page will be updated regularly with materials (e.g., slides and videos) for each lecture. Most recent materials are at the top.

Week 15b — 12/01/2022

Today’s lecture gave a (very) high-level recap of the material covered this semester. Thanks for your interest!

Week 15b lecture video and Slides

Remarks & references:

  • I mentioned that possibility measures are the simplest of the imprecise probability models and that, according to Shafer, their consonance structure is compatible with statistical inference. But I should’ve also mentioned that I’m not adopting possibility theory as the framework for my inferential models merely because of its simplicity or because of what Shafer said. In fact, I think possibility theory is “correct” for statistical inference, as I explain, with some theoretical support, in Section 3 of my latest paper.
  • Current version of my paper describing the new, valid inferential model framework with partial priors can be found here.
  • The books by Nassim Taleb (e.g., Skin in the Game) are nice; they provide some further (and better) insight into the idea that I was trying to communicate at the end of today’s lecture.
  • “Skin in the game” is also related to Crane’s Fundamental Principle of Probability and Naive Probabilism.
  • Related to the comments I made towards the end of the lecture, I think statisticians create imaginary versions of the “practitioners” who would use the methods we develop. We tell ourselves that “practitioners” want methods that are simple, fast, and precise. In reality, I think practitioners primarily want their solutions to be right and, subject to that constraint, they prefer simple, fast, and precise. Since we haven’t been able to deliver on practitioners’ top priority, i.e., solutions that are right, we over-emphasize the secondary priorities.

Week 15a — 11/29/2022

Today’s lecture focused on Simpson’s paradox, which is tangentially related to imprecise probability and some of the foundational stuff we’ve been discussing in recent lectures, as I tried to explain. In general, I think there are some important points to be made here that students should be aware of and understand.

Week 15a lecture video and Slides

Remarks & references:

Week 14a — 11/22/2022

Today I focused on two questions we haven’t considered so far: (a) what if the model is uncertain? and (b) what if the model is imprecise in a certain sense? Most of the lecture was on (a), describing what this question means and how uncertainty about the model itself can be quantified using a valid IM. For (b), the coverage was high level, aiming just to shed light on the problem and how to approach it.

Week 14a lecture video and Slides

Remarks & references:

  • The Wikipedia page about Occam’s razor gives a pretty good introduction.
  • My ISIPTA’19 paper about model uncertainty (unfortunately, without regularization) is here.
  • I mentioned in the lecture that one can think of the partial prior for the model index M similarly to how one might think of the marginal prior distribution for M in a Bayesian analysis; of course, they have different mathematical properties (e.g., the latter has to sum to 1 over all models), but the rationale behind them is similar. This can help because there’s been a lot of work in recent years on Bayesian model selection (and structure learning more generally) in high-dimensional problems. In fact, my experience with these Bayesian (or Bayesian-like) solutions is what got me thinking about the partial prior idea in the first place. That is, one often has a good idea about how to regularize on the models, i.e., how to choose a marginal prior for M, but it’s not at all clear what prior to take for the M-specific parameters and, unlike in low-dimensional cases, here the choice of that prior matters a lot, even asymptotically. So that got me thinking: could we just work with the marginal prior for M that we’re comfortable with, leaving the prior for the model parameters vacuous? You can’t do this with Bayes/probability, but you can with imprecise probability…
  • I don’t claim to be an expert on this, but I know that there’s tons of literature about missing and coarse data in statistics, machine learning, and imprecise probability. For example:
  • Nguyen’s book has a few details about coarse data and the use of random set models.
  • Note that the concept of missing data is crucial to causal inference. The reason is that questions about a causal relationship concern comparing what would’ve happened if the same patient received both treatments. But a subject can only receive one treatment so the response under the other treatment, the so-called potential outcome, is missing data.

Week 13b — 11/17/2022

Today we covered formal decision theory, starting with the basic setup and then considering different levels of probabilistic uncertainty quantification (UQ) about the state of the world. First is vacuous UQ, which aligns with non-Bayesian statistical decision theory; then fully probabilistic UQ, which aligns with Bayesian statistical decision theory (and the von Neumann–Morgenstern theory); and then imprecise-probabilistic UQ, which involves lower/upper expected utilities computed as Choquet integrals. A small code sketch of the latter is included below.
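
To make the Choquet-integral computation concrete, here’s a minimal R sketch (mine, not from the lecture); the discrete Choquet integral below is standard, but the capacity (an epsilon-contamination lower probability) and the utility values are made up for illustration.

    # Discrete Choquet integral of a (non-negative) utility u with respect to
    # a capacity nu on a finite state space; 'nu' maps a vector of state
    # indices to [0,1].
    choquet <- function(u, nu) {
      ord <- order(u)                    # sort utilities in increasing order
      total <- 0; prev <- 0
      for (i in seq_along(ord)) {
        A <- ord[i:length(ord)]          # states whose utility is >= u[ord[i]]
        total <- total + (u[ord[i]] - prev) * nu(A)
        prev <- u[ord[i]]
      }
      total
    }

    # Example: lower probability from an epsilon-contamination of a uniform P
    eps <- 0.1; p <- rep(1/3, 3)
    nu_lower <- function(A) if (length(A) == 3) 1 else (1 - eps) * sum(p[A])
    u <- c(0, 2, 5)
    choquet(u, nu_lower)   # lower expected utility: (1-eps)*mean(u) + eps*min(u)

The same function computes the upper expected utility if you pass it the conjugate (upper) capacity instead.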

Note: There were issues again with the live lecture recording, so below is a re-recorded version.

Week 13b lecture video and Slides

Remarks & references:

  • My lecture closely follows the review article, Denoeux (IJAR 2019). There are lots of references about decision theory in the vacuous and probabilistic UQ contexts, but not too many discuss the imprecise-probabilistic UQ case; see Chapter 8 in Intro to IP. The review paper doesn’t really talk about statistical decision theory, but there are lots of familiar references on that, in particular, Berger’s 1985 Statistical Decision Theory and Bayesian Analysis book. I also have some notes on statistical decision theory — see Chapter 5 here — from previous courses that I taught.
  • More details and references on the Ellsberg Paradox can be found on its Wikipedia page. There are also some details about the imprecise-probabilistic resolution (with numerical results) in Chapter 8 of Intro to IP.
  • My paper about valid IMs and decision theory is here. The paper I’m familiar with about decision theory in the context of fiducial inference is Taraldsen & Lindqvist (Annals 2013), but there may be some more recent follow-up paper by these authors.

Week 13a — 11/15/2022

Today I started with a brief discussion of the ML perspective on regression, then moved on to describe two specific imprecise-probabilistic approaches. The first, which I mentioned previously in a different context, is conformal prediction and strongly valid IMs. The majority of the lecture was spent describing a new approach based on random fuzzy sets. This required some background on fuzzy sets that we hadn’t discussed before.

Note: There was another technical glitch so I had to re-record the lecture later after the regular class meeting. Unfortunately, there were a number of good questions from the in-person participants this time, but these didn’t make it into the recording 🙁

Week 13a lecture video and Slides

Remarks & references:

  • The paper connecting conformal prediction and (strongly) valid IMs is here.
  • Fuzzy set theory is a big area. In particular, it goes beyond imprecise probability, though clearly the two are related. For some more background, see, e.g., the Wikipedia page, which provides a number of references, including a link to Zadeh’s seminal 1965 paper.
  • The idea of combining pieces of evidence/information via “intersection” is attractive, so that aspect of fuzzy set theory appeals to me. While I had other reasons for doing so, I’ve used the fuzzy set intersection operation in various places in my work. Most recently, in the new IM developments, the “relative likelihood” that I’m proposing (e.g., Slide 14 in the Week 09b lecture) is a (normalized) fuzzy set intersection of the information in the data (likelihood ratio) and the prior; a schematic version of this operation is displayed after this list.
  • General details about random fuzzy sets/numbers can be found in Denoeux (Fuzzy Sets & Systems 2022+) and the specific application, evidential neural net regression, can be found in Denoeux (BELIEF’22). An earlier paper developing lower and upper probabilities based on random fuzzy sets is Couso & Sánchez (Fuzzy Sets & Systems 2011).
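
Here’s the schematic version of the normalized fuzzy intersection promised above (my notation; the minimum t-norm is just one common choice, the product being another): for membership/possibility functions \(\pi_1\) and \(\pi_2\),

\[ (\pi_1 \cap \pi_2)(x) = \min\{\pi_1(x), \pi_2(x)\}, \qquad (\pi_1 \cap \pi_2)^{\mathrm{norm}}(x) = \frac{\min\{\pi_1(x), \pi_2(x)\}}{\sup_y \min\{\pi_1(y), \pi_2(y)\}}, \]

where the normalization ensures that the result attains the value 1 somewhere, as a possibility distribution should.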

Week 12b — 11/10/2022

Today’s lecture continues with the discussion of classification, with a relatively high-level summary of two imprecise-probabilistic classifiers: Denoeux’s evidential neural net classifier, which involves belief functions, and another based on conformal prediction and possibility measures.

Week 12b lecture video and Slides

Remarks & references:

  • The two papers I’m basing my presentation of Denoeux’s work on are Denoeux (IEEE SMC 2000) and Tong, Xu, and Denoeux (Neurocomputing 2021). Details about IMs for valid (regression and) classification using nested random sets and possibility measures can be found here.
  • I’ve admitted that I don’t know all that much about deep learning, but that’s not because it’s too hard or because I don’t think it’s important. I want to explain this here since it might be semi-helpful “advice” for students. First, there are only so many hours in the day, so every moment we spend studying one thing is a moment less we have to think about something else. In the last few years, I’ve had other priorities, so it was essential that I stay focused on those in order to make the kind of progress I was hoping for. Second, if deep learning really is the future, then I’ll have time to learn about it later; there’s no rush. Third, my opinion is that we don’t understand the simpler problems as well as we think we do, so my time is better spent elsewhere; that way, if/when I choose to work on deep learning, I’ll be in a better position to hopefully make a contribution.
  • At a couple points in lecture I mentioned “stochastic gradient descent” but didn’t give any explanation. In case you’re interested, SGD is basically just the simple iterative descent/root-finding schemes (à la the Newton’s method you learned in a first calculus class), but with a few small tweaks to deal with the fact that you only have access to noisy evaluations of the gradient. (The idea is that, if you just run the deterministic iteration in the presence of noise, convergence can fail because the noise dominates the dynamics.) Despite all the recent attention to this in the machine learning literature (no doubt, important advances have been made recently), this is a classical problem that was solved by statisticians back in the 1950s. Herbert Robbins (of empirical Bayes, compound decision theory, and … fame) and Sutton Monro, in a famous 1951 paper, developed what is now called “stochastic gradient descent” under the original name stochastic approximation, and this attracted a lot of statisticians’ attention in the 1950s and later; it’s a really cool algorithm, and the theoretical analysis is beautiful. A minimal sketch of the classical iteration is included below. My PhD thesis happened to be about a particular instantiation of stochastic approximation for the purpose of estimating unknown mixing distributions in mixture models. I gave a sort of “short course” a few years ago about stochastic approximation theory and (a biased collection of) applications; the slides are here if you’re interested.
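
Here’s a minimal R sketch (mine; the objective and noise are made up) of the classical Robbins–Monro iteration: step against a noisy gradient evaluation, with step sizes \(a_n\) decaying so that \(\sum_n a_n = \infty\) and \(\sum_n a_n^2 < \infty\).

    # Robbins-Monro stochastic approximation, sketched for the hypothetical
    # target f(theta) = (theta - 2)^2, observed only through noisy gradients.
    set.seed(1)
    noisy_grad <- function(theta) 2 * (theta - 2) + rnorm(1)  # true gradient + noise
    theta <- 0
    for (n in 1:5000) {
      a_n <- 1 / n                       # satisfies the step-size conditions
      theta <- theta - a_n * noisy_grad(theta)
    }
    theta   # close to the minimizer, 2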

Week 12a — 11/08/2022

This lecture shifts gears from the prediction-without-covariates setting to some more modern topics, commonly referred to as supervised learning. I gave a high-level overview of these kinds of problems and some discussion of how this relates to statistics and, perhaps, to imprecise probability. I also gave a superficial explanation of one imprecise-probabilistic technique for classification, namely, Marco Zaffalon’s naive credal classifier.

Week 12a lecture video and Slides

Remarks & references:

  • I first learned about the minimum clinically important difference (MCID) problem from my friend and former colleague, Samad Hedayat, and his work in Hedayat et al (Biometrics 2015). This is a unique problem, so, to my knowledge, it hasn’t been closely investigated by many others — Jiwei Zhao at University of Wisconsin Madison has some work on MCID specifically. But this problem is similar to many others that pop up in the personalized medicine context, so it’s a neat application.
  • The MCID problem above gave me considerable motivation, and a lot of the work that I’ve done in the last few years is related in some way to this problem. Around that time I learned about some old but not frequently used ideas about so-called Gibbs posterior distributions (the general form is displayed after this list), and I realized that this was exactly suited for “probabilistic inference” in the MCID problem and in many others where the quantity of interest is defined as a risk minimizer rather than as a parameter in a statistical model with a likelihood. The first paper on this idea (specifically for MCID) is here, and then we realized that the mis- or under-specification of the model meant we had to manually tune the Gibbs posterior in order to achieve frequentist calibration, which led to the work in Syring and M (Biometrika 2019). Since then, there’s been a deep theoretical investigation (here) into the asymptotic properties of general Gibbs posteriors, along with some (in my opinion) cool applications here, here, here, and here; the latter is again related to the MCID problem. The recent survey chapter gives a summary of these developments.
  • Basics about the naive Bayes classifier can be found here, along with references. Obviously, there are a ton of other resources on this available online and in books/papers.
  • A preprint of Zaffalon’s 2002 JSPI paper on the naive credal classifier is here; see, also, Chapter 10 in the Intro to IP book, which has more details on the version using Walley’s IDM like I discussed in lecture.
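
The Gibbs posterior form promised above replaces the likelihood in Bayes’s formula with an exponentiated (negative) empirical risk; schematically, in my notation (which needn’t match the linked papers):

\[ \pi_n(\theta) \propto \exp\Big\{ -\omega \sum_{i=1}^n \ell_\theta(Z_i) \Big\} \, \pi(\theta), \]

where \(\ell_\theta\) is the loss whose risk the quantity of interest minimizes, i.e., \(\theta^\star = \arg\min_\theta E\,\ell_\theta(Z)\), \(\pi\) is the prior, and \(\omega > 0\) is the learning rate, i.e., the quantity that must be tuned to achieve the frequentist calibration mentioned above.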

Week 11b — 11/03/2022

No class or video, NC State’s Chancellor called for a “wellness day” and canceled classes. Students can use this time to work on their course project, if they don’t need the time for other things.

Week 11a — 11/01/2022

This lecture digs a bit more into the basic prediction (without covariates) case, this time focusing on the multinomial or discrete nonparametric model. I think it’s important to go through this before getting into details about classification, etc. I’ll cover the basics of Walley’s imprecise Dirichlet model, Denoeux’s confidence region-driven belief function construction, and the previously-described valid predictive IM.

Week 11a lecture video and Slides

Remarks & references:

  • The special case of multinomial with two categories, i.e., binomial, is covered in detail in Walley’s book. The generalization to multinomial and the name “imprecise Dirichlet model” can be found in Walley (JRSS-B 1996).
  • Some details about Walley’s IDM can be found in Chapters 7 and 10 in the Intro to IP book; see, also, the Aslett et al text. There’s also an entire special issue of IJAR dedicated to the IDM. (A small numerical sketch of the IDM’s predictive probabilities follows this list.)
  • An extension of Walley’s IDM to other exponential family models can be found in Quaeghebeur & de Cooman (ISIPTA 2005).
  • One of my former students (Dr. Joyce Cahoon) has a chapter in her thesis on the use of the IDM for classification of tweets; unfortunately, that project was never really “finished”.
  • No doubt students are at least aware of the so-called Dirichlet process commonly used in Bayesian nonparametrics. The paper Ferguson (Annals 1973) is where this idea was first developed. This is a very nice paper, both technical and clear, and if you take a look you’ll see the simple Dirichlet-multinomial model plays a key role in this formulation.
  • Denoeux (IJAR 2006) is the paper on belief functions for multinomial prediction based on confidence regions.
  • Another nice paper on belief functions for prediction (and more) is Denoeux & Li (IJAR 2018).
  • The recent paper that develops a computational strategy capable of handling the Dempster-style belief function construction in the case of more than a few categories is Jacob et al (JASA 2021).
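
As promised above, here’s a small numerical sketch (mine; the counts are made up) of the IDM’s predictive probabilities: with category counts \(n_j\), total count \(N\), and hyperparameter \(s > 0\), the lower and upper probabilities that the next observation falls in category \(j\) are \(n_j/(N+s)\) and \((n_j+s)/(N+s)\), respectively.

    # IDM lower/upper predictive probabilities for a multinomial sample
    idm <- function(counts, s = 2) {
      N <- sum(counts)
      cbind(lower = counts / (N + s),           # prior Dirichlet weight on j is 0
            upper = (counts + s) / (N + s))     # prior Dirichlet weight on j is s
    }
    idm(c(a = 6, b = 3, c = 1))   # intervals narrow as N grows relative to s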

Week 10b — 10/27/2022

Wrapped up the discussion on statistical inference and, in particular, on the new construction of strongly valid (consonant) IMs. Transitioned into a discussion of prediction/classification, and talked about simple versions of nonparametric predictive inference and conformal prediction.

Week 10b lecture video and Slides

Remarks & references:

  • An IM construction for valid prediction under parametric models is presented here.
  • Frank Coolen’s nonparametric predictive inference is described here.
  • Shafer & Vovk (JMLR 2008) give a nice “tutorial” on conformal prediction.
  • Two papers discussing (successively more general) versions of conformal prediction and the connection to imprecise probabilities and IMs are here and here.
  • R code for the conformal prediction example in lecture is here (txt format); a small standalone sketch of the basic idea is also included below.
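
To complement the linked code, here’s a minimal standalone R sketch (mine, with a simple made-up nonconformity score) of the basic conformal idea in the no-covariates case: the plausibility of a candidate value \(y\) is its rank-based p-value in the augmented sample, and thresholding plausibility at \(\alpha\) yields a prediction set with coverage at least \(1-\alpha\) under exchangeability.

    # Conformal plausibility of candidate y, given data x
    pl <- function(y, x) {
      z <- c(x, y)                        # augmented sample
      score <- abs(z - mean(z))           # nonconformity: distance to the mean
      mean(score >= score[length(z)])     # fraction at least as "strange" as y
    }
    set.seed(42)
    x <- rnorm(20)
    ys <- seq(-4, 4, length.out = 401)
    plaus <- sapply(ys, pl, x = x)
    range(ys[plaus > 0.05])               # approximate 95% prediction interval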

Week 10a — 10/25/2022

Today we continued with our discussion of the new partial-prior inferential model (IM) developments. In particular, I talked about the kind of coherence property it satisfies and then turned to questions about efficiency. An example involving marginal inference on the odds ratio in a 2×2 table was presented.

Week 10a lecture video and Slides

Remarks & references:

  • The emphasis on dimension reduction for the sake of efficiency isn’t new, although it’s usually not phrased this way in standard presentations of statistical inference. In our previous IM developments, the emphasis on dimension reduction is made explicit; see, e.g., here, here, and Chapter 3 in our IM book.
  • There are some close connections between the new construction I described in the lectures here and what I/we had previously been calling generalized IMs. Some references about this are here, here, here, and here.
  • R code for the odds ratio example is here (txt format).

Week 09b — 10/20/2022

No in-class lecture, but there’s a video posted below; we’ll be back in regular class lectures next week. Today’s lecture covers the construction of a valid and efficient IM. I start with the original formulation that’s described in the IM book, then move on to some new material based on the recent partial-prior IM developments here and some even newer work that isn’t available in writing yet.

Week 09b lecture video and Slides

Remarks & references:

  • There’s an interesting relationship between the binomial and beta CDFs; see here. It’s a consequence of integration by parts, a derivation that’s commonly given in intro probability courses. I used this in my derivation of the random set in the binomial example in lecture. (A quick numerical check of the identity is included after this list.)
  • Leonardo Cella and I attempted to incorporate partial prior information into the “first valid IM construction” by suitably stretching the random sets; see here. This is a neat idea, but I think it’s difficult to imagine doing this in complicated problems.
  • It doesn’t contain all the details of the IM solution I described in the lecture today, but you can see the (short) conference proceedings paper here that lays out some of the computational details. I’m presenting this paper next week at the BELIEF 2022 conference in Paris; I won’t be in Paris though, just in my house.
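
Here’s the quick numerical check of the binomial/beta relationship promised above; the identity is \(P\{\mathrm{Bin}(n,\theta) \ge k\} = F_{\mathrm{Beta}(k,\, n-k+1)}(\theta)\).

    # Binomial upper-tail probability vs. the corresponding beta CDF
    n <- 10; k <- 4; theta <- 0.3
    pbinom(k - 1, n, theta, lower.tail = FALSE)   # P(Bin(n, theta) >= k)
    pbeta(theta, k, n - k + 1)                    # identical to the line above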

Week 09a — 10/18/2022

Today we discussed generalized Bayesian inference, i.e., a set of posterior distributions determined by a set of priors, and its properties — in particular, validity and efficiency (or lack thereof).

Week 09a lecture video and Slides

Remarks & references:

  • The papers by Wasserman and Walley that I mention in this lecture are linked below under Week 06b.

Week 08b — 10/13/2022

Quick review of the statistical inference setup, then talked about the inferential model construction based on Dempster’s updating rule. Some properties were given, along with a simple example. In particular, I explained how the fact that Dempster’s rule can’t avoid sure loss leads to a lack of validity.

Had some more trouble with the recording during class yesterday, so I re-recorded the lecture today.

Week 08b lecture video and Slides

Remarks & references:

  • Presentation based on Dempster (IJAR 2008) and M. et al (Stat Sci 2010).
  • A one-stop-shop where you can find all of Dempster’s key works on statistical inference (along with many other interesting papers) is the book Classic Works of the Dempster-Shafer Theory of Belief Functions, which is available electronically to NC State students through the library.
  • The validity-implies-no-sure-loss result is presented here. Now I have a better understanding of exactly what’s being proved there, so I’ll update that when I next get a chance.
  • Lots of very nice applications of the Dempster-Shafer framework are summarized in Cuzzolin’s book. And these applications are ongoing: there’s some new work on epistemic deep learning that I heard about recently, and an entire conference devoted to such topics a couple of weeks from now.
  • There’s an interesting-but-technical paper by Sandra West (Ann Stat 1979) that carries out Dempster-style inference in a logistic regression model.

Week 08a — 10/11/2022

No class, for NCSU’s fall break.

Week 07b — 10/06/2022

Quick recap of some of the imprecise probability ideas that we’ve discussed so far, then a transition from the general theory to applications of that theory in stat/ML. The first “application” that I want to focus on is statistical inference, which is my primary interest, what brought me to imprecise probability in the first place, and what motivated me to offer this course. We’ve mentioned a few of these ideas along the way, but today’s lecture starts to lay down some of those details more formally. Today’s lecture was mostly setup and motivation for considering inferences summarized as imprecise probabilities. More details in later lectures.

Week 07b lecture video and Slides

Remarks & references:

  • In the lecture I spotted a typo on the slide that presents the false confidence theorem; I’ve now corrected that in the slides posted above. I also elaborated a bit on the slide itself about what the phrase “false confidence” means.
  • The original paper that presents the false confidence theorem motivated by the satellite collision example is here; there was a comment about our paper by other authors here, and we responded to their comment here. I’ve also written about the false confidence theorem myself in a few places: here, here, and a bit here.
  • There are still a number of interesting open questions concerning the false confidence theorem. The one that I think is most important is: what kinds of assertions A are afflicted by false confidence? My conjecture (which I probably explained in one of the references above) is that false confidence appears when the assertion is about a non-linear feature of the full parameter, though not every non-linear feature is afflicted. I also believe this is related to (striking yet ignored) results on the non-existence of suitable confidence sets in certain applications; see, e.g., Gleser & Hwang (Annals 1987). There are some related points in a very old paper by Pitman, but I’ve not had time to sort out those details. If you’re interested, let’s talk.

Week 07a — 10/04/2022

Today’s lecture was based on the recent paper by Gong & Meng (Stat Sci 2021) on updating rules in imprecise probability, in particular, on some of the “counter-intuitive” properties that Dempster’s rule and generalized Bayes rule can have. This discussion will serve as a good transition point from our discussion of the theory of imprecise probability to applications of that theory in statistics and machine learning.

Week 07a lecture video and Slides

Remarks & references:

  • I updated the slide on “gBayes can’t sharpen prior ignorance” with a sketch of the proof.
  • Ruobin Gong (one of the co-authors of this paper) gave a talk recently in the SIPTA Seminar Series on roughly the contents of their paper. I’m sure she can present this material better than I can, so if you want to hear her perspective, you can check out the video on YouTube.
  • Gong & Meng’s paper was a discussion paper, which means that other folks were asked to write responses with their own opinions, to which the authors later replied. There were four sets of discussants: Glenn Shafer, Gregory Wheeler, Thomas Augustin & Georg Schollmeyer, and Liu & M. All four of these responses (or at least the first 3!) are really insightful additions to the original paper; I highly recommend them all. The background you now have on imprecise probability should make them accessible to read. The authors responded in their rejoinder.
  • Many of the properties about Dempster’s and generalized Bayes rules presented in Gong & Meng’s paper were previously known; see, e.g., Section 3.1.4 in Cuzzolin’s book where a result due to Kyburg is presented that captures Lemma 4.3 in Gong & Meng.

Week 06b — 09/29/2022

Continue with discussion of lower previsions. Focus is on two ways to modify an existing lower prevision (for two distinct purposes): natural extension and conditioning. The latter leads to the notion of Generalized Bayes, which has some direct statistical applications, as we’ll discuss later. Note: my coverage of conditional lower previsions won’t be comprehensive or fully rigorous.

Week 06b lecture video and Slides

Remarks & references:

  • Chapter 7 in the Introduction to Imprecise Probability text describes the generalized Bayes rule in statistical inference in a few different cases, with a lot of emphasis on Walley’s imprecise Dirichlet model, which I hope to have time to discuss at least briefly in class.
  • Generalized Bayes is more general than this, but in statistics it often appears under the name robust Bayes. The idea is that we might have a collection of prior distributions under consideration, and application of Bayes’s rule to each of them yields a corresponding collection of posterior distributions. Then the lower and upper envelopes (like the generalized Bayes rule) can be used to summarize inference and assess how sensitive the posterior is to the prior; the envelope formulas are displayed after this list. Jim Berger did a lot of work on this in the 1980s and 90s (see Chapter 4.7 in his book). The challenge is in being able to compute the lower and upper envelopes for a given credal set of priors. The Wasserman 1990 paper that I referred to in lecture a while back describes this computation for a certain class of priors.
  • There’s a nice paper by Walley, in particular, Walley (JSPI 2002), that uses generalized Bayes in a particularly clever way. I’ll talk about this specifically in lecture soon.
  • I give some details about generalized Bayes in the context of my notion of validity (a particular kind of reliability) in this paper. I’ll have more to say about this later.
  • I didn’t give enough details in lecture to make the following claim precise, but there are situations in which “joint coherence” of a conditional and unconditional lower prevision is only satisfied when the conditional lower prevision is defined via the generalized Bayes rule. See Theorem 6.5.4 in Walley’s book or the remarks at the top of page 50 in the Intro to IP book.
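
Schematically (my notation), the robust-Bayes envelopes mentioned above are as follows: given a credal set \(\Pi\) of priors, each \(\pi \in \Pi\) is updated via Bayes’s rule to a posterior \(\Pi_\pi(\cdot \mid x)\), and then

\[ \underline{\Pi}(A \mid x) = \inf_{\pi \in \Pi} \Pi_\pi(A \mid x), \qquad \overline{\Pi}(A \mid x) = \sup_{\pi \in \Pi} \Pi_\pi(A \mid x). \]

Roughly, when the lower probability of the conditioning event is positive, this envelope construction agrees with the generalized Bayes rule discussed in lecture.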

Week 06a — 09/27/2022

Quick overview of De Finetti’s formulation in terms of previsions (expectations), and then a start on the imprecise version of De Finetti’s general theory, i.e., the theory of lower previsions (lower expectations), largely due to Peter Walley, starting back in the 1980s.

Note: There was a technical glitch with the video I recorded in class yesterday, so I had to record a new version, which is available below. I made some minor changes to my presentation (slides and lecture) of “desirability” because I don’t think what I said in class was all that clear. Sorry for the inconvenience.

Week 06a lecture video and Slides

Remarks & references:

  • Most of the material I presented in lecture is from Chapter 2 (“Lower Previsions”, by Miranda & De Cooman) in the Augustin et al book. This book is available electronically to NC State students through the library.
  • A nice resource is the set of introductory slides prepared by Enrique Miranda, available here.
  • Chapter 1 of the Augustin et al text (by Erik Quaeghebeur) talks about the theory of desirability, which is an alternative but equivalent approach to what we’re doing. That is, one can use a given coherent lower prevision to define a set of desirable gambles, and one can use a set of desirable gambles to define a coherent lower prevision. So the choice is mostly a matter of taste. For those of us familiar with probability (and some imprecise probability), I think the lower previsions approach is more accessible, but that’s just me.

Week 05b — 09/22/2022

Today I gave some more specific details about coherence and how it relates to properties of the associated credal set. These general questions about coherence are better posed in terms of “upper expectations” than “upper probabilities”, so this discussion naturally leads to notions of expected values in imprecise probabilities, in particular, the Choquet integral. Also, the next imprecise probability model we’ll discuss, namely, lower/upper previsions, takes expectations as its starting point.

Week 05b lecture video and Slides

Remarks & references:

  • The material in this lecture was difficult for me to put together because I wasn’t able to find a good reference that covered exactly what I wanted to say. (Maybe that’s a sign that I should’ve chosen to say different things…) There probably are references out there that I just didn’t find, so if anyone reading this has a recommendation, please let me know. Anyway, the challenge was that we’ve been talking about imprecise probabilities, whereas the references that discuss the details of coherence tend to focus on imprecise expectations (i.e., lower/upper previsions). There’s good reason for the focus on expectations, but I wanted to at least partially address the coherence question directly in terms of imprecise probability. The best reference I was able to find for this was Chapter 10 in Huber & Ronchetti’s Robust Statistics book (which is available electronically to NC State students through the library). Their presentation doesn’t talk about “no sure loss” or “coherence” specifically, but the theorems in the lecture are adapted from theirs.
  • In today’s lecture especially, but in others before, I’ve emphasized the importance of coherence for various reasons. But I need to make sure it’s clear that coherence is not a strong property; it’s only a necessary condition for an assessment of uncertainty to be “rational”. That is, if we accept the behavioral interpretation of lower/upper probabilities in terms of prices for bets, then it makes sense to require that the pricing scheme satisfy certain properties, e.g., no sure loss. But there are lots of pricing schemes that would meet these basic requirements, so coherence alone doesn’t mean much. The point is that coherence and other behavioral properties are about internal/subjective consistency — in particular, coherence doesn’t tell us anything about the “real world”, if that’s even possible — so it’s up to us to build in whatever additional structure we need for the applications we have in mind.

Week 05a — 09/20/2022

After a quick recap on belief functions, we discussed the connections with possibility measures, and then I gave a characterization of a belief function’s credal set in terms of allocations. Then I described Dempster’s rule of combination more formally, and showed how the combination of a prior belief function and the belief function that describes an “observation” leads to a sort of generalization of Bayes’s rule.

Week 05a lecture video and Slides

Remarks & references:

  • While consonant belief functions are very special cases, it’s interesting that consonance lines up nicely with statistics. Chapter 10 in Shafer’s 1976 book focuses exclusively on the statistical inference problem, where he constructs a consonant belief function to suit his purposes. Something more is needed beyond what Shafer puts forward there, but I know what’s needed and we’ll discuss it in class later. Besides the fact that consonance might be “right” for statistics, it’s great that we only need to know how to compute with the most basic of imprecise probability models — even belief functions on (large) finite frames are computationally demanding.
  • Consonant approximations of general imprecise probabilities are something that I’m specifically working on now, and I’ll go over this in lecture later. Some details on consonant approximations of belief functions can be found in Dubois & Prade’s 1990 paper.
  • For the general details about allocations of probability, see Shafer’s 1979 paper. If I’m being honest, I can’t follow what Shafer is presenting here, or at least I couldn’t in all of my previous attempts. However, after preparing for this week’s lectures, I might have some better intuition (e.g., allocations are basically mixtures), which could help make the technical details more manageable.
  • I didn’t realize it before, but the formula for “P^*” that I presented in Week 03b, slide 11, is a type of allocation too. If the P_0 there is Lebesgue measure, then this is very much like the “uniform allocation” I mentioned in today’s lecture.
  • Cuzzolin’s book has lots of details about Dempster’s rule of combination and belief functions more generally. I also like Dempster (IJAR 2008) as it gives a very systematic presentation of the framework with a focus on statistics problems; also, Dempster’s proposal of a “(p,q,r)” triple to quantify the imprecision in the belief function output is interesting. We’ll do some more with belief functions a little later in the course.

Week 04b — 09/15/2022

Finished up details from previous slides on the extension principle, then moved on to new material on belief functions in the slides below. After a bit about the origins and different perspectives of belief function theory, aka Dempster–Shafer theory, I introduced some of the basic terminology and notation. Since the focus is on finite frames, this is really no different than what we did with finite random sets before. An important feature of this framework is the formal combination of different bodies of evidence, and I gave a heuristic illustration of how this works at the end of today’s lecture. A more formal treatment will be given in the next class.

Week 04b lecture video and Slides

Remarks & references:

  • Only after seeing the extension principle did I realize how important possibility theory is to statistics. Like I mentioned in Week 01, we’re often performing this supremum operation, and the justification for it is, more or less, that “the supremum operation works” for the marginalization purpose it’s intended for. So a lightbulb went off for me when I saw that this mysterious supremum is exactly how marginalization is done in possibility theory.
  • The example at the end of the Week 04a slides is closely related to the false confidence theorem I presented in one of the early lectures and, in particular, to what I say in Section 3 of this paper. The key take-away message from this example can be succinctly stated as follows: good statistical properties might be lost when marginalization is carried out via the probability calculus (integration), but not when carried out via the possibility calculus (optimization, via the extension principle; the formula is displayed after this list). In this example, while the original posterior for \(\theta\) has good statistical properties, the marginal posterior distribution for \(\phi\) (the squared length of \(\theta\)) derived from it does not; however, the “posterior possibility distribution” for \(\theta\), obtained via outer approximation, has good statistical properties, and so does the derived marginal posterior possibility distribution for \(\phi\).
  • We’ll talk more about this later, but you’re welcome to get a jump on it now if you’re interested: a version of the probability-to-possibility transform is crucial to my proposed “partial prior” inference framework in this paper, see Section 7. This is even more important than I realized back when I was writing that paper, and I’ll have something to say about my current/ongoing work on this later.
  • I mentioned that Dempster–Shafer theory is popular in CS and related fields and, while it’s true that this theory isn’t “popular” in statistics, you shouldn’t interpret this as me saying that it couldn’t or shouldn’t be popular. In fact, there was a discussion paper published in JASA just last year that developed a new computational strategy for a problem motivated by Dempster–Shafer theory and statistical inference. (Incidentally, one of the co-authors of that paper, Ruobin Gong, is a friend of mine and fellow proponent of imprecise probability in statistics; she has another paper, which we’ll discuss more soon, on some of the “paradoxes” of imprecise probability.)
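
For reference, here’s the extension principle in the notation of the example above: if \(\phi = g(\theta)\) and \(\pi_\theta\) is the possibility distribution for \(\theta\), then the marginal possibility distribution for \(\phi\) is obtained via optimization rather than integration:

\[ \pi_\phi(\phi_0) = \sup\{ \pi_\theta(\theta) : g(\theta) = \phi_0 \}. \]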

Week 04a — 09/13/2022

Proved the credal set characterization result from Week 03b, then moved on to new material on probability-to-possibility transforms. Next lecture I’ll finish with my coverage of possibility theory — what’s left is the important extension principle — and then we move on to belief functions.

Week 04a lecture video and Slides

Remarks & references:

  • I had been sloppy with my definition of maxitivity in the slides; I think at one point I said out loud that we’d need to extend the definition to “countable maxitivity” if we were dealing with possibility measures on an infinite space, but then I continued to use the definition that only works for finite spaces. I updated the Week 04a (and Week 03b) slides to include the proper definition, which covers both finite and infinite spaces.
  • A nice paper that describes the probability-to-possibility transform is Dubois et al (2004). This and a more general imprecise-probability-to-possibility transform are discussed in Hose & Hanss (2021), which was a very influential paper for me and my understanding of possibility theory. In fact, after seeing some of the results in this paper, I was able to resolve a question that had been troubling me concerning statistical inference — see Section 7 in my recent paper; in particular, the thing I call “validification” should look semi-familiar. (A small numerical sketch of the discrete transform follows this list.)
  • The result in Section 3.2 of the Dubois et al (2004) paper linked above can be used to formalize the argument I asked you to informally make on Problem 1 in Homework 1.
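
As promised above, a small numerical sketch (mine; the masses are made up) of the discrete probability-to-possibility transform, which assigns to each point the total mass of points no more probable than it, i.e., \(\pi(x) = \sum_{y:\, p(y) \le p(x)} p(y)\).

    # Discrete probability-to-possibility transform
    prob_to_poss <- function(p) sapply(p, function(px) sum(p[p <= px]))
    p <- c(0.50, 0.30, 0.15, 0.05)
    prob_to_poss(p)   # 1.00 0.50 0.20 0.05; note pi >= p and max(pi) = 1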

Week 03b — 09/08/2022

Finished up the introduction to possibility theory from Week 03a and then moved on to some more details, in particular, a characterization of the contents of a possibility measure’s credal set.

Note: Unfortunately, I spent too long on an example at the beginning of lecture that ultimately failed. My argument was effectively right but I overlooked a detail that I was unable to resolve during lecture. So you can skip roughly the first 25 minutes of the lecture video and read the details of that example here.

Week 03b lecture video and Slides (minor update on 09/13/2022)

Remarks & references:

  • Shackle’s potential surprise formulation of uncertainty quantification and his presentation is pretty interesting. It’s in his book Decision, Order, and Time, published in 1961, and probably elsewhere. Modern presentations of possibility theory are much more directly applicable to statistics and machine learning problems, but there’s still value in reading the originals. Of course, I had heard of Shackle from reading about imprecise probability but it wasn’t till I saw his name mentioned in Nassim Taleb’s Black Swan that I thought to check out Shackle for myself. A line where Taleb mentions Shackle (also quoted on Shackle’s Wikipedia page) is here: “Tragically, before the proliferation of empirically blind idiot savants, interesting work had been begun by true thinkers, the likes of J. M. Keynes, Friedrich Hayek, and the great Benoit Mandelbrot, all of whom were displaced because they moved economics away from the precision of second-rate physics. Very sad. One great underestimated thinker is G.L.S. Shackle, now almost completely obscure, who introduced the notion of ‘unknowledge’, that is, the unread books in Umberto Eco’s library. It is unusual to see Shackle’s work mentioned at all, and I had to buy his books from secondhand dealers in London.”
  • A nice introduction to possibility theory is given in Dominik Hose’s 2022 PhD thesis at the University of Stuttgart. I was on Dominik’s thesis committee, which is why I’m familiar with his work. Much of what I’m going to present about possibility theory in my lectures is based on Dominik’s thesis. Unfortunately, there are some absurd copyright restrictions that prevent me from posting a copy of his thesis here publicly on the course website. I’ll find a work-around and update this remark later.

Week 03a — 09/06/2022

NOTE: Weird technical glitch in today’s lecture video — roughly the first 8 minutes have a black screen. During those few minutes, I’m talking about the slide that defines maxitivity in the Week 02b file. After I switch the screen to the iPad (to prove that maxitivity implies 2-alternating), everything is fine. Since it’s just a few minutes of black and only 1 slide, I don’t think this is a major issue, so I’ll keep the video as is. Sorry for the inconvenience 🙁

Finished up details on maxitive capacities from the Week 02b slides, then one more example related to random sets, one that’s relevant in robust statistics (i.e., contamination neighborhoods). Then moved on to the new material on possibility theory, which is a special but very important/powerful imprecise probability framework. Unfortunately, I didn’t get as far into this as I’d hoped, but that’s OK.

Week 03a lecture video and Slides

Remarks & references:

  • In one of my first papers, published in Statistical Science, we compared Dempster’s framework for statistical inference with an early version of what would later be called inferential models. More on our framework (and the name) later in the course.
  • As I mentioned in the lecture, “contamination neighborhoods” are commonly used in robust statistics. In Bayesian robustness, for example, it is typical to take the contamination neighborhood to be a class of prior distributions centered on some “ideal prior”; see, e.g., Wasserman’s 1990 article, especially Example 5.2. In non-Bayesian statistics, contamination neighborhoods are also common, and I’ll simply refer you to Huber & Ronchetti’s Robust Statistics book which, incidentally, has a good bit of discussion about capacities and other imprecise probabilities; they even have a relatively short proof of the claim that 2-monotone capacities are coherent, but I don’t feel like their argument is transparent enough to present in the lecture.
  • A nice, high-level introduction to possibility theory (with a slant towards statistics) is in Didier Dubois’s 2006 survey paper published in the special issue of Computational Statistics & Data Analysis on possibility theory and, more generally, “The Fuzzy Approach to Statistical Analysis”. Students might browse through some of the papers in this special issue for project-related references.
  • Some of the possibility theory literature focuses on what’s called qualitative possibility, i.e., where the goal is simply to order assertions based on their degree of possibility rather than assign numerical possibility values to them. We won’t discuss this in ST790, mainly because our focus is on “uncertainty quantification” (emphasis on quantification), but this perspective is interesting and important. In fact, it’s related to my brief rant in the following sense: it’s naive to expect that, in complex problems, meaningful measures of uncertainty can be expressed simply as numbers. The same Crane who coined the term Naive Probabilism has also written (e.g., here, with an imprecise probability focus) on an abstract description of probability, not as a number but as, roughly, an aggregation of the various pieces of evidence supporting the truthfulness of an assertion.

Week 02b — 09/01/2022

Today we talked about the random level set example at the end of the Week 02a slides, then moved on to new material about random sets and properties of the induced capacities, in particular, higher-order monotonicity and Choquet’s theorem. I’ll start with the material on maxitivity in the next lecture, then transition into discussion of (the related) possibility measures.

Week 02b lecture video and Slides

Remarks & references:

  • I omitted an important technical condition in the random level set example from the Week 02a slides: the function h needs to be upper semicontinuous. I’ve updated the slides in Week 02a and Week 02b accordingly.
  • K-monotonicity of capacities is related to the monotonicity of functions like you study in calculus. Indeed, a function \(f\) from positive reals to reals is 2-monotone if it’s non-negative, non-increasing, and convex; more generally, it’s K-monotone if \( (-1)^k f^{(k)}\) is non-negative, non-increasing, and convex for \( k=0,1,\ldots,K-2 \). This perhaps explains why 2-monotone capacities appeared in Shapley’s work on so-called convex games. Of course, derivatives are like differences, so there’s a way to describe K-monotonicity of capacities in terms of difference operators, so that it looks more like the definition of K-monotonicity of a regular function, but I thought that notation was messier; to see it, check out the formulation in Molchanov’s book, Theory of Random Sets. (A difference-operator-free, set-function version is displayed below.)
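
For reference, one standard set-function version (equivalent, if I have it right, to the difference-operator formulation referenced above) says that a capacity \(\nu\) is K-monotone if, for all events \(A_1, \ldots, A_K\),

\[ \nu\Big(\bigcup_{i=1}^K A_i\Big) \;\ge\; \sum_{\emptyset \neq I \subseteq \{1,\ldots,K\}} (-1)^{|I|+1}\, \nu\Big(\bigcap_{i \in I} A_i\Big), \]

which, for \(K = 2\), reduces to the familiar \(\nu(A \cup B) + \nu(A \cap B) \ge \nu(A) + \nu(B)\).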

Week 02a — 08/30/2022

First we finished what was left over from the Week 01b material, namely, capacities, 2-monotonicity, and the corresponding no-sure-loss property. Then we finished (most of) the new material in the slides below about random sets.

Week 02a lecture video and Slides (small update on 09/02/2022)

Remarks & references:

  • I skipped over the discussion in the Week 01b material about the false confidence theorem, mostly because it’s too soon to appreciate it, though it does add a new perspective on the potential shortcomings of precise probability. Briefly, the false confidence theorem says that data-dependent (precise) probabilities, e.g., fiducial or Bayes posterior probabilities, tend to assign high probability to certain hypotheses about the unknown parameter that happen to be false. This is called false confidence because we’re inclined to be “confident” in hypotheses that are assigned high posterior probability, and that confidence would be “false” if such a hypothesis is actually false. What makes this especially concerning is that, presently, we don’t know which hypotheses are afflicted — if I knew which ones were problematic, then I could try to avoid them. The point is that there’s a risk associated with the use of precise probabilities for uncertainty quantification in the context of statistical inference, one that can lead to systematic errors, and the only way to avoid that risk is through the use of imprecise probabilities. The original paper on false confidence is here; there’s also a comment by other authors and our response. I’ve also written about false confidence here and here. There are still some interesting open questions related to this. In particular, I think there’s a connection between false confidence and the issues identified in, e.g., Fraser’s paper and Gleser & Hwang’s paper.
  • As we’re getting into details of imprecise probabilities, I’d suggest taking a look at Chapter 4 (by Sebastien Destercke and Didier Dubois) of Augustin et al.’s Introduction to Imprecise Probabilities. Their focus is on “special cases” of imprecise probabilities, some of which I’ll cover in lecture. Their chapter doesn’t directly mention random sets, but I’ll make the connection.
  • I only briefly mentioned the measure-theoretic details of random sets in lecture, mostly because I think it’s important for students to know that there is rigorous mathematics behind this. If you’re interested in these details, check out Molchanov’s Theory of Random Sets.

Week 01b — 08/25/2022

Today’s lecture covered some background on precise probability, in particular, De Finetti’s justification for probabilities being (finitely) additive. I didn’t get as far into this today as I’d hoped, in particular, I didn’t get to the part about capacities, so we’ll finish this material at the beginning of the next lecture.

Week 01b lecture video and Slides

Remarks & references:

  • Two recent papers that connect conformal prediction (à la Vovk & Shafer) to imprecise probability (possibility theory specifically) are here and here; the latter is better in my opinion because (a) the setup is more general and (b) it’s newer, so the ideas are more clearly developed.
  • In a footnote on the lecture slides I mentioned Cournot’s Principle. This is related to the subjectivity/objectivity of probability. Roughly, what Cournot’s principle says is that a probability model relates to the real world only through the events it assigns low/high probability to. Consequently, a probability model is sound if and only if a (pre-specified) event having small probability effectively won’t happen. This is important to the inductive logic of statistical inference, as I tried to explain on that same slide. If I have a sound model and my method is such that the probability of an error is small, then my inferences based on the given data are warranted because making an error is an event that effectively won’t happen. Glenn Shafer has written about this in various places, e.g., in Chapter 10 of Game-Theoretic Foundations for Probability & Finance.
  • Chapter 1 of Jay Kadane’s book gives a nice presentation of De Finetti’s coherence result. I also like the little story about the talking bird at the very beginning of that chapter.
  • The Seinfeld episode “The Bizarro Jerry” talks about Superman and the “Bizarro world”.

Week 01a — 08/23/2022

As is typical, the first day’s lecture was pretty high-level. I aimed to introduce roughly what we’re going to be studying and why I think it’s interesting and important.

Week 01a lecture video and Slides

Remarks & references:

  • Some comments about the history and philosophy of imprecise probability are here. Glenn Shafer is an authority on the history of probability so you should browse the list of papers on his website if you’re interested. I like his recent paper looking back at the years before and after publication of his groundbreaking book A Mathematical Theory of Evidence.
  • I briefly mentioned the replication crisis in lecture, but the details of this are beyond the scope of the course. While there are some statistical factors contributing to this, there are bigger issues, e.g., academia’s publish-or-perish mentality. Combating this was one of our motivations behind the development of Researchers.One — see here, here, and/or here.
  • My work that describes the connection between p-values and possibility measures, arguing that possibility theory is the correct language in which to describe “frequentist inference” is here.
  • Fisher’s fiducial argument, as I mentioned, was ambitious. Dissatisfied with the Bayesian solutions at the time, Fisher sought to do what was thought to be impossible: to construct a probability distribution for the unknown parameter based only on data and the posited statistical model — no prior distribution. Savage described this as “enjoying the Bayesian omelette without breaking the eggs.” If you’re interested, I highly recommend the article by Sandy Zabell. You can also check out the article by Brad Efron. This is where “Fisher’s biggest blunder” comes from, though I don’t think Efron meant for this to be taken seriously, as he says later in the article: “Here is a safe prediction for the 21st century: statisticians will be asked to solve bigger and more complicated problems. I believe that there is a good chance… something like fiducial inference will play an important role in this development. Maybe Fisher’s biggest blunder will become a big hit in the 21st century!”
  • In this paper I define a confidence distribution as any distribution in the credal set determined by the possibility measure that I refer to as an inferential model. Here I’m also able to work out cases where the “maximally-diffuse” probability in the credal set agrees with the fiducial distribution, thus leading to certain decision-theoretic optimality results, similar to those in Taraldsen & Lindqvist.