Statlib

2. Statistical Risk and Markov Kernels🔗

Statistical decision theory gives a mathematical language for discussing optimality and accuracy in statistical methods. Informally, the workflow has four stages:

  1. run an experiment to study an object of interest;

  2. use the observation from the experiment to choose an action or decision;

  3. quantify the precision of that action;

  4. search for the best decision rule for the workflow.

Because the process moves from one measurable object to the next, a natural general framework is that of Markov kernels.

2.1. Markov kernels🔗

Given two measurable spaces (Z_1,\mathcal{M}_1) and (Z_2,\mathcal{M}_2), a Markov kernel from Z_1 to Z_2 is a map

K : Z_1 \times \mathcal{M}_2 \to [0,1]

such that:

  1. for every z_1 \in Z_1, K(z_1,\cdot) is a probability measure on (Z_2,\mathcal{M}_2);

  2. for every A \in \mathcal{M}_2, the function K(\cdot,A) is measurable on (Z_1,\mathcal{M}_1).

In Lean, Statlib uses Mathlib's ProbabilityTheory.Kernel interface for this notion.

2.2. Kernel-indexed statistical risk🔗

For each n \in \mathbb{N}, the kernel-indexed setup consists of:

  • measurable spaces (\Theta_n,\mathcal{A}_n) for parameters and (\Omega_n,\mathcal{G}_n) for experimental randomness;

  • a Markov kernel K_n : (\Theta_n,\mathcal{A}_n) \to (\Omega_n,\mathcal{G}_n) describing the experimental randomness conditional on a parameter;

  • a measurable space (S_n,\mathcal{F}_n) for the object of interest and a measurable functional \Psi_n : \mathcal{K}_n \to S_n, where \mathcal{K}_n is a class of Markov kernels equipped with the sigma-algebra generated by point evaluations K_n(\theta_n,A_n);

  • a measurable observation space (\mathcal{X}_n,\mathcal{B}_n) and a measurable observation map X_n : \Omega_n \to \mathcal{X}_n;

  • a measurable decision space (\mathcal{Y}_n,\mathcal{C}_n) and a randomized decision rule from (\mathcal{X}_n,\mathcal{B}_n) to (\mathcal{Y}_n,\mathcal{C}_n);

  • a jointly measurable loss function \ell_n : \mathcal{Y}_n \times S_n \to [0,+\infty].

The conditional risk at parameter \theta_n is the average loss

\operatorname{Risk}(T_n,\Psi_n;K_n;\theta_n) = \int\!\int \ell_n(y_n,\Psi_n(K_n))\, dT_n(X_n(\omega),dy_n)\, dK_n(\theta_n,d\omega).

structure InferenceModelofKernel (ι : Type*) (θ Ω S X Y : ι Type) [ i, MeasurableSpace (θ i)] [ i, MeasurableSpace (Ω i)] [ i, MeasurableSpace (S i)] [ i, MeasurableSpace (X i)] [ i, MeasurableSpace (Y i)] where domain (i : ι) : Set (Kernel (θ i) (Ω i)) functional (i : ι) : domain i S i measurable_functional (i : ι) : Measurable (functional i) data (i : ι) : Ω i X i measurable_data (i : ι) : Measurable (data i) decision_rule (i : ι) : Kernel (X i) (Y i) loss_function (i : ι) : (Y i) (S i) ℝ≥0∞ measurable_loss_function (i : ι) : Measurable (loss_function i).uncurrynoncomputable def conditionalRisk (I : InferenceModelofKernel ι θ Ω S X Y) {i : ι} (t : θ i) {κ : Kernel (θ i) (Ω i)} ( : κ I.domain i) : ℝ≥0∞ := ∫⁻ ω : Ω i, ∫⁻ y : Y i, (I.loss_function i) y (I.functional i κ, ) (I.decision_rule i) ((I.data i) ω) κ t

2.3. Measure-indexed statistical risk🔗

In many applications, one works at a lower level of generality with a class of probability measures on \Omega_n. The measure-indexed setup consists of:

  • a sequence indexed by n \in \mathbb{N};

  • a measurable space (\Omega_n,\mathcal{G}_n) for experimental randomness;

  • a class of probability measures \mathcal{P}_n on (\Omega_n,\mathcal{G}_n);

  • a measurable space (S_n,\mathcal{F}_n) for the object of interest and a measurable functional \Psi_n : \mathcal{P}_n \to S_n;

  • a measurable observation space (\mathcal{X}_n,\mathcal{B}_n) and a measurable observation map X_n : \Omega_n \to \mathcal{X}_n;

  • a measurable decision space (\mathcal{Y}_n,\mathcal{C}_n) and a randomized decision rule from (\mathcal{X}_n,\mathcal{B}_n) to (\mathcal{Y}_n,\mathcal{C}_n);

  • a jointly measurable loss function \ell_n : \mathcal{Y}_n \times S_n \to [0,+\infty].

The average loss at P_n \in \mathcal{P}_n is

\operatorname{Risk}(T_n,\Psi_n;P_n) = \int\!\int \ell_n(y_n,\Psi_n(P_n))\, dT_n(X_n(\omega),dy_n)\, dP_n(\omega).

structure InferenceModelofMeasure (ι : Type*) (Ω S X Y : ι Type) [ i, MeasurableSpace (Ω i)] [ i, MeasurableSpace (S i)] [ i, MeasurableSpace (X i)] [ i, MeasurableSpace (Y i)] where domain (i : ι) : Set (Measure (Ω i)) functional (i : ι) : domain i S i measurable_functional (i : ι) : Measurable (functional i) data (i : ι) : Ω i X i measurable_data (i : ι) : Measurable (data i) decision_rule (i : ι) : Kernel (X i) (Y i) loss_function (i : ι) : (Y i) (S i) ℝ≥0∞ measurable_loss_function (i : ι) : Measurable (loss_function i).uncurrynoncomputable def conditionalRisk (I : InferenceModelofMeasure ι Ω S X Y) {i : ι} {μ : Measure (Ω i)} ( : μ I.domain i) : ℝ≥0∞ := ∫⁻ ω : Ω i, ∫⁻ y : Y i, (I.loss_function i) y (I.functional i μ, ) (I.decision_rule i) ((I.data i) ω) μnoncomputable def of_InferenceModelofMeasure [ i, Nonempty (θ i)] (I : InferenceModelofMeasure ι Ω S X Y) : InferenceModelofKernel ι θ Ω S X Y where domain (i : ι) := (Kernel.of_measure (θ i) (Ω i)) '' (I.domain i) functional (i : ι) := (I.functional i) Kernel.imageToMeasure (θ i) (Ω i) (I.domain i) measurable_functional (i : ι) := (I.measurable_functional i).comp (Kernel.measurable_imageToMeasure (θ i) (Ω i) (I.domain i)) data := I.data measurable_data := I.measurable_data decision_rule := I.decision_rule loss_function := I.loss_function measurable_loss_function := I.measurable_loss_function

2.4. Consistency and rates🔗

The same setup leads to three basic asymptotic notions.

Given a class \mathcal{P}_n, an S_n-valued functional \Psi_n, data X_n, decision rules T_n, and a loss function \ell_n:

  • T_n is pointwise consistent if, for every P_n \in \mathcal{P}_n, \operatorname{Risk}(T_n,\Psi_n;P_n) \to 0 as n \to \infty;

  • T_n is uniformly consistent over \tilde{\mathcal{P}}_n \subseteq \mathcal{P}_n if \sup_{P_n \in \tilde{\mathcal{P}}_n} \operatorname{Risk}(T_n,\Psi_n;P_n) \to 0 as n \to \infty;

  • T_n has rate of convergence r_n(\tilde{\mathcal{P}}_n) if

0 < \liminf_{n\to\infty} \frac{ \sup_{P_n\in\tilde{\mathcal{P}}_n} \operatorname{Risk}(T_n,\Psi_n;P_n)} {r_n(\tilde{\mathcal{P}}_n)} \le \limsup_{n\to\infty} \frac{ \sup_{P_n\in\tilde{\mathcal{P}}_n} \operatorname{Risk}(T_n,\Psi_n;P_n)} {r_n(\tilde{\mathcal{P}}_n)} < \infty.

def IsConsistent (l : Filter ι) (I : InferenceModelofMeasure ι Ω S X Y) : Prop := (μ : i, Measure (Ω i)), ( : i, μ i I.domain i) Tendsto (fun i => conditionalRisk I ( i)) l (𝓝 0) def IsUniformlyConsistent (l : Filter ι) (I : InferenceModelofMeasure ι Ω S X Y) : Prop := Tendsto (fun i => (μ : i, Measure (Ω i)), ( : i, μ i I.domain i), conditionalRisk I ( i)) l (𝓝 0) def HasRateOfConvergence (l : Filter ι) (I : InferenceModelofMeasure ι Ω S X Y) (r : ι ℝ≥0∞) : Prop := 0 < l.liminf (fun i => ( (μ : i, Measure (Ω i)), ( : i, μ i I.domain i), conditionalRisk I ( i)) / r i) l.limsup (fun i => ( (μ : i, Measure (Ω i)), ( : i, μ i I.domain i), conditionalRisk I ( i)) / r i) < def IsConsistent (l : Filter ι) (I : InferenceModelofKernel ι θ Ω S X Y) : Prop := (t : i, θ i) (κ : i, Kernel (θ i) (Ω i)), ( : i, κ i I.domain i) Tendsto (fun i => conditionalRisk I (t i) ( i)) l (𝓝 0) def IsUniformlyConsistent (l : Filter ι) (I : InferenceModelofKernel ι θ Ω S X Y) : Prop := (t : i, θ i), Tendsto (fun i => (κ : i, Kernel (θ i) (Ω i)), ( : i, κ i I.domain i), conditionalRisk I (t i) ( i)) l (𝓝 0)

2.5. Guiding questions🔗

The roadmap can be organized around the following questions:

  • Does there exist a pointwise consistent sequence of decision rules?

  • Given \tilde{\mathcal{P}}_n \subseteq \mathcal{P}_n, does there exist a uniformly consistent sequence of decision rules over \tilde{\mathcal{P}}_n?

  • Given \tilde{\mathcal{P}}_n \subseteq \mathcal{P}_n and a candidate rate r_n(\tilde{\mathcal{P}}_n), does there exist a decision rule with that rate of convergence?

  • If r_n(P_n) is the rate of convergence of T_n at some P_n \in \mathcal{P}_n, is there a probability measure \mathbb{Q} on ([0,\infty],\mathcal{B}([0,\infty])) such that

P_n \circ \left(r_n(P_n)\ell_n(X_n,\Psi_n(P_n))\right) \rightsquigarrow \mathbb{Q}

as n \to \infty?