2. Statistical Risk and Markov Kernels🔗

Statistical decision theory gives a mathematical language for discussing optimality and accuracy in statistical methods. Informally, the workflow has four stages:

run an experiment to study an object of interest;
use the observation from the experiment to choose an action or decision;
quantify the precision of that action;
search for the best decision rule for the workflow.

Because the process moves from one measurable object to the next, a natural general framework is that of Markov kernels.

2.1. Markov kernels🔗

Given two measurable spaces (Z_1,\mathcal{M}_1) and (Z_2,\mathcal{M}_2), a Markov kernel from Z_1 to Z_2 is a map

K : Z_1 \times \mathcal{M}_2 \to [0,1]

such that:

for every z_1 \in Z_1, K(z_1,\cdot) is a probability measure on (Z_2,\mathcal{M}_2);
for every A \in \mathcal{M}_2, the function K(\cdot,A) is measurable on (Z_1,\mathcal{M}_1).

In Lean, Statlib uses Mathlib's ProbabilityTheory.Kernel interface for this notion.

2.2. Kernel-indexed statistical risk🔗

For each n \in \mathbb{N}, the kernel-indexed setup consists of:

measurable spaces (\Theta_n,\mathcal{A}_n) for parameters and (\Omega_n,\mathcal{G}_n) for experimental randomness;
a Markov kernel K_n : (\Theta_n,\mathcal{A}_n) \to (\Omega_n,\mathcal{G}_n) describing the experimental randomness conditional on a parameter;
a measurable space (S_n,\mathcal{F}_n) for the object of interest and a measurable functional \Psi_n : \mathcal{K}_n \to S_n, where \mathcal{K}_n is a class of Markov kernels equipped with the sigma-algebra generated by point evaluations K_n(\theta_n,A_n);
a measurable observation space (\mathcal{X}_n,\mathcal{B}_n) and a measurable observation map X_n : \Omega_n \to \mathcal{X}_n;
a measurable decision space (\mathcal{Y}_n,\mathcal{C}_n) and a randomized decision rule from (\mathcal{X}_n,\mathcal{B}_n) to (\mathcal{Y}_n,\mathcal{C}_n);
a jointly measurable loss function \ell_n : \mathcal{Y}_n \times S_n \to [0,+\infty].

The conditional risk at parameter \theta_n is the average loss

\operatorname{Risk}(T_n,\Psi_n;K_n;\theta_n) = \int\!\int \ell_n(y_n,\Psi_n(K_n))\, dT_n(X_n(\omega),dy_n)\, dK_n(\theta_n,d\omega).

structure InferenceModelofKernel (ι : Type*) (θ Ω S X Y : ι → Type) [∀ i, MeasurableSpace (θ i)]
    [∀ i, MeasurableSpace (Ω i)] [∀ i, MeasurableSpace (S i)] [∀ i, MeasurableSpace (X i)]
    [∀ i, MeasurableSpace (Y i)] where
  domain (i : ι) : Set (Kernel (θ i) (Ω i))
  functional (i : ι) : domain i → S i
  measurable_functional (i : ι) : Measurable (functional i)
  data (i : ι) : Ω i → X i
  measurable_data (i : ι) : Measurable (data i)
  decision_rule (i : ι) : Kernel (X i) (Y i)
  loss_function (i : ι) : (Y i) → (S i) → ℝ≥0∞
  measurable_loss_function (i : ι) : Measurable (loss_function i).uncurry

noncomputable def conditionalRisk (I : InferenceModelofKernel ι θ Ω S X Y) {i : ι} (t : θ i)
    {κ : Kernel (θ i) (Ω i)} (hκ : κ ∈ I.domain i) : ℝ≥0∞ :=
  ∫⁻ ω : Ω i, ∫⁻ y : Y i,
    (I.loss_function i) y (I.functional i ⟨κ, hκ⟩) ∂(I.decision_rule i) ((I.data i) ω) ∂κ t

2.3. Measure-indexed statistical risk🔗

In many applications, one works at a lower level of generality with a class of probability measures on \Omega_n. The measure-indexed setup consists of:

a sequence indexed by n \in \mathbb{N};
a measurable space (\Omega_n,\mathcal{G}_n) for experimental randomness;
a class of probability measures \mathcal{P}_n on (\Omega_n,\mathcal{G}_n);
a measurable space (S_n,\mathcal{F}_n) for the object of interest and a measurable functional \Psi_n : \mathcal{P}_n \to S_n;
a measurable observation space (\mathcal{X}_n,\mathcal{B}_n) and a measurable observation map X_n : \Omega_n \to \mathcal{X}_n;
a measurable decision space (\mathcal{Y}_n,\mathcal{C}_n) and a randomized decision rule from (\mathcal{X}_n,\mathcal{B}_n) to (\mathcal{Y}_n,\mathcal{C}_n);
a jointly measurable loss function \ell_n : \mathcal{Y}_n \times S_n \to [0,+\infty].

The average loss at P_n \in \mathcal{P}_n is

\operatorname{Risk}(T_n,\Psi_n;P_n) = \int\!\int \ell_n(y_n,\Psi_n(P_n))\, dT_n(X_n(\omega),dy_n)\, dP_n(\omega).

structure InferenceModelofMeasure (ι : Type*) (Ω S X Y : ι → Type) [∀ i, MeasurableSpace (Ω i)]
    [∀ i, MeasurableSpace (S i)] [∀ i, MeasurableSpace (X i)] [∀ i, MeasurableSpace (Y i)] where
  domain (i : ι) : Set (Measure (Ω i))
  functional (i : ι) : domain i → S i
  measurable_functional (i : ι) : Measurable (functional i)
  data (i : ι) : Ω i → X i
  measurable_data (i : ι) : Measurable (data i)
  decision_rule (i : ι) : Kernel (X i) (Y i)
  loss_function (i : ι) : (Y i) → (S i) → ℝ≥0∞
  measurable_loss_function (i : ι) : Measurable (loss_function i).uncurry

noncomputable def conditionalRisk (I : InferenceModelofMeasure ι Ω S X Y) {i : ι}
    {μ : Measure (Ω i)} (hμ : μ ∈ I.domain i) : ℝ≥0∞ :=
  ∫⁻ ω : Ω i, ∫⁻ y : Y i,
    (I.loss_function i) y (I.functional i ⟨μ, hμ⟩) ∂(I.decision_rule i) ((I.data i) ω) ∂μ

noncomputable def of_InferenceModelofMeasure [∀ i, Nonempty (θ i)]
    (I : InferenceModelofMeasure ι Ω S X Y) :
    InferenceModelofKernel ι θ Ω S X Y where
  domain (i : ι) := (Kernel.of_measure (θ i) (Ω i)) '' (I.domain i)
  functional (i : ι) := (I.functional i) ∘ Kernel.imageToMeasure (θ i) (Ω i) (I.domain i)
  measurable_functional (i : ι) :=
    (I.measurable_functional i).comp (Kernel.measurable_imageToMeasure (θ i) (Ω i) (I.domain i))
  data := I.data
  measurable_data := I.measurable_data
  decision_rule := I.decision_rule
  loss_function := I.loss_function
  measurable_loss_function := I.measurable_loss_function

2.4. Consistency and rates🔗

The same setup leads to three basic asymptotic notions.

Given a class \mathcal{P}_n, an S_n-valued functional \Psi_n, data X_n, decision rules T_n, and a loss function \ell_n:

T_n is pointwise consistent if, for every P_n \in \mathcal{P}_n, \operatorname{Risk}(T_n,\Psi_n;P_n) \to 0 as n \to \infty;
T_n is uniformly consistent over \tilde{\mathcal{P}}_n \subseteq \mathcal{P}_n if \sup_{P_n \in \tilde{\mathcal{P}}_n} \operatorname{Risk}(T_n,\Psi_n;P_n) \to 0 as n \to \infty;
T_n has rate of convergence r_n(\tilde{\mathcal{P}}_n) if

0 < \liminf_{n\to\infty} \frac{ \sup_{P_n\in\tilde{\mathcal{P}}_n} \operatorname{Risk}(T_n,\Psi_n;P_n)} {r_n(\tilde{\mathcal{P}}_n)} \le \limsup_{n\to\infty} \frac{ \sup_{P_n\in\tilde{\mathcal{P}}_n} \operatorname{Risk}(T_n,\Psi_n;P_n)} {r_n(\tilde{\mathcal{P}}_n)} < \infty.

def IsConsistent (l : Filter ι) (I : InferenceModelofMeasure ι Ω S X Y) : Prop :=
  ∀ (μ : ∀ i, Measure (Ω i)), (hμ : ∀ i, μ i ∈ I.domain i) →
    Tendsto (fun i => conditionalRisk I (hμ i)) l (𝓝 0)

def IsUniformlyConsistent (l : Filter ι) (I : InferenceModelofMeasure ι Ω S X Y) : Prop :=
  Tendsto (fun i =>
    ⨆ (μ : ∀ i, Measure (Ω i)), ⨆ (hμ : ∀ i, μ i ∈ I.domain i), conditionalRisk I (hμ i))
    l (𝓝 0)

def HasRateOfConvergence (l : Filter ι) (I : InferenceModelofMeasure ι Ω S X Y) (r : ι → ℝ≥0∞) :
    Prop :=
  0 < l.liminf (fun i => (⨆ (μ : ∀ i, Measure (Ω i)),
    ⨆ (hμ : ∀ i, μ i ∈ I.domain i), conditionalRisk I (hμ i)) / r i) ∧
    l.limsup (fun i => (⨆ (μ : ∀ i, Measure (Ω i)),
    ⨆ (hμ : ∀ i, μ i ∈ I.domain i), conditionalRisk I (hμ i)) / r i) < ∞

def IsConsistent (l : Filter ι) (I : InferenceModelofKernel ι θ Ω S X Y) : Prop :=
  ∀ (t : ∀ i, θ i) (κ : ∀ i, Kernel (θ i) (Ω i)), (hκ : ∀ i, κ i ∈ I.domain i) →
    Tendsto (fun i => conditionalRisk I (t i) (hκ i)) l (𝓝 0)

def IsUniformlyConsistent (l : Filter ι) (I : InferenceModelofKernel ι θ Ω S X Y) : Prop :=
  ∀ (t : ∀ i, θ i), Tendsto (fun i =>
    ⨆ (κ : ∀ i, Kernel (θ i) (Ω i)), ⨆ (hκ : ∀ i, κ i ∈ I.domain i), conditionalRisk I (t i) (hκ i))
    l (𝓝 0)

2.5. Guiding questions🔗

The roadmap can be organized around the following questions:

Does there exist a pointwise consistent sequence of decision rules?
Given \tilde{\mathcal{P}}_n \subseteq \mathcal{P}_n, does there exist a uniformly consistent sequence of decision rules over \tilde{\mathcal{P}}_n?
Given \tilde{\mathcal{P}}_n \subseteq \mathcal{P}_n and a candidate rate r_n(\tilde{\mathcal{P}}_n), does there exist a decision rule with that rate of convergence?
If r_n(P_n) is the rate of convergence of T_n at some P_n \in \mathcal{P}_n, is there a probability measure \mathbb{Q} on ([0,\infty],\mathcal{B}([0,\infty])) such that

P_n \circ \left(r_n(P_n)\ell_n(X_n,\Psi_n(P_n))\right) \rightsquigarrow \mathbb{Q}

as n \to \infty?