2. Statistical Risk and Markov Kernels
Statistical decision theory gives a mathematical language for discussing optimality and accuracy in statistical methods. Informally, the workflow has four stages:
-
run an experiment to study an object of interest;
-
use the observation from the experiment to choose an action or decision;
-
quantify the precision of that action;
-
search for the best decision rule for the workflow.
Because the process moves from one measurable object to the next, a natural general framework is that of Markov kernels.
2.1. Markov kernels
Given two measurable spaces (Z_1,\mathcal{M}_1) and
(Z_2,\mathcal{M}_2), a Markov kernel from Z_1 to Z_2 is a map
K : Z_1 \times \mathcal{M}_2 \to [0,1]
such that:
-
for every
z_1 \in Z_1,K(z_1,\cdot)is a probability measure on(Z_2,\mathcal{M}_2); -
for every
A \in \mathcal{M}_2, the functionK(\cdot,A)is measurable on(Z_1,\mathcal{M}_1).
In Lean, Statlib uses Mathlib's ProbabilityTheory.Kernel interface for this
notion.
2.2. Kernel-indexed statistical risk
For each n \in \mathbb{N}, the kernel-indexed setup consists of:
-
measurable spaces
(\Theta_n,\mathcal{A}_n)for parameters and(\Omega_n,\mathcal{G}_n)for experimental randomness; -
a Markov kernel
K_n : (\Theta_n,\mathcal{A}_n) \to (\Omega_n,\mathcal{G}_n)describing the experimental randomness conditional on a parameter; -
a measurable space
(S_n,\mathcal{F}_n)for the object of interest and a measurable functional\Psi_n : \mathcal{K}_n \to S_n, where\mathcal{K}_nis a class of Markov kernels equipped with the sigma-algebra generated by point evaluationsK_n(\theta_n,A_n); -
a measurable observation space
(\mathcal{X}_n,\mathcal{B}_n)and a measurable observation mapX_n : \Omega_n \to \mathcal{X}_n; -
a measurable decision space
(\mathcal{Y}_n,\mathcal{C}_n)and a randomized decision rule from(\mathcal{X}_n,\mathcal{B}_n)to(\mathcal{Y}_n,\mathcal{C}_n); -
a jointly measurable loss function
\ell_n : \mathcal{Y}_n \times S_n \to [0,+\infty].
The conditional risk at parameter \theta_n is the average loss
\operatorname{Risk}(T_n,\Psi_n;K_n;\theta_n)
=
\int\!\int
\ell_n(y_n,\Psi_n(K_n))\,
dT_n(X_n(\omega),dy_n)\,
dK_n(\theta_n,d\omega).
structure InferenceModelofKernel (ι : Type*) (θ Ω S X Y : ι → Type) [∀ i, MeasurableSpace (θ i)]
[∀ i, MeasurableSpace (Ω i)] [∀ i, MeasurableSpace (S i)] [∀ i, MeasurableSpace (X i)]
[∀ i, MeasurableSpace (Y i)] where
domain (i : ι) : Set (Kernel (θ i) (Ω i))
functional (i : ι) : domain i → S i
measurable_functional (i : ι) : Measurable (functional i)
data (i : ι) : Ω i → X i
measurable_data (i : ι) : Measurable (data i)
decision_rule (i : ι) : Kernel (X i) (Y i)
loss_function (i : ι) : (Y i) → (S i) → ℝ≥0∞
measurable_loss_function (i : ι) : Measurable (loss_function i).uncurrynoncomputable def conditionalRisk (I : InferenceModelofKernel ι θ Ω S X Y) {i : ι} (t : θ i)
{κ : Kernel (θ i) (Ω i)} (hκ : κ ∈ I.domain i) : ℝ≥0∞ :=
∫⁻ ω : Ω i, ∫⁻ y : Y i,
(I.loss_function i) y (I.functional i ⟨κ, hκ⟩) ∂(I.decision_rule i) ((I.data i) ω) ∂κ t2.3. Measure-indexed statistical risk
In many applications, one works at a lower level of generality with a class of
probability measures on \Omega_n. The measure-indexed setup consists of:
-
a sequence indexed by
n \in \mathbb{N}; -
a measurable space
(\Omega_n,\mathcal{G}_n)for experimental randomness; -
a class of probability measures
\mathcal{P}_non(\Omega_n,\mathcal{G}_n); -
a measurable space
(S_n,\mathcal{F}_n)for the object of interest and a measurable functional\Psi_n : \mathcal{P}_n \to S_n; -
a measurable observation space
(\mathcal{X}_n,\mathcal{B}_n)and a measurable observation mapX_n : \Omega_n \to \mathcal{X}_n; -
a measurable decision space
(\mathcal{Y}_n,\mathcal{C}_n)and a randomized decision rule from(\mathcal{X}_n,\mathcal{B}_n)to(\mathcal{Y}_n,\mathcal{C}_n); -
a jointly measurable loss function
\ell_n : \mathcal{Y}_n \times S_n \to [0,+\infty].
The average loss at P_n \in \mathcal{P}_n is
\operatorname{Risk}(T_n,\Psi_n;P_n)
=
\int\!\int
\ell_n(y_n,\Psi_n(P_n))\,
dT_n(X_n(\omega),dy_n)\,
dP_n(\omega).
structure InferenceModelofMeasure (ι : Type*) (Ω S X Y : ι → Type) [∀ i, MeasurableSpace (Ω i)]
[∀ i, MeasurableSpace (S i)] [∀ i, MeasurableSpace (X i)] [∀ i, MeasurableSpace (Y i)] where
domain (i : ι) : Set (Measure (Ω i))
functional (i : ι) : domain i → S i
measurable_functional (i : ι) : Measurable (functional i)
data (i : ι) : Ω i → X i
measurable_data (i : ι) : Measurable (data i)
decision_rule (i : ι) : Kernel (X i) (Y i)
loss_function (i : ι) : (Y i) → (S i) → ℝ≥0∞
measurable_loss_function (i : ι) : Measurable (loss_function i).uncurrynoncomputable def conditionalRisk (I : InferenceModelofMeasure ι Ω S X Y) {i : ι}
{μ : Measure (Ω i)} (hμ : μ ∈ I.domain i) : ℝ≥0∞ :=
∫⁻ ω : Ω i, ∫⁻ y : Y i,
(I.loss_function i) y (I.functional i ⟨μ, hμ⟩) ∂(I.decision_rule i) ((I.data i) ω) ∂μnoncomputable def of_InferenceModelofMeasure [∀ i, Nonempty (θ i)]
(I : InferenceModelofMeasure ι Ω S X Y) :
InferenceModelofKernel ι θ Ω S X Y where
domain (i : ι) := (Kernel.of_measure (θ i) (Ω i)) '' (I.domain i)
functional (i : ι) := (I.functional i) ∘ Kernel.imageToMeasure (θ i) (Ω i) (I.domain i)
measurable_functional (i : ι) :=
(I.measurable_functional i).comp (Kernel.measurable_imageToMeasure (θ i) (Ω i) (I.domain i))
data := I.data
measurable_data := I.measurable_data
decision_rule := I.decision_rule
loss_function := I.loss_function
measurable_loss_function := I.measurable_loss_function2.4. Consistency and rates
The same setup leads to three basic asymptotic notions.
Given a class \mathcal{P}_n, an S_n-valued functional \Psi_n, data
X_n, decision rules T_n, and a loss function \ell_n:
-
T_nis pointwise consistent if, for everyP_n \in \mathcal{P}_n,\operatorname{Risk}(T_n,\Psi_n;P_n) \to 0asn \to \infty; -
T_nis uniformly consistent over\tilde{\mathcal{P}}_n \subseteq \mathcal{P}_nif\sup_{P_n \in \tilde{\mathcal{P}}_n} \operatorname{Risk}(T_n,\Psi_n;P_n) \to 0asn \to \infty; -
T_nhas rate of convergencer_n(\tilde{\mathcal{P}}_n)if
0 <
\liminf_{n\to\infty}
\frac{
\sup_{P_n\in\tilde{\mathcal{P}}_n}
\operatorname{Risk}(T_n,\Psi_n;P_n)}
{r_n(\tilde{\mathcal{P}}_n)}
\le
\limsup_{n\to\infty}
\frac{
\sup_{P_n\in\tilde{\mathcal{P}}_n}
\operatorname{Risk}(T_n,\Psi_n;P_n)}
{r_n(\tilde{\mathcal{P}}_n)}
< \infty.
def IsConsistent (l : Filter ι) (I : InferenceModelofMeasure ι Ω S X Y) : Prop :=
∀ (μ : ∀ i, Measure (Ω i)), (hμ : ∀ i, μ i ∈ I.domain i) →
Tendsto (fun i => conditionalRisk I (hμ i)) l (𝓝 0)
def IsUniformlyConsistent (l : Filter ι) (I : InferenceModelofMeasure ι Ω S X Y) : Prop :=
Tendsto (fun i =>
⨆ (μ : ∀ i, Measure (Ω i)), ⨆ (hμ : ∀ i, μ i ∈ I.domain i), conditionalRisk I (hμ i))
l (𝓝 0)
def HasRateOfConvergence (l : Filter ι) (I : InferenceModelofMeasure ι Ω S X Y) (r : ι → ℝ≥0∞) :
Prop :=
0 < l.liminf (fun i => (⨆ (μ : ∀ i, Measure (Ω i)),
⨆ (hμ : ∀ i, μ i ∈ I.domain i), conditionalRisk I (hμ i)) / r i) ∧
l.limsup (fun i => (⨆ (μ : ∀ i, Measure (Ω i)),
⨆ (hμ : ∀ i, μ i ∈ I.domain i), conditionalRisk I (hμ i)) / r i) < ∞def IsConsistent (l : Filter ι) (I : InferenceModelofKernel ι θ Ω S X Y) : Prop :=
∀ (t : ∀ i, θ i) (κ : ∀ i, Kernel (θ i) (Ω i)), (hκ : ∀ i, κ i ∈ I.domain i) →
Tendsto (fun i => conditionalRisk I (t i) (hκ i)) l (𝓝 0)
def IsUniformlyConsistent (l : Filter ι) (I : InferenceModelofKernel ι θ Ω S X Y) : Prop :=
∀ (t : ∀ i, θ i), Tendsto (fun i =>
⨆ (κ : ∀ i, Kernel (θ i) (Ω i)), ⨆ (hκ : ∀ i, κ i ∈ I.domain i), conditionalRisk I (t i) (hκ i))
l (𝓝 0)2.5. Guiding questions
The roadmap can be organized around the following questions:
-
Does there exist a pointwise consistent sequence of decision rules?
-
Given
\tilde{\mathcal{P}}_n \subseteq \mathcal{P}_n, does there exist a uniformly consistent sequence of decision rules over\tilde{\mathcal{P}}_n? -
Given
\tilde{\mathcal{P}}_n \subseteq \mathcal{P}_nand a candidate rater_n(\tilde{\mathcal{P}}_n), does there exist a decision rule with that rate of convergence? -
If
r_n(P_n)is the rate of convergence ofT_nat someP_n \in \mathcal{P}_n, is there a probability measure\mathbb{Q}on([0,\infty],\mathcal{B}([0,\infty]))such that
P_n \circ
\left(r_n(P_n)\ell_n(X_n,\Psi_n(P_n))\right)
\rightsquigarrow \mathbb{Q}
as n \to \infty?