Total Variation Distance: The total variation distance between
distributions $P$ and $Q$, with density functions $p$ and $q$, is defined to
be
$$TV(P, Q) = \sup_{A \subset E} |P(A) - Q(A)|.$$
This is equivalent to half the $L^1$ distance between $p$ and $q$:
$$TV(P, Q) = \frac{1}{2} \int_E |p(x) - q(x)| \, dx.$$
This is a genuine metric. Unfortunately, it is hard to estimate.
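To make the half-$L^1$ formula concrete, here is a minimal numerical sketch (assuming NumPy/SciPy; the two Gaussians and the grid are my illustrative choices, not from the notes):

```python
import numpy as np
from scipy.stats import norm

# TV(N(0,1), N(1,1)) via the half-L1 formula, approximating the
# integral with a Riemann sum on a fine grid.
x = np.linspace(-10.0, 11.0, 200_001)
dx = x[1] - x[0]
tv = 0.5 * np.sum(np.abs(norm.pdf(x, 0, 1) - norm.pdf(x, 1, 1))) * dx
print(tv)  # ~0.3829, matching the closed form 2*Phi(1/2) - 1
```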
Kullback-Leibler Divergence: The Kullback-Leibler divergence (known as
relative entropy in Information Theory) between $P$ and $Q$ is defined to
be
$$KL(P \| Q) = \int_E p(x) \log\left(\frac{p(x)}{q(x)}\right) dx,$$
where we assign the value $\infty$ if the support of $p$ is not contained
in the support of $q$ (if it is, then anywhere $q = 0$, we will also have $p = 0$,
and thus the points at which the integrand is not defined will all be
removable discontinuities).
While positive semi-definite, KL-divergence is not a true metric: it is
not symmetric, and it fails to satisfy a triangle inequality. It is,
however, an expectation. Hence, it can be replaced with a sample mean and
estimated.
Professor Rigollet calls the act of replacing an expectation with a
sample mean (i.e., the application of LLN) "the statistical hammer."
The implication here is that it's our simplest (and often only) tool.
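As a toy illustration of the hammer (a sketch, assuming NumPy/SciPy; the Gaussians here are my choice), we can estimate $KL(P \| Q)$ by a sample mean of $\log(p(X)/q(X))$ over draws from $P$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# KL(P||Q) = E_P[log p(X) - log q(X)], estimated by a sample mean.
# P = N(1,1), Q = N(0,1); the true value is (1-0)^2 / 2 = 0.5.
x = rng.normal(1.0, 1.0, size=100_000)  # sample from P
kl_hat = np.mean(norm.logpdf(x, loc=1, scale=1) - norm.logpdf(x, loc=0, scale=1))
print(kl_hat)  # ~0.5
```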
Examples
Let $X_n \sim \text{Poi}(1/n)$ and let $\delta_0$ be a point mass centered at $0$.
Then $TV(X_n, \delta_0) \to 0$.
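A quick numerical check of this (a sketch, assuming NumPy/SciPy; the truncation at $k = 50$ is mine):

```python
import numpy as np
from scipy.stats import poisson

# TV(Poi(1/n), delta_0) via the half-L1 formula on {0, ..., 50};
# the mass Poi(1/n) places beyond 50 is negligible here.
ks = np.arange(51)
delta0 = np.zeros(51)
delta0[0] = 1.0
for n in [1, 10, 100]:
    pn = poisson.pmf(ks, 1.0 / n)
    print(n, 0.5 * np.abs(pn - delta0).sum())  # decreases toward 0
```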
Let $P = \text{Bin}(n, p)$, $Q = \text{Bin}(n, q)$, where $p, q \in (0, 1)$, and
write their densities with one function
$$f(p, k) = \binom{n}{k} p^k (1-p)^{n-k},$$
and similarly for $f(q, k)$. Then it is actually a pretty straightforward
calculation to show that
$$KL(P \| Q) = np \cdot \log\left(\frac{p}{q}\right) + (n - np) \cdot \log\left(\frac{1-p}{1-q}\right).$$
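A numerical sanity check of this closed form (a sketch, assuming NumPy/SciPy; the values of $n$, $p$, $q$ are arbitrary):

```python
import numpy as np
from scipy.stats import binom
from scipy.special import rel_entr

n, p, q = 20, 0.3, 0.5
ks = np.arange(n + 1)

# Direct sum of p(k) log(p(k)/q(k)) versus the closed form above.
kl_direct = rel_entr(binom.pmf(ks, n, p), binom.pmf(ks, n, q)).sum()
kl_formula = n * p * np.log(p / q) + n * (1 - p) * np.log((1 - p) / (1 - q))
print(kl_direct, kl_formula)  # both ~1.6456
```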
Let $P = \mathcal{N}(a, 1)$ and let $Q = \mathcal{N}(b, 1)$. Then (also pretty straightforward to
calculate):
$$KL(P \| Q) = \frac{1}{2}(a - b)^2.$$
Maximum Likelihood Estimation
Definitions
Let $X_1, X_2, \ldots, X_n$ be an iid sample from a distribution with
density $f(x; \theta)$. The likelihood of the sample is
$$L_n(X_1, \ldots, X_n; \theta) = \prod_{i=1}^n f(X_i; \theta),$$
and the log-likelihood is $\ell(\theta) = \log L_n$. The Fisher information is
$$I(\theta) = -\mathbb{E}[\mathbf{H}\ell(\theta)],$$
where in this case the likelihood is of a one-element sample, and the bold
$\mathbf{H}$ denotes the Hessian operator. In one dimension, this reduces to
$$I(\theta) = -\mathbb{E}(\ell''(\theta)).$$
Equivalently, we also have
$$I(\theta) = \text{Var}(\ell'(\theta)).$$
This latter definition is usually harder to work with, but has a more direct
connection to maximum likelihood estimators.
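The two definitions can be checked against each other by simulation. A minimal sketch for the Bernoulli$(\theta)$ model, where $I(\theta) = 1/(\theta(1-\theta))$ (assuming NumPy; the parameter value is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=200_000).astype(float)

# One-sample log-likelihood: l(theta) = X log(theta) + (1-X) log(1-theta).
score = x / theta - (1 - x) / (1 - theta)            # l'(theta)
second = -x / theta**2 - (1 - x) / (1 - theta)**2    # l''(theta)

# Var(l') and -E(l'') should both approximate 1/(theta(1-theta)) ~ 4.762.
print(score.var(), -second.mean(), 1 / (theta * (1 - theta)))
```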
Throughout, we will be discussing ways to estimate the value of a "true"
parameter $\theta^*$ of a distribution $P_{\theta^*}$, given a
model $(E, \{P_\theta : \theta \in \Theta\})$. A noble goal might be to
build an estimator of $TV(P_\theta, P_{\theta^*})$ and compute its
argmin over $\theta$. However, TV distance is hard to estimate in
general, so we use KL-divergence instead. Since KL-divergence is an
expectation, it can be replaced by a sample mean (using LLN), and is therefore
easy to estimate.
For the rest of this section, suppose we are estimating a distribution
$P = P_{\theta^*}$ with a parametric family of
distributions $\{P_\theta : \theta \in \Theta\}$. We will proceed to do
this by estimating the minimizer (argmin) over $\theta$ of $KL(P, P_\theta)$, which is
$\theta^*$: KL is nonnegative and vanishes only when its two arguments agree, so
by identifiability $\theta^*$ is the unique minimizer.
The strategy for doing so will involve first estimating KL divergence and
finding the minimizer of that estimator $\widehat{KL}$. That the argmin of
$\widehat{KL}$ converges to the argmin of $KL$ follows from "nice analytic
properties" of these functions. I'm guessing that $KL$ is at least $C^1$ and
the convergence is relatively strong.
Estimating KL Divergence
Recall that $KL(P, P_\theta)$ is an expectation: if $f_\theta$
and $f$ are the densities of $P_\theta$ and $P$, respectively,
then
$$KL(P, P_\theta) = \mathbb{E}_P\left[\log \frac{f(X)}{f_\theta(X)}\right] = \mathbb{E}_P[\log f(X)] - \mathbb{E}_P[\log f_\theta(X)].$$
The first term does not depend on $\theta$, and replacing the second with a sample mean gives the estimator
$$\widehat{KL} = \text{const} - \frac{1}{n} \sum_{i=1}^n \log f_\theta(X_i),$$
so minimizing $\widehat{KL}$ over $\theta$ is the same as maximizing the log-likelihood.
Therefore, the minimizer of $\widehat{KL}$ is the maximum likelihood estimator
$\hat\theta$ of $\theta^*$. Furthermore (avoiding a bunch of details
necessary for this implication), we have
$$\hat\theta \xrightarrow{(p)} \theta^*.$$
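To see consistency in action, here is a minimal simulation sketch (assuming NumPy/SciPy; the Poisson model and the optimizer bounds are my choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(1)
theta_star = 2.5

# Maximize the log-likelihood numerically; the estimate approaches
# theta_star as n grows.
for n in [100, 10_000]:
    x = rng.poisson(theta_star, size=n)
    res = minimize_scalar(lambda t: -poisson.logpmf(x, t).sum(),
                          bounds=(1e-6, 10.0), method="bounded")
    print(n, res.x)
```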
The Asymptotic Variance of MLE
The MLE is not only consistent, but also satisfies a central limit theorem:
$$\sqrt{n}(\hat\theta - \theta^*) \xrightarrow{(d)} \mathcal{N}(0, V(\theta^*)),$$
where $V(\theta^*)$ represents the asymptotic variance of $\hat\theta$. But
what is this asymptotic variance? It turns out that under some mild conditions,
the asymptotic variance of $\hat\theta$ is known.
Theorem Assume the following.
$\theta^*$ is identifiable.
$\theta^*$ is an interior point of $\Theta$.
The Fisher information matrix $I(\theta)$ is invertible in a neighborhood of
$\theta^*$.
All the functions involved are "nice".
The support of $P_\theta$ does not depend on $\theta$.
Then
$$V(\theta^*) = I(\theta^*)^{-1}.$$
Proof Write $\ell_i(\theta) = \log f_\theta(X_i)$. We start with a couple
of observations:
Since $\hat\theta$ is the unique maximizer of $\log(L_n(X_1, X_2, \ldots, X_n; \theta))$,
$$\frac{d}{d\theta}\bigg|_{\theta = \hat\theta} \sum_{i=1}^n \ell_i = \sum_{i=1}^n \ell_i'(\hat\theta) = 0.$$
Since $\theta^*$ is the unique minimizer of $KL(P, P_\theta)$,
and this differs from $-\mathbb{E}(\ell(\theta))$ by a constant, we have
$$\mathbb{E}(\ell'(\theta^*)) = 0.$$
Now Taylor expand the first observation around $\theta^*$ and divide through by $\sqrt{n}$:
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n \ell_i'(\theta^*) \approx -\sqrt{n}(\hat\theta - \theta^*) \cdot \frac{1}{n} \sum_{i=1}^n \ell_i''(\theta^*).$$
By CLT, the term on the left converges to $\mathcal{N}(0, I(\theta^*))$ (its summands have
mean $\mathbb{E}(\ell'(\theta^*)) = 0$ and variance $\text{Var}(\ell'(\theta^*)) = I(\theta^*)$). By LLN, the
term $\frac{1}{n}\sum_i \ell_i''(\theta^*)$ converges to $\mathbb{E}(\ell''(\theta^*)) = -I(\theta^*)$.
Therefore, rearranging (and using Slutsky to replace the sample average by its limit), we have
$$I(\theta^*) \cdot \sqrt{n}(\hat\theta - \theta^*) \xrightarrow{(d)} \mathcal{N}(0, I(\theta^*)),$$
therefore,
$$\sqrt{n}(\hat\theta - \theta^*) \xrightarrow{(d)} \mathcal{N}(0, I(\theta^*)^{-1}).$$
Remark: This proof only works in one dimension. In higher dimensions, there
is a lack of commutativity that results in a more complicated expression in the
end.
Remark: Recall the original definition of Fisher information as the negative
expected Hessian of the log-likelihood. This adds geometric intuition to the result: if the
log-likelihood is more tightly curved at $\theta^*$, then the MLE will vary less
around the maximum, and vice versa. The word "information" is also more than
superficial with this in mind; i.e., more "information" iff less variance, which
translates to tighter confidence intervals around the MLE.
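The theorem is easy to check by simulation in the Bernoulli model, where $\hat\theta = \bar{X}_n$ and $I(\theta) = 1/(\theta(1-\theta))$ (a sketch, assuming NumPy; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star, n, reps = 0.3, 500, 20_000

# Bernoulli MLE is the sample mean; simulate sqrt(n)(theta_hat - theta*).
theta_hat = rng.binomial(n, theta_star, size=reps) / n
z = np.sqrt(n) * (theta_hat - theta_star)

# Empirical variance should approximate I(theta*)^{-1} = theta*(1-theta*) = 0.21.
print(z.var(), theta_star * (1 - theta_star))
```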
Method of Moments
Requires the model to be well-specified (unlike MLE, which will always find the
distribution $P_\theta$ nearest to $P$). It is computationally simpler, though.
The idea is to estimate the moments of $P$ with the empirical moments
$$m_k = \frac{1}{n} \sum_{i=1}^n X_i^k.$$
By LLN, these converge to the moments of $P$ (provided the model is
well-specified).
Here is how it works. Suppose $\Theta \subset \mathbb{R}^d$ and write
$$M(\theta) = (m_1(\theta), \ldots, m_d(\theta)),$$
where $m_k(\theta)$ is the $k$-th moment of $P_\theta$.
Assume $M$ is one-to-one, so that we can write
$$\theta = M^{-1}(m_1(\theta), \ldots, m_d(\theta)).$$
Then the moments estimator is
$$\hat\theta_n^{MM} = M^{-1}(m_1, \ldots, m_d).$$
We can generalize this to other functions $g_1(x), \ldots, g_d(x)$ which specify
$\theta$, i.e.,
$$\theta = M^{-1}(m_1(\theta), \ldots, m_d(\theta)),$$
where for each $k$,
$$m_k(\theta) = \mathbb{E}_\theta[g_k(X)].$$
Then the generalized method of moments estimator is
$$\hat\theta_n^{GMM} = M^{-1}(m_1, \ldots, m_d),$$
where for each $k$,
$$m_k = \frac{1}{n} \sum_{i=1}^n g_k(X_i).$$
Example: To see a simple example of why we might want to generalize beyond
simply estimating moments directly, consider the normal distribution
$\mathcal{N}(\mu, \sigma^2)$. The GMM estimator has $g_1(x) = x$ and $g_2(x) = x^2 - x$.
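For comparison, here is a sketch of the plain moments estimator for $\mathcal{N}(\mu, \sigma^2)$, using $g_1(x) = x$ and $g_2(x) = x^2$ (a different choice than the one above, so that $M^{-1}(m_1, m_2) = (m_1, m_2 - m_1^2)$; assuming NumPy, with arbitrary parameter values):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=100_000)

# Empirical moments and the inverse moment map for N(mu, sigma^2).
m1, m2 = x.mean(), (x**2).mean()
mu_hat, sigma2_hat = m1, m2 - m1**2
print(mu_hat, sigma2_hat)  # ~1.5 and ~4.0
```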
Asymptotic Normality of GMM estimators
Theorem:
Assume $M$ is one-to-one and $M^{-1}$ is continuously differentiable in a
neighborhood of $\theta^*$.
Let $\Sigma(\theta)$ be the covariance matrix of the vector $(g_1(X_1), \ldots, g_d(X_1))$ (assume this exists).
Then
$$\sqrt{n}(\hat\theta_n^{GMM} - \theta^*) \xrightarrow{(d)} \mathcal{N}(0, \Gamma(\theta^*)) \quad \text{w.r.t. } P_{\theta^*},$$
where
$$\Gamma(\theta) = \left[\frac{\partial M^{-1}}{\partial \theta}(M(\theta))\right]^T \Sigma(\theta) \left[\frac{\partial M^{-1}}{\partial \theta}(M(\theta))\right].$$
MLE versus GMM
In general, the MLE is more accurate. The MLE also still gives good results if the
model is misspecified.
Computational issues: sometimes the MLE is intractable, but MM is easier
(it reduces to solving polynomial equations).
M-Estimation
Suppose we are agnostic about any statistical model, and/or the quantity we are
most interested in estimating is not simply the parameter of a distribution.
In this case, we can still estimate the quantity by optimizing a suitable
objective (e.g., minimizing a cost function). This is called M-Estimation
(the M stands for maximum or minimum), and is the framework for "traditional"
(not statistically motivated) machine learning. The framework is as follows.
Let $X_1, X_2, \ldots, X_n$ be an iid sample from an unspecified
probability distribution $P$.
Let $\mu^*$ be some parameter associated to $P$, e.g., some summary
statistic.
Find a function $\rho : E \times \mathcal{M} \to \mathbb{R}$, where
$\mathcal{M}$ is the set of all possible values for $\mu$, such that the
function
$$Q(\mu) = \mathbb{E}(\rho(X_1, \mu))$$
achieves its minimum (or maximum) at $\mu^*$.
Replace the expectation with an average and proceed as with MLE.
Examples:
Let $E = \mathcal{M} = \mathbb{R}^d$ and let $\mu^* = \mathbb{E}(X)$. An
M-estimator is given by $\rho(x, \mu) = \|x - \mu\|_2^2$.
Let $E = \mathcal{M} = \mathbb{R}^d$ and let $\mu^*$ be a median of
$P$. An M-estimator is given by $\rho(x, \mu) = \|x - \mu\|_1$.
Let $E = \mathcal{M} = \mathbb{R}$ and let $\mu^*$ be the
$\alpha$-quantile of $P$. Then an M-estimator is given by $\rho(x, \mu) = C_\alpha(x - \mu)$, where
$$C_\alpha(x) = \begin{cases} -(1-\alpha)x & : & x < 0 \\ \alpha x & : & x \geq 0. \end{cases}$$
This function is called a check function.
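A sketch of the quantile example (assuming NumPy/SciPy; $\alpha$, the sample, and the optimizer bounds are my choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.normal(0, 1, size=50_000)
alpha = 0.9

# Empirical objective: average check loss C_alpha(x - mu).
def q_hat(mu):
    r = x - mu
    return np.where(r >= 0, alpha * r, -(1 - alpha) * r).mean()

res = minimize_scalar(q_hat, bounds=(-5.0, 5.0), method="bounded")
print(res.x, np.quantile(x, alpha))  # both ~1.28, the 0.9-quantile of N(0,1)
```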
Asymptotic Normality of M-estimators
In the case of MLE, we have asymptotic normality and a known asymptotic variance
(inverse Fisher information). To what extent do these properties generalize to
M-estimators? It turns out they generalize quite well. We will have asymptotic
normality for M-estimators, and the asymptotic variance will have an expression
only marginally less concise than that of the MLE (this is probably subject to
some smoothness conditions on $\rho$). First, we make the following
definitions. In one dimension, let
$$J(\mu) = \frac{\partial^2 Q(\mu)}{\partial \mu^2} = \mathbb{E}\left[\frac{\partial^2 \rho(X_1, \mu)}{\partial \mu^2}\right]$$
and let
$$K(\mu) = \text{Var}\left[\frac{\partial \rho(X_1, \mu)}{\partial \mu}\right].$$
In higher dimensions,
$$J(\mu) = \mathbb{E}[\mathbf{H}\rho]$$
is the expected curvature of the loss, and
$$K(\mu) = \text{Cov}[\nabla_\mu \rho(X_1, \mu)]$$
is the covariance matrix of the loss gradient (as a function of $\mu$ only).
Remark: In the case of MLE, $J(\theta) = K(\theta) = I(\theta)$.
Theorem: With notation as above, assume the following.
$\mu^*$ is the unique minimizer of $Q$;
$J(\mu)$ is invertible in a neighborhood of $\mu^*$;
A "few more technical conditions" (e.g., twice-differentiability of $\rho$,
continuity of $J^{-1}$, etc.).
Then the M-estimator $\mu_n$, the minimizer of the empirical objective
$\frac{1}{n}\sum_{i=1}^n \rho(X_i, \mu)$, satisfies
$$\mu_n \xrightarrow{(p)} \mu^*$$
and
$$\sqrt{n}(\mu_n - \mu^*) \xrightarrow{(d)} \mathcal{N}(0, J(\mu^*)^{-1} K(\mu^*) J(\mu^*)^{-1}).$$
The proof of this theorem is very similar to the MLE case in one dimension.
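As a check on the sandwich formula, consider the median of $\mathcal{N}(0, 1)$ (the second example above, with $d = 1$): there $K(\mu^*) = \text{Var}(\text{sign}(X - \mu^*)) = 1$ and $J(\mu^*) = 2f(0)$, so the asymptotic variance is $1/(4f(0)^2) = \pi/2$. A simulation sketch (assuming NumPy; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 400, 20_000

# The sample median is the M-estimator for rho(x, mu) = |x - mu|.
med = np.median(rng.normal(0, 1, size=(reps, n)), axis=1)
z = np.sqrt(n) * med  # mu* = 0

# Empirical variance should approximate J^{-1} K J^{-1} = pi/2 ~ 1.571.
print(z.var(), np.pi / 2)
```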