PAC learning asks: when can a learning algorithm generalize from finite samples? The answer is entirely characterized by a combinatorial quantity — the VC dimension.
PAC Learning
A hypothesis class H ⊆ {0,1}^X is (agnostically) PAC learnable if there exists an algorithm A such that: for all distributions D over X×{0,1} and all ε,δ>0, given m ≥ m(ε,δ) i.i.d. samples S ∼ D^m,
Pr_{S∼D^m}[ L_D(A(S)) ≤ min_{h∈H} L_D(h) + ε ] ≥ 1−δ
The question is: which H are PAC learnable, and how large must m be?
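To make the definition concrete, here is a toy simulation (my own illustration, not part of the formal statement): empirical risk minimization over threshold classifiers on [0,1]. The target threshold 0.5, the uniform distribution, and all function names are assumptions of this demo.

```python
import random

T_STAR = 0.5  # illustrative target threshold (an assumption of this demo)

def label(x):
    return 1 if x >= T_STAR else 0

def erm_threshold(samples):
    """Pick the threshold with smallest empirical error among candidate cuts."""
    xs = sorted(x for x, _ in samples)
    candidates = [0.0, 1.0] + xs
    return min(candidates,
               key=lambda t: sum((1 if x >= t else 0) != y for x, y in samples))

def true_error(t):
    # D is uniform on [0,1], so L_D(h_t) is the mass between t and T_STAR.
    return abs(t - T_STAR)

random.seed(0)
for m in (10, 100, 1000):
    S = [(x, label(x)) for x in (random.random() for _ in range(m))]
    print(m, round(true_error(erm_threshold(S)), 4))
```

The true error of the ERM hypothesis shrinks as m grows, which is exactly the behavior the definition demands.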
Shattering and VC Dimension
A set C = {x_1, …, x_k} ⊆ X is shattered by H if for every labeling y ∈ {0,1}^C there exists h ∈ H with h↾C = y; equivalently,
|{h↾C : h ∈ H}| = 2^k
The VC dimension is:
VCdim(H)=sup{∣C∣:C is shattered by H}
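For a finite class, shattering can be checked by brute force. The sketch below (my own illustration; the grid of thresholds is an assumed toy class) confirms that threshold classifiers shatter any single point but not the pair {0.3, 0.7}:

```python
def restrictions(hypotheses, points):
    """All distinct labelings of `points` realized by the class, i.e. H restricted to C."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def shattered(hypotheses, points):
    return len(restrictions(hypotheses, points)) == 2 ** len(points)

# Illustrative finite class: thresholds h_t(x) = [x >= t] for t on a grid.
thresholds = [lambda x, t=t: int(x >= t) for t in (i / 10 for i in range(11))]

print(shattered(thresholds, [0.5]))       # both labelings of a single point occur
print(shattered(thresholds, [0.3, 0.7]))  # labeling (1, 0) needs t <= 0.3 and t > 0.7
```

Since no two points are shattered, this matches VCdim = 1 for thresholds on the line.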
The Fundamental Theorem
Theorem. Let H be a hypothesis class over domain X. The following are equivalent:
- H is PAC learnable.
- H has finite VC dimension.
Moreover, the sample complexity satisfies:
m(ε,δ) = Θ((d + log(1/δ)) / ε²)
where d = VCdim(H).
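The Θ(·) hides constants, but the scaling is easy to see numerically. The sketch below plugs illustrative values into an upper bound of this shape; the constant C is an assumption of the demo, not part of the theorem:

```python
from math import ceil, log

def sample_bound(d, eps, delta, C=8.0):
    """Illustrative upper bound C * (d + log(1/delta)) / eps^2.
    The theorem only fixes the Theta(.) rate; the constant C is an assumption."""
    return ceil(C * (d + log(1 / delta)) / eps ** 2)

# Halving eps quadruples the bound; delta enters only logarithmically.
print(sample_bound(d=3, eps=0.1, delta=0.05))
print(sample_bound(d=3, eps=0.05, delta=0.05))
```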
The Sauer–Shelah Lemma
The key combinatorial engine is:
Lemma. If VCdim(H) ≤ d, then for every finite C ⊆ X with |C| = m:
|H↾C| ≤ ∑_{i=0}^{d} (m choose i) ≤ (em/d)^d
where the second inequality holds for m ≥ d.
This polynomial (rather than exponential) growth is exactly what enables uniform convergence.
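The lemma's two bounds are easy to sanity-check numerically. A small sketch with illustrative values of m and d:

```python
from math import comb, e

def sauer_bound(m, d):
    """Phi_d(m) = sum_{i=0}^{d} C(m, i), the Sauer-Shelah growth bound."""
    return sum(comb(m, i) for i in range(d + 1))

# For m >= d, the exact sum is dominated by the polynomial (e*m/d)^d,
# which in turn grows far more slowly than the trivial 2^m.
for d in (2, 4):
    for m in (d, 3 * d, 20):
        assert sauer_bound(m, d) <= (e * m / d) ** d
print("Sauer-Shelah bound holds on the sampled values")
```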
Example: Halfspaces in R^n
The class of linear classifiers H = {x ↦ sgn(w·x + b) : w ∈ R^n, b ∈ R} has VCdim(H) = n+1. So PAC learning halfspaces in R^n requires O((n + log(1/δ))/ε²) samples, independent of the size of the domain.
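One direction of the VCdim = n+1 claim, that some set of n+1 points is shattered, can be verified constructively. The sketch below shatters {0, e_1, …, e_n}; it assumes the convention sgn(z) = 1 iff z ≥ 0 and labels in {0,1}:

```python
from itertools import product

def halfspace(w, b):
    """Classifier x -> [w.x + b >= 0], using labels {0, 1}."""
    return lambda x: int(sum(wi * xi for wi, xi in zip(w, x)) + b >= 0)

def shatter_simplex(n):
    """Check every labeling of {0, e_1, ..., e_n} in R^n is realized."""
    points = [tuple(0 for _ in range(n))] + \
             [tuple(int(i == j) for j in range(n)) for i in range(n)]
    for y in product((0, 1), repeat=n + 1):
        b = 0.5 if y[0] == 1 else -0.5          # b's sign handles the origin
        w = [1.0 if yi == 1 else -1.0 for yi in y[1:]]  # w_i handles e_i
        h = halfspace(w, b)
        if any(h(p) != yi for p, yi in zip(points, y)):
            return False
    return True

print(shatter_simplex(3))  # all 2^4 labelings of the 4 points in R^3
```

The construction works because w_i + b keeps the sign of w_i whenever |b| < 1, so each coordinate label can be set independently of the origin's label.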