ai · 2 min read

VC Dimension and the Fundamental Theorem of Statistical Learning

What makes a hypothesis class learnable? The VC dimension gives a clean combinatorial answer, and the fundamental theorem ties it all together.

#learning-theory #vc-dimension #pac-learning #generalization

PAC learning asks: when can a learning algorithm generalize from finite samples? The answer is entirely characterized by a combinatorial quantity — the VC dimension.

PAC Learning

A hypothesis class $\mathcal{H} \subseteq \{0,1\}^X$ is (agnostically) PAC learnable if there exists an algorithm $A$ such that: for all distributions $\mathcal{D}$ over $X \times \{0,1\}$ and all $\varepsilon, \delta > 0$, given $m \geq m(\varepsilon, \delta)$ i.i.d. samples,

$$\Pr_{S \sim \mathcal{D}^m}\Bigl[L_\mathcal{D}(A(S)) \leq \min_{h \in \mathcal{H}} L_\mathcal{D}(h) + \varepsilon\Bigr] \geq 1 - \delta,$$

where $L_\mathcal{D}(h) = \Pr_{(x,y) \sim \mathcal{D}}[h(x) \neq y]$ is the true error.

The question is: which $\mathcal{H}$ are PAC learnable, and how large must $m(\varepsilon, \delta)$ be?
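To make the definition concrete, here is a minimal simulation, under assumed choices not taken from the post: a hypothetical 1-D threshold class $\{x \mapsto \mathbb{1}[x \geq t]\}$, the uniform distribution on $[0,1]$, and a target threshold at $0.5$. Empirical risk minimization over the sample recovers a hypothesis whose true error shrinks with $m$:

```python
import random

random.seed(0)

# Hypothetical setup: domain X = [0, 1] with the uniform distribution,
# target concept x -> 1[x >= 0.5], hypothesis class of 1-D thresholds.
t_star = 0.5
def label(x): return int(x >= t_star)

m = 2000
S = [(x, label(x)) for x in (random.random() for _ in range(m))]

def emp_err(t):
    # Empirical error of the threshold hypothesis x -> 1[x >= t] on S.
    return sum(int(x >= t) != y for x, y in S) / m

# ERM: pick the candidate threshold with the smallest empirical error
# (sample points plus 0.0 suffice as candidates in one dimension).
t_hat = min([0.0] + [x for x, _ in S], key=emp_err)

# Under the uniform distribution the true error of t_hat is |t_hat - t_star|.
true_err = abs(t_hat - t_star)
assert true_err < 0.05
```

With $m = 2000$ samples the learned threshold lands within a small interval around the target, illustrating the $1 - \delta$ guarantee on a single seeded run.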

Shattering and VC Dimension

A set $C = \{x_1, \ldots, x_k\} \subseteq X$ is shattered by $\mathcal{H}$ if for every labeling $y \in \{0,1\}^C$ there exists $h \in \mathcal{H}$ with $h \upharpoonright C = y$:

$$|\{h \upharpoonright C : h \in \mathcal{H}\}| = 2^k.$$

The VC dimension is:

$$\mathrm{VCdim}(\mathcal{H}) = \sup\{|C| : C \text{ is shattered by } \mathcal{H}\}.$$
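For finite classes over a finite domain, shattering can be checked by brute force. A sketch, using two illustrative classes not mentioned above (1-D thresholds, with $\mathrm{VCdim} = 1$, and intervals, with $\mathrm{VCdim} = 2$):

```python
from itertools import combinations

def shatters(hypotheses, points):
    # C is shattered iff restricting H to C realizes all 2^|C| labelings.
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

def vc_dimension(hypotheses, domain):
    # Largest k such that some size-k subset of the domain is shattered.
    d = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(hypotheses, C) for C in combinations(domain, k)):
            d = k
    return d

domain = range(8)
thresholds = [lambda x, t=t: int(x >= t) for t in range(9)]
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in range(9) for b in range(9)]

assert vc_dimension(thresholds, domain) == 1  # (1,0) on x1 < x2 is unrealizable
assert vc_dimension(intervals, domain) == 2   # (1,0,1) on x1 < x2 < x3 is unrealizable
```

Note the exponential cost: this enumerates every subset of the domain, so it is a definitional check, not a practical algorithm.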

The Fundamental Theorem

Theorem. Let $\mathcal{H}$ be a hypothesis class over domain $X$. The following are equivalent:

  1. $\mathcal{H}$ is PAC learnable.
  2. $\mathcal{H}$ has finite VC dimension.

Moreover, the sample complexity satisfies:

$$m(\varepsilon, \delta) = \Theta\!\left(\frac{d + \log(1/\delta)}{\varepsilon^2}\right),$$

where $d = \mathrm{VCdim}(\mathcal{H})$.
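The theorem pins down the rate only up to constants. A small sketch of how the bound scales, with a placeholder constant `C = 1.0` that is an assumption, not part of the theorem:

```python
from math import ceil, log

def sample_bound(d, eps, delta, C=1.0):
    # Theta((d + log(1/delta)) / eps^2); C is a placeholder constant,
    # since the theorem fixes the rate only up to constant factors.
    return ceil(C * (d + log(1 / delta)) / eps ** 2)

assert sample_bound(3, 0.1, 0.05) == 600    # d = 3, eps = 0.1, delta = 0.05
assert sample_bound(6, 0.1, 0.05) == 900    # extra capacity costs samples linearly in d
assert sample_bound(3, 0.05, 0.05) == 2399  # halving eps roughly quadruples the cost
```

The dominant dependence is on $d/\varepsilon^2$; the confidence parameter $\delta$ enters only logarithmically.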

The Sauer–Shelah Lemma

The key combinatorial engine is:

Lemma. If $\mathrm{VCdim}(\mathcal{H}) \leq d$, then for every set $C$ of $m \geq d$ points:

$$|\mathcal{H} \upharpoonright C| \leq \sum_{i=0}^{d} \binom{m}{i} \leq \left(\frac{em}{d}\right)^d.$$

This polynomial (rather than exponential) growth is exactly what enables uniform convergence.
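The gap is easy to see numerically. A quick check of the lemma's bound at $m = 20$, $d = 3$:

```python
from math import comb, e

def sauer_bound(m, d):
    # Upper bound on the number of distinct behaviors of H on m points
    # when VCdim(H) <= d: sum_{i=0}^{d} C(m, i).
    return sum(comb(m, i) for i in range(d + 1))

m, d = 20, 3
assert sauer_bound(m, d) == 1351              # versus 2**20 = 1048576 labelings
assert sauer_bound(m, d) <= (e * m / d) ** d  # the (em/d)^d bound, valid for m >= d
```

At $m = 20$ a class of VC dimension 3 can exhibit at most 1351 of the $2^{20} \approx 10^6$ possible labelings, which is the polynomial-versus-exponential gap driving uniform convergence.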

Example: Halfspaces in $\mathbb{R}^n$

The class of linear classifiers $\mathcal{H} = \{\mathrm{sgn}(w \cdot x + b) : w \in \mathbb{R}^n, b \in \mathbb{R}\}$ has $\mathrm{VCdim} = n + 1$. So PAC learning halfspaces in $\mathbb{R}^n$ requires $O((n + \log(1/\delta))/\varepsilon^2)$ samples, independent of the size of the domain.
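The lower bound $\mathrm{VCdim} \geq n + 1$ can be witnessed directly for $n = 2$: three points in general position are shattered. A sketch, using an assumed finite parameter grid (a grid can only witness shattering, i.e. the lower bound; the matching upper bound, that no four points in $\mathbb{R}^2$ are shattered, follows from Radon's theorem):

```python
from itertools import product

def halfspace(w1, w2, b):
    # sgn(w . x + b) encoded as a 0/1 classifier on points p = (p[0], p[1]).
    return lambda p: int(w1 * p[0] + w2 * p[1] + b >= 0)

# A small, hypothetical grid of parameters suffices to shatter these 3 points.
grid = [halfspace(w1, w2, b)
        for w1, w2, b in product((-1, 0, 1), (-1, 0, 1), (-1, -0.5, 0.5, 1))]
points = [(0, 0), (1, 0), (0, 1)]
labelings = {tuple(h(p) for p in points) for h in grid}
assert len(labelings) == 8  # all 2^3 labelings realized, so VCdim >= 3 = n + 1
```

The same pattern generalizes: the origin together with the $n$ standard basis vectors is shattered by halfspaces in $\mathbb{R}^n$.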
