PCA is OK (Davis-Kahan $\sin \theta$ theorem)

Back in 2022 there was a bit of a flareup on Twitter over a paper with the very spicy title

Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated

According to the abstract, there are tens (or even hundreds) of thousands of papers affected by this high bias, and

PCA results may not be reliable, robust, or replicable as the field assumes.

I would love to live in a world where you could get funding to re-evaluate a hundred thousand articles. Maybe if you say that you’ll do it by AI?

I skimmed the paper then and honestly I don’t care to re-read it. I’m just providing some context for this theorem I know of.

The Davis-Kahan theorem

It is certainly true that if you take two random samples from the same population and do a PCA on both of them, you’re not likely to get two exactly identical sets of principal component vectors. Or if you take a single sample from a population, you’re not likely to get the exact true population principal components back. However, there is a theorem saying that if the “eigengap” (described below) is large enough, the top principal components are likely to span the right space.

PCA is an eigenvalue decomposition of a symmetric matrix (the covariance matrix of your data). Suppose for simplicity we’re working in two dimensions and let our covariance matrix $A$ be a symmetric matrix with the eigenvalues $\lambda_1 > \lambda_2$. In PCA terms these correspond to the “variance explained” values of your scree plot. The eigengap between these two eigenvalues is $\delta = \lambda_1 - \lambda_2$. Now let $u_1, u_2$ be the corresponding eigenvectors to $\lambda_1$ and $\lambda_2$. These are the true population principal components.
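
To make that concrete, here is a minimal sketch, assuming numpy; the matrix $A$ is made up for illustration, not taken from any real data.

```python
# A minimal sketch, assuming numpy; the matrix A is invented for illustration.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])             # a symmetric "population" covariance matrix

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(A)
lam1, lam2 = eigvals[-1], eigvals[-2]   # largest and second-largest eigenvalues
u1 = eigvecs[:, -1]                     # true top principal component u_1

delta = lam1 - lam2                     # the eigengap
print(f"lambda_1 = {lam1:.3f}, lambda_2 = {lam2:.3f}, eigengap delta = {delta:.3f}")
```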

We do a random sample and get $A + E$, a perturbation of $A$ (still symmetric), meaning it’s $A$ plus some noise. We don’t know how much noise is in there. Suppose $v_1$ and $v_2$ are the eigenvectors of $A + E$: how reliable/robust/replicable is an inference based on $v_1$ in place of $u_1$?

The DK theorem informs us that if $\theta$ is the angle between $u_1$ and $v_1$ we have

$$\sin \theta \leq \frac{\sqrt{2} \, \lVert E \rVert_F}{\delta},$$

with $\lVert \cdot \rVert_F$ the Frobenius norm.

In other words, the sine of the angle between our observed vector $v_1$ and the hidden true vector $u_1$ is at most the scale of the added noise divided by the size of the eigengap. Multiplied by an annoying constant. We get a vector close to (i.e. with small angle to) the truth if we have little noise or a large eigengap (or, God willing, both).
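
Here is a rough numerical check of that bound, again assuming numpy; the matrix and noise level are invented for illustration.

```python
# A rough check of the Davis-Kahan bound, assuming numpy; A and the noise are made up.
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

noise = 0.05 * rng.standard_normal((2, 2))
E = (noise + noise.T) / 2               # a small symmetric perturbation

def top_eig(M):
    vals, vecs = np.linalg.eigh(M)      # ascending order
    return vals, vecs[:, -1]

vals_A, u1 = top_eig(A)                 # true eigenvalues and top eigenvector
_, v1 = top_eig(A + E)                  # observed top eigenvector

delta = vals_A[-1] - vals_A[-2]         # eigengap of A
cos_theta = abs(u1 @ v1)                # eigenvectors are only defined up to sign
sin_theta = np.sqrt(max(0.0, 1.0 - cos_theta**2))
bound = np.sqrt(2) * np.linalg.norm(E, "fro") / delta

print(f"sin(theta) = {sin_theta:.4f} <= bound = {bound:.4f}")
```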

This extends to several dimensions where instead of looking at the angle between two vectors we look at the angle between subspaces. The space spanned by the top $k$ principal components is robust if there is a large gap between the $k$-th and $(k+1)$-th eigenvalues.
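
A sketch of the subspace version, assuming numpy and scipy (scipy.linalg.subspace_angles computes the principal angles between two spans). The dimensions, $k$, spectrum and noise level are arbitrary choices, and the exact constant depends on which statement of the theorem you use, so the $\sqrt{2}$ below is only there to give the flavour.

```python
# Sketch of the subspace version, assuming numpy and scipy; dimensions, k,
# spectrum and noise level are arbitrary, and the sqrt(2) constant is only
# indicative (different statements of the theorem carry different constants).
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(1)
d, k = 6, 2

# build a symmetric matrix with a clear gap after the k-th eigenvalue
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
spectrum = np.array([10.0, 9.0, 1.0, 0.8, 0.5, 0.2])
A = Q @ np.diag(spectrum) @ Q.T

noise = 0.1 * rng.standard_normal((d, d))
E = (noise + noise.T) / 2

def top_k(M, k):
    vals, vecs = np.linalg.eigh(M)            # ascending order
    return vals[::-1], vecs[:, ::-1][:, :k]   # descending eigenvalues, top-k eigenvectors

vals_A, U = top_k(A, k)
_, V = top_k(A + E, k)

delta = vals_A[k - 1] - vals_A[k]             # gap between k-th and (k+1)-th eigenvalues
theta_max = subspace_angles(U, V).max()       # largest principal angle between the spans
bound = np.sqrt(2) * np.linalg.norm(E, "fro") / delta

print(f"sin(largest principal angle) = {np.sin(theta_max):.4f}, noise/gap bound ~ {bound:.4f}")
```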

Bias?

I don’t believe that a hundred thousand articles are biased, in the technical sense, from their use of PCA, but I 100% believe that many, many articles have large noise and small eigengaps. But you can’t really blame PCA for noisy data.

References

Two nice blog posts on this sort of thing:

Original paper of D and K (did not read):

There is a paper called A useful variant of the Davis–Kahan theorem for statisticians, which has a promising title.

On measuring the angle between subspaces:

this file last touched 2024.05.28