Principal Component Analysis (PCA)


Introduction

Principal component analysis (PCA) is a dimensionality reduction method: it transforms a large set of variables into a smaller one that still contains most of the information in the original data set.

There are five main steps to PCA:

  1. Center the data by subtracting the mean of each feature (and, if the features are on different scales, standardize to unit variance).
  2. Calculate the covariance matrix.
  3. Calculate the eigenvectors and eigenvalues of the covariance matrix.
  4. Sort the eigenvalues in descending order and choose the k eigenvectors that correspond to the k largest eigenvalues.
  5. Construct the projection matrix W from the selected k eigenvectors, and transform the original dataset X via W to obtain its k-dimensional representation Z.

Alternatively, the eigenvalues and eigenvectors of the covariance matrix can be obtained by performing singular value decomposition (SVD) on the data matrix, which is a more numerically stable way to compute the eigendecomposition.
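As a concrete illustration, here is a minimal NumPy sketch of the five steps above, using the eigendecomposition of the covariance matrix (the variable names and example data are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))   # D = 5 features, N = 100 samples (columns)

# 1. Center the data (subtract the per-feature mean).
X_centered = X - X.mean(axis=1, keepdims=True)

# 2. Covariance matrix (D x D).
S = np.cov(X_centered)

# 3. Eigenvalues and eigenvectors of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(S)   # eigh: S is symmetric

# 4. Sort eigenvalues in descending order and keep the top-k eigenvectors.
k = 2
order = np.argsort(eigenvalues)[::-1]
W = eigenvectors[:, order[:k]]                  # projection matrix, D x k

# 5. Project the data onto the k-dimensional subspace.
Z = W.T @ X_centered                            # k x N
```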

Definitions

Let \(X_{D \times N}\) be a data matrix with \(D\) features and \(N\) samples.

\[X_{D \times N} = \begin{bmatrix} x_1 & x_2 & \cdots & x_N \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{2,1} & \cdots & x_{N,1} \\ x_{1,2} & x_{2,2} & \cdots & x_{N,2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1,D} & x_{2,D} & \cdots & x_{N,D} \end{bmatrix}\]

The projection matrix \(W_{D \times k}\) transforms the original data matrix \(X_{D \times N}\) into \(Z_{k \times N}\), its representation in a new k-dimensional feature subspace.

\[W_{D \times k} = \begin{bmatrix} w_1 & w_2 & \cdots & w_k \end{bmatrix} = \begin{bmatrix} w_{1,1} & w_{2,1} & \cdots & w_{k,1} \\ w_{1,2} & w_{2,2} & \cdots & w_{k,2} \\ \vdots & \vdots & \ddots & \vdots \\ w_{1,D} & w_{2,D} & \cdots & w_{k,D} \end{bmatrix}\] \[Z_{k \times N} = W^TX = \begin{bmatrix} w_1^T\\ w_2^T\\ \vdots\\ w_k^T \end{bmatrix} \begin{bmatrix} x_1 & x_2 & \cdots & x_N \end{bmatrix} = \begin{bmatrix} w_1^Tx_1 & w_1^Tx_2 & \cdots & w_1^Tx_N \\ w_2^Tx_1 & w_2^Tx_2 & \cdots & w_2^Tx_N \\ \vdots & \vdots & \ddots & \vdots \\ w_k^Tx_1 & w_k^Tx_2 & \cdots & w_k^Tx_N \end{bmatrix} = \begin{bmatrix} z_1 & z_2 & \cdots & z_N \end{bmatrix}\]

Usually, we want the columns of \(W\) to form an orthonormal basis.

\[w_i^Tw_j = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}\] \[W^TW = \begin{bmatrix} w_1^T\\ w_2^T\\ \vdots\\ w_k^T \end{bmatrix} \begin{bmatrix} w_1 & w_2 & \cdots & w_k \end{bmatrix} = \begin{bmatrix} w_1^Tw_1 & w_1^Tw_2 & \cdots & w_1^Tw_k \\ w_2^Tw_1 & w_2^Tw_2 & \cdots & w_2^Tw_k \\ \vdots & \vdots & \ddots & \vdots \\ w_k^Tw_1 & w_k^Tw_2 & \cdots & w_k^Tw_k \end{bmatrix} = I\]

Since the columns of \(W\) are orthonormal, \(W^TW = I\). When \(k = D\), \(W\) is a square orthogonal matrix with \(W^T = W^{-1}\), and the original data matrix \(X\) can be recovered exactly from \(Z\).

\[\begin{aligned} Z &= W^TX \\ WZ &= WW^TX \\ WZ &= IX \\ X &= WZ \end{aligned}\]

When \(k < D\), \(WW^T \neq I\), so \(WZ = WW^TX\) is only an approximation of \(X\): the projection of each sample onto the subspace spanned by the columns of \(W\). Minimizing this approximation error is exactly the objective derived in the next section.
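As a quick numerical illustration (a sketch using a random orthonormal \(W\) obtained from a QR decomposition, not from PCA):

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, k = 4, 10, 2
X = rng.normal(size=(D, N))

# A D x k matrix with orthonormal columns (W^T W = I).
W, _ = np.linalg.qr(rng.normal(size=(D, k)))
print(np.allclose(W.T @ W, np.eye(k)))   # True

Z = W.T @ X        # project:      k x N
X_hat = W @ Z      # reconstruct:  D x N

# Exact recovery only holds when k = D; for k < D, X_hat is the
# projection of X onto the column space of W.
print(np.allclose(X_hat, X))             # False here, since k < D
```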

Proof

The goal of PCA is to minimize the reconstruction error, which is the sum of the squared distances between the original data points and their projections onto the new subspace.

\[\min_{W} \sum_{i=1}^N ||x_i - Wz_i||^2\] \[\begin{aligned} \sum_{i=1}^N ||x_i - Wz_i||^2 &= \sum_{i=1}^N (x_i - Wz_i)^T(x_i - Wz_i) \\ &= \sum_{i=1}^N (x_i^T - z_i^TW^T)(x_i - Wz_i) \\ &= \sum_{i=1}^N (x_i^Tx_i - x_i^TWz_i - z_i^TW^Tx_i + z_i^TW^TWz_i) \\ &= \sum_{i=1}^N (x_i^Tx_i - 2x_i^TWz_i + z_i^TW^TWz_i) \\ &= \sum_{i=1}^N (x_i^Tx_i - 2x_i^TWz_i + z_i^Tz_i) \\ &= \sum_{i=1}^N (x_i^Tx_i - 2z_i^Tz_i + z_i^Tz_i) \\ &= \sum_{i=1}^N (x_i^Tx_i - z_i^Tz_i) \\ \end{aligned}\]

Since \(\sum_{i=1}^N x_i^Tx_i\) does not depend on \(W\), minimizing the reconstruction error is equivalent to maximizing \(\sum_{i=1}^N z_i^Tz_i\), the variance of the projected data (recall that the data is centered).

\[\max_{W} \sum_{i=1}^N z_i^Tz_i = \max\limits_{W}\;tr(W^TXX^TW) = \max\limits_{W}\;tr(W^TSW)\]

where \(S = XX^T\) is, up to a constant factor of \(1/N\), the covariance matrix of the centered data \(X\) (the factor does not change the maximizer). The constrained optimization, subject to \(W^TW = I\), can be solved using Lagrange multipliers, and the solution is given by the eigenvectors of \(S\): \(w_i\) is the eigenvector corresponding to the \(i\)th largest eigenvalue of \(S\).
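To see why, consider the first direction \(w_1\): maximizing \(w_1^TSw_1\) subject to \(w_1^Tw_1 = 1\) gives the Lagrangian

\[L(w_1, \lambda_1) = w_1^TSw_1 - \lambda_1(w_1^Tw_1 - 1)\]

Setting its gradient with respect to \(w_1\) to zero yields

\[\frac{\partial L}{\partial w_1} = 2Sw_1 - 2\lambda_1 w_1 = 0 \quad\Longrightarrow\quad Sw_1 = \lambda_1 w_1\]

so \(w_1\) must be an eigenvector of \(S\), and since the objective then equals \(w_1^TSw_1 = \lambda_1\), it is maximized by the eigenvector with the largest eigenvalue. The remaining directions follow from the same argument with the additional constraint of being orthogonal to the previously chosen ones.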

Using SVD to obtain the eigendecomposition

Singular value decomposition (SVD) is a matrix factorization method that decomposes a matrix \(X\) into the product of three matrices \(U\), \(\Sigma\), and \(V\).

\[X = U\Sigma V^T\]

where \(U\) and \(V\) are orthogonal matrices, and \(\Sigma\) is a diagonal matrix whose diagonal entries are the singular values of \(X\). The singular values are the square roots of the eigenvalues of \(X^TX\) (equivalently, of \(XX^T\)), and the columns of \(U\) and \(V\) are the eigenvectors of \(XX^T\) and \(X^TX\), respectively.

\[\begin{aligned} X^TX &= (U\Sigma V^T)^T(U\Sigma V^T) \\ &= V\Sigma^T U^T U\Sigma V^T \\ &= V\Sigma^T \Sigma V^T \\ &= V\Sigma^2 V^T\\ XX^T &= (U\Sigma V^T)(U\Sigma V^T)^T \\ &= U\Sigma V^T V\Sigma^T U^T \\ &= U\Sigma \Sigma^T U^T \\ &= U\Sigma^2 U^T \end{aligned}\]

The \(\Sigma^2\) matrix is a diagonal matrix whose diagonal entries are the squared singular values of \(X\), which are the eigenvalues of \(X^TX\) and \(XX^T\).

It is worth noting that the diagonal entries of \(\Sigma\) are sorted in descending order, and the columns of \(U\) and \(V\) are sorted in the same order as the diagonal entries of \(\Sigma\). Therefore, we can easily obtain the eigendecomposition of \(X^TX\) and \(XX^T\) by performing SVD on \(X\).

In our convention, the columns of \(X\) are samples and the rows are features, so the covariance matrix is proportional to \(XX^T\). We therefore take the first \(k\) columns of \(U\), the eigenvectors of \(XX^T\), as the projection matrix \(W\).

\[W = \begin{bmatrix} u_1 & u_2 & \cdots & u_k \end{bmatrix}\]

If your matrix \(X\) is organized such that the number of rows is the number of samples, and the number of columns is the number of features, then the corresponding covariance matrix is \(X^TX\), and you should take the first \(k\) columns of \(V\) to obtain the eigenvectors of \(X^TX\).
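A minimal sketch of the SVD route for a \(D \times N\) (features \(\times\) samples) matrix, with names and data again illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
D, N, k = 5, 100, 2
X = rng.normal(size=(D, N))
X_centered = X - X.mean(axis=1, keepdims=True)

# SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# The columns of U are the eigenvectors of X X^T, already sorted by
# descending singular value, so the first k columns form W.
W = U[:, :k]
Z = W.T @ X_centered

# The squared singular values are the eigenvalues of X X^T
# (proportional to the covariance eigenvalues).
eigenvalues = s**2
```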

Choosing the number of principal components (eigenvectors)

The number of principal components to keep is a hyperparameter that needs to be tuned. One way to do this is to plot the cumulative explained variance ratio as a function of the number of components and choose the smallest number of components that accounts for a large portion of the variance (e.g. 95%). The explained variance ratio is the fraction of the total variance explained by each selected component, calculated by dividing the eigenvalue of that component by the sum of all eigenvalues.
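A short sketch of this selection rule, assuming the eigenvalues have already been computed and sorted in descending order (the values below are illustrative):

```python
import numpy as np

# Eigenvalues of the covariance matrix, sorted in descending order.
eigenvalues = np.array([4.2, 2.1, 0.9, 0.5, 0.3])

explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_variance_ratio)

# Smallest k whose cumulative explained variance reaches 95%.
k = int(np.searchsorted(cumulative, 0.95) + 1)
print(explained_variance_ratio, cumulative, k)
```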

Implementation