Principal Component Analysis (PCA)
Principal component analysis (PCA) is a standard exploratory tool for multivariate data analysis. It is used
- to reduce the dimensionality of a complex data set without much loss of information,
- to extract the most important information from the data table,
- to identify noise and outliers in the data set,
- to visualize grouping patterns based on similarities and dissimilarities in the data set.
PCA works by converting a set of correlated variables into a new set of uncorrelated variables called principal components.
PCA redistributes the total variance of the data set so that the first principal component has the maximum variance, followed by the second component, and so on:
Variance PC1 ≥ Variance PC2 ≥ … ≥ Variance PCk
Total variance = Variance PC1 + Variance PC2 + … + Variance PCk
The covariance between any two distinct principal components is zero (they are uncorrelated), and their directions are orthogonal to each other.
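The properties listed above can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using synthetic data invented purely for illustration; it computes the principal components from the eigendecomposition of the covariance matrix, which is one common way to perform PCA and not necessarily the procedure used elsewhere in this text.

```python
import numpy as np

# Synthetic, illustrative data: 200 samples of 3 variables, two of them correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
data = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.5 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center the data and compute the sample covariance matrix (variables in columns).
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order,
# so reorder them from largest to smallest.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the centered data onto the principal axes to get the component scores.
scores = centered @ eigvecs
score_cov = np.cov(scores, rowvar=False)
off_diagonal = score_cov - np.diag(np.diag(score_cov))

print("component variances:", eigvals)
print("total variance (original variables):", np.trace(cov))
print("total variance (principal components):", eigvals.sum())
print("largest covariance between components:", np.abs(off_diagonal).max())
```

The component variances come out in non-increasing order, their sum matches the total variance of the original variables, and the covariances between different components are zero up to floating-point error.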
The Basics of PCA
Below we have a data set represented by two variables. Let’s say we are trying to understand some phenomenon using these two variables in our system. However, we can’t figure out what is happening because the data appears clouded and redundant. So how can we re-express the data to make it easier to understand the important underlying information?
From the above figure, we can see there’s an obvious correlation between the two variables. So, first, let’s decorrelate them. The most obvious way is to rotate the axes so that the two original variables are transformed into two new, uncorrelated variables.
The second figure above shows that most of the variance in the data lies along the first of the new variables. To further simplify the data set, the next step is to reduce the dimension by projecting the data points onto that first new axis. Even if the second new variable is ignored, the main variance (important information) of the data is preserved, as illustrated by the following figure:
This example demonstrates the purpose of principal component analysis – to reveal important underlying information by reducing a complex data set to a lower dimension.
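The two-variable example can be reproduced with a short sketch. It assumes NumPy and uses synthetic data; the names x1, x2 (original variables) and y1, y2 (rotated variables) are placeholders chosen here, not the symbols of the original figures. The rotation angle is taken from the standard closed-form expression for diagonalizing a 2×2 covariance matrix.

```python
import numpy as np

# Synthetic, illustrative data: x2 is strongly correlated with x1.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=500)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)  # center the data

# Angle of the axis rotation that removes the correlation (2x2 case only).
c = np.cov(X, rowvar=False)
theta = 0.5 * np.arctan2(2.0 * c[0, 1], c[0, 0] - c[1, 1])

# Columns of R are the directions of the new axes y1 and y2.
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Y = X @ R  # coordinates of each point along y1 and y2

new_cov = np.cov(Y, rowvar=False)
retained = new_cov[0, 0] / np.trace(new_cov)

# Dimension reduction: keep only the coordinate along y1.
reduced = Y[:, 0]

print("covariance between y1 and y2:", new_cov[0, 1])
print("fraction of total variance kept by y1:", retained)
```

With this synthetic data set, nearly all of the variance falls along y1, so dropping y2 loses little information.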
In this example, we only dealt with a two-dimensional problem, but what if the dimension of the data set is ten? If we can use PCA to decorrelate it and reduce its dimension, say into two dimensions, while simultaneously preserving the important information, that would make it easier for us to apply those data to our trading system. That’s the power of PCA. It is
Traditionally, principal component analysis is performed on the symmetric covariance matrix or on the symmetric correlation matrix. These matrices can be calculated from the data matrix. The covariance matrix contains scaled sums of squares and cross products. A correlation matrix is like a covariance matrix, except that the variables (i.e. the columns) have first been standardized. The data should be standardized first if the variances of the variables differ greatly or if the variables are measured in different units. You can standardize the data in the TableOfReal by choosing Standardize columns.
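The relationship between the two matrices can be illustrated with a small sketch, assuming NumPy and a made-up table of heights (cm) and weights (kg). It standardizes the columns by hand rather than through the TableOfReal command mentioned above, and shows that the covariance matrix of the standardized columns equals the correlation matrix of the raw data.

```python
import numpy as np

# Hypothetical data: height in cm and weight in kg (different units and scales).
data = np.array([
    [170.0, 65.0],
    [160.0, 58.0],
    [180.0, 80.0],
    [175.0, 72.0],
    [165.0, 60.0],
])

# Standardize each column: subtract its mean, divide by its sample standard deviation.
standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

cov_of_standardized = np.cov(standardized, rowvar=False)
corr_of_raw = np.corrcoef(data, rowvar=False)

print(cov_of_standardized)
print(corr_of_raw)
```

Both printed matrices agree, with ones on the diagonal.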
Covariance and Correlation
- Variance and covariance are measures of the “spread” of a set of points around their center of mass (mean).
• Variance is a measure of the deviation from the mean for points in one dimension, e.g. heights.
• Covariance is a measure of how much two dimensions vary from their means with respect to each other.
• Covariance is measured between 2 dimensions to see if there is a relationship between them, e.g. number of hours studied and marks obtained.
• The covariance between one dimension and itself is the variance.
• So, if you had a 3-dimensional data set (x, y, z), you could measure the covariance between the x and y dimensions, the y and z dimensions, and the x and z dimensions. Measuring the covariance between x and x, or y and y, or z and z would give you the variance of the x, y and z dimensions respectively.
The correlation between two such variables is the covariance normalized so that it lies between -1 and 1. The normalization is done by dividing the covariance by the product of the standard deviations of the two variables.
Variables that are directly related will have correlations near 1, while those that are inversely related will have correlations near -1. The correlation is near zero for variables that have no linear relationship.
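These definitions translate directly into code. The sketch below uses only the Python standard library and invented numbers for hours studied and marks obtained. It divides by n - 1 (the sample covariance), which is an assumption, since the text does not say which scaling it uses.

```python
import math

# Hypothetical observations: hours studied and marks obtained.
hours = [2.0, 4.0, 6.0, 8.0, 10.0]
marks = [50.0, 60.0, 65.0, 80.0, 90.0]


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance of two equally long sequences (divides by n - 1)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def correlation(a, b):
    """Covariance divided by the product of the two standard deviations."""
    std_a = math.sqrt(covariance(a, a))  # cov(a, a) is the variance of a
    std_b = math.sqrt(covariance(b, b))
    return covariance(a, b) / (std_a * std_b)


print("variance of hours:", covariance(hours, hours))
print("covariance of hours and marks:", covariance(hours, marks))
print("correlation of hours and marks:", correlation(hours, marks))
```

Reversing the sign of one variable, for example `correlation(hours, [-m for m in marks])`, flips the sign of the correlation without changing its magnitude.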
Covariance Matrix
The covariance matrix collects the covariances between every pair of dimensions of the data.
• Representing the covariances between dimensions as a matrix, e.g. for 3 dimensions:
      | cov(x,x)  cov(x,y)  cov(x,z) |
  C = | cov(y,x)  cov(y,y)  cov(y,z) |
      | cov(z,x)  cov(z,y)  cov(z,z) |
- The diagonal holds the variances of x, y and z.
- cov(x,y) = cov(y,x), hence the matrix is symmetric about the diagonal.
- N-dimensional data results in an N x N covariance matrix.
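The three properties above can be confirmed on any data set. This is a brief sketch assuming NumPy, with random numbers standing in for the x, y and z dimensions.

```python
import numpy as np

# Illustrative data: 50 observations of three dimensions x, y, z (one per column).
rng = np.random.default_rng(2)
xyz = rng.normal(size=(50, 3))

# rowvar=False tells NumPy that each column is a variable.
C = np.cov(xyz, rowvar=False)

print("shape:", C.shape)
print("symmetric:", np.allclose(C, C.T))
print("diagonal equals variances:", np.allclose(np.diag(C), xyz.var(axis=0, ddof=1)))
```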
The general rule is that each term in the first factor multiplies each term in the other factor, i.e.
(Var1 - λ) × (Var2 - λ) = Var1 × Var2 + Var1 × (-λ) + (-λ) × Var2 + (-λ) × (-λ)
= Var1 × Var2 - λ (Var1 + Var2) + λ²
In the expression above, -λ has been factored out of the two middle terms.
Substituting this product of (Var1 - λ) × (Var2 - λ) into the expression,
the net result is as below:
λ² - λ (Var1 + Var2) + [Var1 × Var2 - Cov(1, 2)²]
λ² - λ (6.4228 + 9.9528) + [6.4228 × 9.9528 - 7.9876 × 7.9876]
λ² - 16.3756 λ + 0.12214
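The expansion above can be checked numerically. The sketch below uses only the Python standard library and assumes that the quadratic is set equal to zero and solved with the usual quadratic formula. The variable names are chosen here for illustration. Because the variances and the covariance are entered as the rounded values shown in the text, the last digits of the results may differ slightly from the printed figures.

```python
import math

# Entries of the 2x2 covariance matrix, as given in the text (rounded values).
var1, var2, cov12 = 6.4228, 9.9528, 7.9876

# Coefficients of  λ² - (Var1 + Var2) λ + [Var1 × Var2 - Cov(1, 2)²]
trace = var1 + var2
constant = var1 * var2 - cov12 ** 2

# Quadratic formula for the two roots (the eigenvalues).
root = math.sqrt(trace ** 2 - 4.0 * constant)
lambda_large = (trace + root) / 2.0
lambda_small = (trace - root) / 2.0

print("sum of variances:", trace)
print("constant term:", constant)
print("larger eigenvalue:", lambda_large)
print("smaller eigenvalue:", lambda_small)
```

The two roots add up to Var1 + Var2, so the total variance is preserved, and almost all of it is carried by the larger eigenvalue.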
Here λ = 16.368099, so substituting this value of λ into the covariance matrix gives