Introduction
Principal Component Analysis (PCA) is a powerful technique used in data analysis, particularly for reducing the dimensionality of datasets while preserving crucial information. It does this by transforming the original variables into a set of new, uncorrelated variables called principal components. Here’s a breakdown of PCA’s key aspects:
- Dimensionality Reduction: PCA helps manage high-dimensional datasets by extracting essential information and discarding less relevant features, simplifying analysis.
- Data Exploration and Visualization: It plays a significant role in data exploration and visualization, aiding in uncovering hidden patterns and insights.
- Linear Transformation: PCA performs a linear transformation of data, seeking directions of maximum variance.
- Feature Selection: Principal components are ranked by the variance they explain, allowing for effective feature selection.
- Data Compression: PCA can compress data while preserving most of the original information.
- Clustering and Classification: It finds applications in clustering and classification tasks by reducing noise and highlighting underlying structure.
- Matrix Requirements: PCA works with symmetric correlation or covariance matrices and requires numeric, standardized data.
- Eigenvalues and Eigenvectors: Eigenvalues represent variance magnitude, and eigenvectors indicate variance direction.
- Number of Components: The number of principal components you choose determines how many eigenvectors are retained for the transformation.
PCA transforms highly correlated variables into a smaller set of independent new variables, so that a few composite indicators represent the different kinds of information contained in the original variables. It reduces the dimensionality of high-dimensional data while minimizing the loss of information carried by the original variables, making it a classic dimensionality-reduction method. PCA retains the most important features of high-dimensional data and removes noise and unimportant features; by accepting a bounded amount of information loss, it saves substantial time and resources, which makes it a widely used data-preprocessing method.
How does PCA work?
The steps involved in PCA are as follows:
- Step 1: Standardization
- Step 2: Covariance Matrix Computation
- Step 3: Identify the Principal Components (PCs)
- Step 4: Feature Vector
- Step 5: Recast the Data Along the PCs
Step 1: Standardization
Standardizing the continuous features is essential: features measured on larger scales would otherwise dominate the variance and be treated as "more important", biasing the result.
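The standardization step can be sketched with NumPy as follows (the toy data values are purely illustrative):

```python
import numpy as np

# Toy data: 4 samples, 3 features on very different scales (illustrative values)
X = np.array([
    [1.0, 100.0, 0.001],
    [2.0, 110.0, 0.002],
    [3.0, 120.0, 0.003],
    [4.0, 130.0, 0.004],
])

# z-score each column: subtract the mean, divide by the standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# After standardization every feature has mean 0 and unit variance
print(X_std.mean(axis=0))  # approximately [0, 0, 0]
print(X_std.std(axis=0))   # [1, 1, 1]
```

In practice you would typically use `sklearn.preprocessing.StandardScaler`, which performs exactly this transformation and remembers the fitted means and standard deviations for later data.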
Step 2: Covariance Matrix Computation
The covariance matrix is a p×p symmetric matrix whose entries are the covariances of all possible pairs of the initial variables. The sign of each covariance is what matters: a positive value means the two variables increase together, while a negative value means they move in opposite directions.
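A minimal sketch of this step, using hypothetical standardized data:

```python
import numpy as np

# Standardized data: 4 samples, 3 features (hypothetical values)
X_std = np.array([
    [-1.2, -1.1,  1.3],
    [-0.4, -0.3,  0.2],
    [ 0.4,  0.5, -0.4],
    [ 1.2,  0.9, -1.1],
])

# Covariance matrix: a p x p symmetric matrix (p = number of features)
cov = np.cov(X_std, rowvar=False)
print(cov.shape)  # (3, 3)

# The sign of an off-diagonal entry tells whether two features
# increase together (positive) or move in opposite directions (negative)
print(np.sign(cov[0, 1]))  # features 0 and 1 co-vary positively
print(np.sign(cov[0, 2]))  # features 0 and 2 move in opposite directions
```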
Step 3: Identify the Principal Components (PCs)
- In this step, you perform eigenvalue decomposition or singular value decomposition (SVD) on the covariance matrix.
- The eigenvectors (or singular vectors) represent the principal components, and the corresponding eigenvalues (or singular values) indicate the amount of variance explained by each principal component.
- You can then sort the eigenvectors in descending order of their eigenvalues to identify the top principal components.
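The decomposition and sorting described above can be sketched with NumPy (the covariance values are hypothetical):

```python
import numpy as np

# A small symmetric covariance matrix (hypothetical values)
cov = np.array([
    [2.0, 0.8, 0.3],
    [0.8, 1.5, 0.2],
    [0.3, 0.2, 1.0],
])

# eigh is intended for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Re-sort in descending order of eigenvalue (variance explained)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Fraction of total variance explained by each principal component
explained = eigenvalues / eigenvalues.sum()
print(explained)
```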
Step 4: Feature Vector
We choose how many principal components to keep, based on the cumulative explained variance.
- The feature vector is the matrix formed by stacking the selected eigenvectors as columns.
- Each column points in the direction of one principal component in the original feature space.
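One common way to pick the number of components is a cumulative-variance threshold; a minimal sketch, with hypothetical eigenvalues and placeholder eigenvectors:

```python
import numpy as np

# Eigenvalues sorted descending, eigenvectors as columns (hypothetical values;
# the identity matrix stands in for a real orthonormal eigenvector matrix)
eigenvalues = np.array([2.6, 1.4, 0.5])
eigenvectors = np.eye(3)

# Keep enough components to explain at least 85% of the total variance
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
k = int(np.searchsorted(cumulative, 0.85) + 1)

# The feature vector stacks the selected eigenvectors as columns: a p x k matrix
feature_vector = eigenvectors[:, :k]
print(k, feature_vector.shape)
```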
Step 5: Recast the Data Along the PCs
Use the selected feature vector to re-orient the data from the original axes to the axes defined by the principal components.
- Finally, you project the original data onto the new lower-dimensional space defined by the selected principal components (eigenvectors).
- This is done by multiplying the standardized data by the feature vectors corresponding to the selected principal components.
- The resulting transformed data represents the original data in terms of the principal components.
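The projection is a single matrix multiplication; a sketch with hypothetical standardized data and a hypothetical 3×2 feature vector:

```python
import numpy as np

# Standardized data (n x p) and a selected feature vector (p x k), hypothetical
X_std = np.array([
    [-1.2, -1.1,  1.3],
    [-0.4, -0.3,  0.2],
    [ 0.4,  0.5, -0.4],
    [ 1.2,  0.9, -1.1],
])
feature_vector = np.array([
    [ 0.60,  0.50],
    [ 0.58, -0.80],
    [-0.55,  0.33],
])

# Project onto the principal components: (n x p) @ (p x k) -> (n x k)
X_pca = X_std @ feature_vector
print(X_pca.shape)  # (4, 2)
```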

Advantages of Principal Component Analysis
- Used for Dimensionality Reduction
- PCA helps eliminate correlated features, a problem often referred to as multicollinearity.
- The time required to train your model is substantially shorter because PCA reduces the number of features.
- PCA helps combat overfitting by removing extraneous features from your dataset.
- Mitigating the curse of dimensionality: by discarding part of the information, PCA increases the effective sampling density of the data (because the dimensionality drops), which is an important way to alleviate the curse of dimensionality.
- Denoising: when the data are affected by noise, the eigenvectors associated with the smallest eigenvalues are often noise-related, so discarding them reduces noise to some extent.
- Overfitting: PCA keeps the dominant information, but that information is dominant only with respect to the training set, and it is not necessarily the important information. Seemingly useless information that gets discarded may in fact be important but simply under-represented in the training set, so PCA can also aggravate overfitting.
- Decorrelated features: PCA not only compresses the data into a lower dimension, it also makes the transformed features mutually uncorrelated.
Disadvantages of Principal Component Analysis
- Useful for quantitative data, but not effective with qualitative (categorical) data.
- The principal components are difficult to interpret in terms of the original variables.
Python

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Example data (replace with your dataset)
data = np.array([
    [1.2, 2.3, 3.4, 4.5],
    [2.1, 3.2, 4.3, 5.4],
    [3.0, 4.1, 5.2, 6.3],
    [4.0, 5.1, 6.2, 7.3],
])

# Create a PCA object and specify the number of components to retain
n_components = 2
pca = PCA(n_components=n_components)

# Fit the PCA model to the data and transform it
principal_components = pca.fit_transform(data)

# Access the fraction of variance explained by each component
explained_variance = pca.explained_variance_ratio_
print("Explained Variance:", explained_variance)

# Visualize the data in the new principal component space
plt.scatter(principal_components[:, 0], principal_components[:, 1])
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA Visualization")
plt.show()
```
