Chapter 10

Non-parametric Statistics and PCA: Data Analysis Beyond Common Sense

#Non-parametric Statistics #Rank Sum Test #PCA (Principal Component Analysis) #Dimensionality Reduction

Non-parametric Statistics and PCA: Getting Closer to the Essence of Data

So far, we have analyzed data under the assumption that it follows a bell-shaped normal distribution. In practice, however, much data is far from normal, and datasets often have so many variables that it is hard to tell which ones matter.

1. Non-parametric Statistics: Setting Aside Assumptions to Gain Freedom

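When the sample is too small to check its distribution, or the distribution is extremely skewed, we analyze the ‘ranks’ of the data instead of the ‘values’ themselves. Because a rank only records order, a single extreme value cannot dominate the result.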

Comparison of Parametric vs. Non-parametric Statistics

Category | Parametric | Non-parametric
Assumptions | Data follow a normal distribution | No specific distribution is assumed
Data Type | Continuous numerical data | Ordinal, rank, nominal data
Representative Tests | t-test, ANOVA | Wilcoxon tests, Kruskal-Wallis test
Pros and Cons | Higher statistical power when the assumptions hold, but unreliable if they are violated | Slightly lower power when normality does hold, but applicable to a much wider range of data
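
To make the idea of replacing values with ranks concrete, here is a minimal sketch of the rank-sum calculation in plain Python. The function names (`average_ranks`, `rank_sum_u`) and the two small samples are hypothetical and chosen only for illustration. The sketch computes the rank sum and the Mann-Whitney U statistic but not a p-value; for a full test, SciPy's `scipy.stats.mannwhitneyu` is one commonly used option.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_u(group_a, group_b):
    """Return (rank sum of group_a, Mann-Whitney U statistic for group_a)."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a = len(group_a)
    rank_sum_a = sum(ranks[:n_a])
    # Subtract the smallest rank sum group_a could possibly have.
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    return rank_sum_a, u_a


# Hypothetical samples: group_b contains one extreme outlier (250).
group_a = [12, 15, 11, 14, 13]
group_b = [16, 18, 17, 19, 250]

rank_sum_a, u_a = rank_sum_u(group_a, group_b)
print(rank_sum_a, u_a)  # 15.0 0.0
```

The outlier 250 simply receives rank 10, so it has no more influence than if it were 20. A U of 0 means every value in `group_a` is smaller than every value in `group_b`.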

2. Principal Component Analysis (PCA): Compression and Summarization of Information

Data with 100 variables is very hard to visualize or interpret directly. PCA (Principal Component Analysis) is a technique that replaces the original variables with a small number of new ones, often 2-3, while retaining as much of the information (variance) in the data as possible.

Step 1: Data Standardization

Since the variables have different units, each one is rescaled to mean 0 and variance 1.

Step 2: Covariance Matrix Calculation

Compute the covariance matrix, a map of how the variables change together.

Step 3: Extraction of Eigenvalues and Eigenvectors

Find the 'principal component directions', the directions along which the data varies the most. The eigenvectors give these directions, and the eigenvalues give the variance along each one.

Step 4: Dimensionality Reduction

Retain only the most important components (here, the 1st and 2nd principal components) and discard the rest.
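
The four steps above can be sketched with NumPy as follows. This is a minimal illustration, not a definitive implementation: the data is randomly generated, and the variable names (`X`, `Z`, `C`, `W`, `scores`) are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples of 5 variables driven by 2 hidden factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))
X[:, 0] *= 1000  # one variable on a much larger scale (different "units")

# Step 1: standardize each variable to mean 0 and variance 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: covariance matrix of the standardized variables (5 x 5).
C = np.cov(Z, rowvar=False)

# Step 3: eigen-decomposition. eigh is meant for symmetric matrices and
# returns eigenvalues in ascending order, so reorder them to descending.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep the first two principal directions and project onto them.
W = eigvecs[:, :2]
scores = Z @ W

print(scores.shape)  # (200, 2)
```

Each column of `scores` is one principal component, and its sample variance equals the corresponding eigenvalue. Without Step 1, the variable multiplied by 1000 would dominate the covariance matrix purely because of its scale.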

3. Effect of PCA: Amount of Explained Variance

The plot below shows what share of the total variance each principal component explains when 10 variables are reduced to 5 principal components through PCA.

Proportion of Explained Variance by Principal Component (Scree Plot)

The plot shows that approximately 75% of the total information (variance) can be explained with only the 1st and 2nd principal components.
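
The proportion of explained variance is each eigenvalue divided by the sum of all eigenvalues. The sketch below shows the arithmetic with hypothetical eigenvalues for 10 standardized variables. They are not the values behind the plot above; they were chosen only so that the first two components add up to 75%, mirroring the statement in the text.

```python
import numpy as np

# Hypothetical eigenvalues, sorted in descending order. For 10 standardized
# variables the eigenvalues sum to 10 (the number of variables).
eigvals = np.array([4.5, 3.0, 1.0, 0.6, 0.4, 0.2, 0.1, 0.1, 0.05, 0.05])

# Share of total variance per component, and the running total.
explained_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_ratio)

for k, (ratio, total) in enumerate(zip(explained_ratio, cumulative), start=1):
    print(f"PC{k}: {ratio:.1%} (cumulative {total:.1%})")

# Smallest number of components whose cumulative share reaches a target.
target = 0.90  # assumed threshold, for illustration only
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components)
```

With these numbers the first two components reach 75% and four components are needed to pass 90%. A cumulative threshold like this is one common way to decide how many components to keep; looking for the "elbow" in the scree plot is another.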


💡 Professor’s Tip

PCA is not simply about ‘reducing data’; it is about ‘finding the skeleton of the data.’ It helps you find the directions along which truly meaningful signal extends amid all the noise.
