March 12, 2019

Source

  • Much of the R code in this demo is taken from the excellent book "An Introduction to Statistical Learning with Applications in R" by James, Witten, Hastie and Tibshirani. The book's website, http://www.statlearning.com/, has R code, labs and many other resources.

  • The Crime data and Iris data code was provided by Dr. Sumanta Basu.

Recap (PCA)

  • In practice, we observe \(n\) observations of the variables \((x_1, \ldots, x_p)\).
  • Let \(X\) be a data matrix with \(n\) rows and \(p\) columns. Subtract the column mean from each variable.
  • The first principal component direction is the vector \(\beta_1 = (\beta_{11}, \ldots, \beta_{p1})'\) s.t. \[ \max_{\beta_{11}, \ldots, \beta_{p1}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \beta_{j1} x_{ij} \right)^2 \right\} \text{ subject to } \sum_{j=1}^{p}\beta_{j1}^2 = 1 \]

Recap (PCA)

\[ \max_{\beta_{11}, \ldots, \beta_{p1}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \beta_{j1} x_{ij} \right)^2 \right\} \text{ subject to } \sum_{j=1}^{p}\beta_{j1}^2 = 1 \]

  • \(\sum_{j=1}^{p} \beta_{j1} x_{ij}\): Projection of the \(i^{th}\) sample onto \(\beta_1\), also called the score \(y_{i1}\).
  • Maximize: Sample variance of the scores \(y_{i1}\).

Second PC

  • The second principal component direction is the vector \(\beta_2 = (\beta_{12}, \ldots, \beta_{p2})'\) s.t. \[ \max_{\beta_{12}, \ldots, \beta_{p2}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \beta_{j2} x_{ij} \right)^2 \right\} \\ \text{ subject to } \sum_{j=1}^{p}\beta_{j2}^2 = 1, \sum_{j=1}^{p}\beta_{j1} \beta_{j2} = 0 \]
  • First and second principal components must be orthogonal.
  • The scores \((y_{i1}, i = 1, \ldots, n)\) and \((y_{i2}, i = 1, \ldots, n)\) are uncorrelated.
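
A quick numerical check of both claims (maximal score variance and uncorrelated scores), as a minimal sketch using prcomp on simulated data. Note that R's var() uses the \(n-1\) denominator, so the factor \(1/n\) above becomes \(1/(n-1)\):

set.seed(1)
X <- scale(matrix(rnorm(100 * 3), 100, 3), center = TRUE, scale = FALSE)  # centered data
pc <- prcomp(X)
var(pc$x[, 1])              # sample variance of the first PC scores ...
eigen(cov(X))$values[1]     # ... equals the largest eigenvalue of cov(X)
cor(pc$x[, 1], pc$x[, 2])   # first and second scores are uncorrelated (~ 0)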

Optimization for PCA

  • PCA can be performed in two different ways:
  • The singular value decomposition of \(X\): \[ X = U \Sigma B^{T} \] where the \(i\)-th column of \(B\) is the \(i\)-th principal component direction \(\beta_i\).
  • The Eigen-Decomposition of \(X^TX\): \[ X^T X = B \Sigma^2 B^T \]
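
A minimal sketch of the equivalence on a small centered matrix (the directions agree up to sign):

set.seed(1)
X <- scale(matrix(rnorm(60), 20, 3), center = TRUE, scale = FALSE)
sv <- svd(X)                 # X = U %*% diag(d) %*% t(V)
ev <- eigen(crossprod(X))    # X^T X = B %*% Sigma^2 %*% t(B)
sv$v                         # PC directions from the SVD
ev$vectors                   # same directions from the eigen-decomposition
prcomp(X)$rotation           # prcomp agrees (up to sign)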

Using R

  • Two functions: prcomp and princomp (both in R's stats package).
  • Difference:

prcomp: The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using eigen on the covariance matrix. This is generally the preferred method for numerical accuracy.

princomp: The calculation is done by an eigen-decomposition of the covariance (or correlation) matrix.

US Arrest data

Reminiscent of the US Crime data, but different.

Description

This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.

Data

states=row.names(USArrests)
states
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

Variable Names

names(USArrests)
## [1] "Murder"   "Assault"  "UrbanPop" "Rape"
apply(USArrests, 2, mean)
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
apply(USArrests, 2, var)
##     Murder    Assault   UrbanPop       Rape 
##   18.97047 6945.16571  209.51878   87.72916
  • If we failed to scale the variables before performing PCA, then most of the principal components that we observed would be driven by the Assault variable, since it has by far the largest mean and variance.
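
A quick check of this claim (sketch): compare the first loading vector with and without scaling.

round(prcomp(USArrests, scale = FALSE)$rotation[, 1], 3)  # dominated by Assault
round(prcomp(USArrests, scale = TRUE)$rotation[, 1], 3)   # much more balanced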

PCA

pr.out=prcomp(USArrests, scale=TRUE)
names(pr.out)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
pr.out$center
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
pr.out$scale
##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385

Loadings

The rotation matrix provides the principal component loadings; each column of pr.out$rotation contains the corresponding principal component loading vector.

pr.out$rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
dim(pr.out$x)
## [1] 50  4
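
The matrix pr.out$x holds the scores: the centered and scaled data projected onto the loading vectors. A quick sketch of this relationship:

all.equal(unname(pr.out$x),
          unname(scale(USArrests) %*% pr.out$rotation))   # TRUE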

Biplots

  • A biplot provides an efficient way of visualizing objects (observations) and variables together.
  • The emphasis of the PCA of the city crime data was on cities (objects).
  • However, a graphical representation of the data that contains some information about the variables is particularly useful.

Biplot

biplot(pr.out, scale=0,  cex=c(0.6, 0.75),xlim=c(-4,4))

The scale=0 argument to biplot() ensures that the arrows are scaled to represent the loadings.

Fancier version

# library(devtools); install_github("vqv/ggbiplot")
library(ggbiplot)
ggbiplot(pr.out, labels =  rownames(USArrests))

Biplot

The principal components are only unique up to a sign change.

pr.out$rotation=-pr.out$rotation; pr.out$x=-pr.out$x
biplot(pr.out, scale=0,cex=c(1/2, 1),xlim=c(-4,4))

Importance of Scaling

biplot(prcomp(USArrests, scale = FALSE), scale=0, cex=c(1/2, 1))

Standard Deviations

pr.out$sdev
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
pr.var=pr.out$sdev^2
pr.var
## [1] 2.4802416 0.9897652 0.3565632 0.1734301
pve=pr.var/sum(pr.var)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752

Proportion of Variance Explained
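
One way to display these quantities is to plot the PVE and the cumulative PVE computed above (a minimal sketch):

par(mfrow = c(1, 2))
plot(pve, xlab = "Principal Component", ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
plot(cumsum(pve), xlab = "Principal Component", ylab = "Cumulative PVE",
     ylim = c(0, 1), type = "b")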

More Examples

  • Now we will see two more examples, using the princomp function.
  • Recall that the prcomp and princomp functions both perform PCA: one does an SVD of the \(X\) matrix and the other an eigen-decomposition of the \(X^TX\) matrix.
  • They give almost identical results, although prcomp is preferred by some for numerical accuracy.

PCA for Crime Data-set

The following R commands perform PCA on the city crime dataset. The data give crime rates per 100,000 people for the 72 largest US cities in 1994.

The variables are: 1) Murder 2) Rape 3) Robbery 4) Assault 5) Burglary 6) Larceny 7) Motor Vehicle Thefts

The Data

The data are in the "city_crime.txt" file that I read next.

setwd("C:/Users/jd033/OneDrive/Documents/Course Notes/stat5443/R codes")
crime<-read.table("city_crime.txt")
  • This is city-wise crime data rather than state-wise like the last example.

Scatterplot Matrix

pairs(crime)

R

Then apply PCA after centering and rescaling the data, using the R function princomp:

pca.crime<-princomp(scale(crime,scale=TRUE,center=TRUE),cor=FALSE)

Look at the results

summary(pca.crime)
## Importance of components:
##                          Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     1.934449 1.0873855 0.8710005 0.70936242 0.57288477
## Proportion of Variance 0.542114 0.1712944 0.1099039 0.07289747 0.04754564
## Cumulative Proportion  0.542114 0.7134084 0.8233123 0.89620976 0.94375539
##                           Comp.6     Comp.7
## Standard deviation     0.4585486 0.42187341
## Proportion of Variance 0.0304612 0.02578341
## Cumulative Proportion  0.9742166 1.00000000

Loadings

loadings(pca.crime)
## 
## Loadings:
##          Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Murder    0.370  0.339  0.202  0.717  0.277  0.220  0.262
## Rape      0.249 -0.467  0.783 -0.159         0.267       
## Robbery   0.426  0.388               -0.194 -0.147 -0.776
## Assault   0.434        -0.282        -0.767  0.118  0.358
## Burglary  0.450 -0.238                0.242 -0.794  0.220
## Larceny   0.276 -0.606 -0.492  0.209  0.264  0.303 -0.331
## MVT       0.390  0.303 -0.134 -0.644  0.399  0.351  0.204
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.143  0.143  0.143  0.143  0.143  0.143  0.143
## Cumulative Var  0.143  0.286  0.429  0.571  0.714  0.857  1.000

Proportion of Variance

PoV <- pca.crime$sdev^2/sum(pca.crime$sdev^2)
barplot(PoV, main="Prop Var Exp")

How many?

screeplot(pca.crime, type="line")

Calculate the principal component scores

Once we calculate the principal components, we can plot them against each other in order to produce low-dimensional views of the data.

First we calculate the principal component scores.

pcs.crime<-predict(pca.crime)
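
With no new data supplied, predict() on a princomp fit simply returns the scores already stored in the fitted object, so the following check (sketch) should agree:

all.equal(pcs.crime, pca.crime$scores)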

Plot the first 2 PCs.

plot(pcs.crime[,1:2],type="n",xlab='1st PC',ylab='2nd PC') 
text(pcs.crime[,1:2],row.names(crime))

Biplot

biplot(pca.crime, cex = c(0.5,1))

Questions

  1. Which cities are most safe? Which are most crime-prone?
  2. How to interpret high/low values in the second PC?

Subtle difference

  • For scaling, prcomp uses \(n-1\) as denominator but princomp uses \(n\) as its denominator.
pc.cr<-prcomp(crime, scale=TRUE)
pc.cr1<-princomp(crime, cor=TRUE, scores=TRUE)
pc.cr$scale
##     Murder       Rape    Robbery    Assault   Burglary    Larceny 
##   15.49703   28.21087  394.73973  474.47209  562.34922 1345.85980 
##        MVT 
##  728.40358
pc.cr1$scale
##     Murder       Rape    Robbery    Assault   Burglary    Larceny 
##   15.38904   28.01427  391.98890  471.16562  558.43036 1336.48087 
##        MVT 
##  723.32753
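
The two scale vectors differ exactly by the factor \(\sqrt{(n-1)/n}\), as a quick check (sketch):

n <- nrow(crime)
all.equal(pc.cr1$scale, pc.cr$scale * sqrt((n - 1) / n))   # TRUE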

Iris Data

data(iris)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
dat.iris = data.matrix(iris[,1:4])

Add noise and shuffle columns

dat.iris = cbind(dat.iris, array(rnorm(150*6), c(150, 6)))
colnames(dat.iris)[5:10] <- paste('V', 1:6, sep='')
dat.iris = dat.iris[,sample(10, 10, replace=FALSE)]
  • Now we have the original variables as well as noise variables, but we don't know which is which. Let's see if PCA can still detect the species.

Center and Scale before PCA

dat.iris = scale(dat.iris, center=TRUE, scale=TRUE)

Perform principal component analysis on correlation matrix

pca.iris <- princomp(dat.iris, cor=TRUE)

Summarize PCA

summary(pca.iris)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     1.7159958 1.1066430 1.1013361 1.0326773 0.99834832
## Proportion of Variance 0.2944642 0.1224659 0.1212941 0.1066423 0.09966994
## Cumulative Proportion  0.2944642 0.4169300 0.5382242 0.6448664 0.74453635
##                            Comp.6     Comp.7     Comp.8     Comp.9
## Standard deviation     0.91558184 0.90169810 0.86654437 0.36327725
## Proportion of Variance 0.08382901 0.08130595 0.07508991 0.01319704
## Cumulative Proportion  0.82836537 0.90967131 0.98476123 0.99795826
##                            Comp.10
## Standard deviation     0.142889407
## Proportion of Variance 0.002041738
## Cumulative Proportion  1.000000000

Choose no. of PCs to look into

screeplot(pca.iris, type="line")

Calculate PC scores

pcs.iris <- predict(pca.iris)

Set color code for observations from different groups

col.code = as.character(iris$Species)
col.code[col.code == "setosa"] = 'red'
col.code[col.code == "versicolor"] = 'blue'
col.code[col.code == "virginica"] = 'green'

Plot first two principal components

plot(pcs.iris[,1:2], type='p', pch=15, col=col.code, xlab="PC1", ylab="PC2")
legend("topright", col=c('red', 'blue', 'green')
     , legend = c('setosa', 'versicolor', 'virginica'), pch = 15
      )

Importance of PC

  • Plot the first two variables \(X[,1:2]\) instead of the first two principal components, as sketched below.
  • Do we see any clustering?
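
A minimal sketch of that comparison, reusing the colour coding from above (after the shuffle, the first two columns of dat.iris are simply whichever variables landed there):

plot(dat.iris[, 1:2], type = 'p', pch = 15, col = col.code,
     xlab = colnames(dat.iris)[1], ylab = colnames(dat.iris)[2])
legend("topright", col = c('red', 'blue', 'green'),
       legend = c('setosa', 'versicolor', 'virginica'), pch = 15)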

Importance of PC

When we have many variables, projecting the observations onto the top few PC scores helps avoid searching through the variables for the important ones.