Lab 5

Introduction to Statistical Learning - PISE

Author

Affiliation

Aldo Solari

Ca’ Foscari University of Venice

Unsupervised Learning

Principal Components Analysis

PCA is performed on the USArrests dataset (included in base R)
Rows represent the 50 U.S. states
States are listed in alphabetical order

states <- row.names(USArrests)
states

 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
 [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
 [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
[13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
[17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
[21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
[25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
[33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
[41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
[45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
[49] "Wisconsin"      "Wyoming"

Columns of the dataset correspond to the four variables

names(USArrests)

[1] "Murder"   "Assault"  "UrbanPop" "Rape"

Initial data exploration is performed
Variables have very different mean values

apply(USArrests, 2, mean)

  Murder  Assault UrbanPop     Rape 
   7.788  170.760   65.540   21.232

apply() applies a function (e.g., mean()) to rows or columns of a dataset
Second argument specifies dimension (1 = rows, 2 = columns)
Data shows large scale differences: rapes ≈ 3× murders, assaults ≈ 8× rapes
apply() can also be used to compute variances of variables

apply(USArrests, 2, var)

    Murder    Assault   UrbanPop       Rape 
  18.97047 6945.16571  209.51878   87.72916

Variables have very different variances
UrbanPop (percentage) is not comparable to crime rates (per 100,000 people)
Without scaling, PCA would be dominated by Assault (largest mean and variance)
Therefore, variables must be standardized (mean = 0, sd = 1) before PCA
PCA is performed using the prcomp() function in R

pr.out <- prcomp(USArrests, scale = TRUE)

prcomp() centers variables to mean 0 by default
scale = TRUE standardizes variables to standard deviation 1
Output of prcomp() includes several useful components

names(pr.out)

[1] "sdev"     "rotation" "center"   "scale"    "x"

center: means of variables used for centering
scale: standard deviations used for scaling before PCA

pr.out$center

  Murder  Assault UrbanPop     Rape 
   7.788  170.760   65.540   21.232

pr.out$scale

   Murder   Assault  UrbanPop      Rape 
 4.355510 83.337661 14.474763  9.366385

rotation matrix contains the principal component loadings
Each column represents a principal component loading vector
Multiplying the data matrix X by rotation gives the principal component scores
Called “rotation” because it represents a rotation of the original coordinate system

pr.out$rotation

                PC1        PC2        PC3         PC4
Murder   -0.5358995 -0.4181809  0.3412327  0.64922780
Assault  -0.5831836 -0.1879856  0.2681484 -0.74340748
UrbanPop -0.2781909  0.8728062  0.3780158  0.13387773
Rape     -0.5434321  0.1673186 -0.8177779  0.08902432

There are 4 principal components
In general, number of informative PCs = min(n − 1, p)
prcomp() directly provides principal component scores
pr.out$x is a 50 × 4 matrix of scores
Each column corresponds to a principal component score vector

dim(pr.out$x)

[1] 50  4

The first two principal components can be visualized using a plot

biplot(pr.out, scale = 0)

In biplot(), scale = 0 scales arrows to represent loadings
Different scale values change interpretation of the biplot
PCA results are only unique up to a sign change
Plots may appear as mirror images due to sign ambiguity
Sign flipping can reproduce equivalent visualizations

pr.out$rotation = -pr.out$rotation
pr.out$x = -pr.out$x
biplot(pr.out, scale = 0)

prcomp() outputs the standard deviation of each principal component
These can be accessed via pr.out$sdev

pr.out$sdev

[1] 1.5748783 0.9948694 0.5971291 0.4164494

Variance explained by each principal component is obtained by squaring the standard deviations
Computed as: pr.var <- pr.out$sdev^2

pr.var <- pr.out$sdev^2
pr.var

[1] 2.4802416 0.9897652 0.3565632 0.1734301

Proportion of variance explained (PVE) is computed by dividing each component’s variance by total variance
Computed as: pve <- pr.var / sum(pr.var)

pve <- pr.var / sum(pr.var)
pve

[1] 0.62006039 0.24744129 0.08914080 0.04335752

First principal component explains ~62.0% of the variance
Second principal component explains ~24.7%
Remaining components explain smaller proportions
PVE and cumulative PVE can be visualized with plots

par(mfrow = c(1, 2))
plot(pve, xlab = "Principal Component",
    ylab = "Proportion of Variance Explained", ylim = c(0, 1),
    type = "b")
plot(cumsum(pve), xlab = "Principal Component",
    ylab = "Cumulative Proportion of Variance Explained",
    ylim = c(0, 1), type = "b")

Results are visualized in a plot (Figure 12.3)
cumsum() computes the cumulative sum of a numeric vector
Used to calculate cumulative proportion of variance explained

a <- c(1, 2, 8, -3)
cumsum(a)

[1]  1  3 11  8

Matrix Completion

Recreate analysis on USArrests dataset
Convert data frame to matrix
Center and scale each column (mean = 0, variance = 1)

X <- data.matrix(scale(USArrests))
pcob <- prcomp(X)
summary(pcob)

Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion  0.6201 0.8675 0.95664 1.00000

First principal component explains ~62% of the variance
PCA can be formulated as an optimization problem on centered data
Computing the first M principal components solves this problem
Singular Value Decomposition (SVD) is a general method to compute PCA

sX <- svd(X)
names(sX)

[1] "d" "u" "v"

round(sX$v, 3)

       [,1]   [,2]   [,3]   [,4]
[1,] -0.536 -0.418  0.341  0.649
[2,] -0.583 -0.188  0.268 -0.743
[3,] -0.278  0.873  0.378  0.134
[4,] -0.543  0.167 -0.818  0.089

svd() returns three components: u, d, and v
Matrix v corresponds to PCA loadings (up to a sign flip)

pcob$rotation

                PC1        PC2        PC3         PC4
Murder   -0.5358995 -0.4181809  0.3412327  0.64922780
Assault  -0.5831836 -0.1879856  0.2681484 -0.74340748
UrbanPop -0.2781909  0.8728062  0.3780158  0.13387773
Rape     -0.5434321  0.1673186 -0.8177779  0.08902432

Matrix u contains standardized principal component scores
Vector d contains singular values (related to standard deviations)
PCA scores can be recovered from SVD output
These scores are identical to those from prcomp()

t(sX$d * t(sX$u))

             [,1]        [,2]        [,3]         [,4]
 [1,] -0.97566045 -1.12200121  0.43980366  0.154696581
 [2,] -1.93053788 -1.06242692 -2.01950027 -0.434175454
 [3,] -1.74544285  0.73845954 -0.05423025 -0.826264240
 [4,]  0.13999894 -1.10854226 -0.11342217 -0.180973554
 [5,] -2.49861285  1.52742672 -0.59254100 -0.338559240
 [6,] -1.49934074  0.97762966 -1.08400162  0.001450164
 [7,]  1.34499236  1.07798362  0.63679250 -0.117278736
 [8,] -0.04722981  0.32208890  0.71141032 -0.873113315
 [9,] -2.98275967 -0.03883425  0.57103206 -0.095317042
[10,] -1.62280742 -1.26608838  0.33901818  1.065974459
[11,]  0.90348448  1.55467609 -0.05027151  0.893733198
[12,]  1.62331903 -0.20885253 -0.25719021 -0.494087852
[13,] -1.36505197  0.67498834  0.67068647 -0.120794916
[14,]  0.50038122  0.15003926 -0.22576277  0.420397595
[15,]  2.23099579  0.10300828 -0.16291036  0.017379470
[16,]  0.78887206  0.26744941 -0.02529648  0.204421034
[17,]  0.74331256 -0.94880748  0.02808429  0.663817237
[18,] -1.54909076 -0.86230011  0.77560598  0.450157791
[19,]  2.37274014 -0.37260865  0.06502225 -0.327138529
[20,] -1.74564663 -0.42335704  0.15566968 -0.553450589
[21,]  0.48128007  1.45967706  0.60337172 -0.177793902
[22,] -2.08725025  0.15383500 -0.38100046  0.101343128
[23,]  1.67566951  0.62590670 -0.15153200  0.066640316
[24,] -0.98647919 -2.36973712  0.73336290  0.213342049
[25,] -0.68978426  0.26070794 -0.37365033  0.223554811
[26,]  1.17353751 -0.53147851 -0.24440796  0.122498555
[27,]  1.25291625  0.19200440 -0.17380930  0.015733156
[28,] -2.84550542  0.76780502 -1.15168793  0.311354436
[29,]  2.35995585  0.01790055 -0.03648498 -0.032804291
[30,] -0.17974128  1.43493745  0.75677041  0.240936580
[31,] -1.96012351 -0.14141308 -0.18184598 -0.336121113
[32,] -1.66566662  0.81491072  0.63661186 -0.013348844
[33,] -1.11208808 -2.20561081  0.85489245 -0.944789648
[34,]  2.96215223 -0.59309738 -0.29824930 -0.251434626
[35,]  0.22369436  0.73477837  0.03082616  0.469152817
[36,]  0.30864928  0.28496113  0.01515592  0.010228476
[37,] -0.05852787  0.53596999 -0.93038718 -0.235390872
[38,]  0.87948680  0.56536050  0.39660218  0.355452378
[39,]  0.85509072  1.47698328  1.35617705 -0.607402746
[40,] -1.30744986 -1.91397297  0.29751723 -0.130145378
[41,]  1.96779669 -0.81506822 -0.38538073 -0.108470512
[42,] -0.98969377 -0.85160534 -0.18619262  0.646302674
[43,] -1.34151838  0.40833518  0.48712332  0.636731051
[44,]  0.54503180  1.45671524 -0.29077592 -0.081486749
[45,]  2.77325613 -1.38819435 -0.83280797 -0.143433697
[46,]  0.09536670 -0.19772785 -0.01159482  0.209246429
[47,]  0.21472339  0.96037394 -0.61859067 -0.218628161
[48,]  2.08739306 -1.41052627 -0.10372163  0.130583080
[49,]  2.05881199  0.60512507  0.13746933  0.182253407
[50,]  0.62310061 -0.31778662  0.23824049 -0.164976866

pcob$x

                       PC1         PC2         PC3          PC4
Alabama        -0.97566045 -1.12200121  0.43980366  0.154696581
Alaska         -1.93053788 -1.06242692 -2.01950027 -0.434175454
Arizona        -1.74544285  0.73845954 -0.05423025 -0.826264240
Arkansas        0.13999894 -1.10854226 -0.11342217 -0.180973554
California     -2.49861285  1.52742672 -0.59254100 -0.338559240
Colorado       -1.49934074  0.97762966 -1.08400162  0.001450164
Connecticut     1.34499236  1.07798362  0.63679250 -0.117278736
Delaware       -0.04722981  0.32208890  0.71141032 -0.873113315
Florida        -2.98275967 -0.03883425  0.57103206 -0.095317042
Georgia        -1.62280742 -1.26608838  0.33901818  1.065974459
Hawaii          0.90348448  1.55467609 -0.05027151  0.893733198
Idaho           1.62331903 -0.20885253 -0.25719021 -0.494087852
Illinois       -1.36505197  0.67498834  0.67068647 -0.120794916
Indiana         0.50038122  0.15003926 -0.22576277  0.420397595
Iowa            2.23099579  0.10300828 -0.16291036  0.017379470
Kansas          0.78887206  0.26744941 -0.02529648  0.204421034
Kentucky        0.74331256 -0.94880748  0.02808429  0.663817237
Louisiana      -1.54909076 -0.86230011  0.77560598  0.450157791
Maine           2.37274014 -0.37260865  0.06502225 -0.327138529
Maryland       -1.74564663 -0.42335704  0.15566968 -0.553450589
Massachusetts   0.48128007  1.45967706  0.60337172 -0.177793902
Michigan       -2.08725025  0.15383500 -0.38100046  0.101343128
Minnesota       1.67566951  0.62590670 -0.15153200  0.066640316
Mississippi    -0.98647919 -2.36973712  0.73336290  0.213342049
Missouri       -0.68978426  0.26070794 -0.37365033  0.223554811
Montana         1.17353751 -0.53147851 -0.24440796  0.122498555
Nebraska        1.25291625  0.19200440 -0.17380930  0.015733156
Nevada         -2.84550542  0.76780502 -1.15168793  0.311354436
New Hampshire   2.35995585  0.01790055 -0.03648498 -0.032804291
New Jersey     -0.17974128  1.43493745  0.75677041  0.240936580
New Mexico     -1.96012351 -0.14141308 -0.18184598 -0.336121113
New York       -1.66566662  0.81491072  0.63661186 -0.013348844
North Carolina -1.11208808 -2.20561081  0.85489245 -0.944789648
North Dakota    2.96215223 -0.59309738 -0.29824930 -0.251434626
Ohio            0.22369436  0.73477837  0.03082616  0.469152817
Oklahoma        0.30864928  0.28496113  0.01515592  0.010228476
Oregon         -0.05852787  0.53596999 -0.93038718 -0.235390872
Pennsylvania    0.87948680  0.56536050  0.39660218  0.355452378
Rhode Island    0.85509072  1.47698328  1.35617705 -0.607402746
South Carolina -1.30744986 -1.91397297  0.29751723 -0.130145378
South Dakota    1.96779669 -0.81506822 -0.38538073 -0.108470512
Tennessee      -0.98969377 -0.85160534 -0.18619262  0.646302674
Texas          -1.34151838  0.40833518  0.48712332  0.636731051
Utah            0.54503180  1.45671524 -0.29077592 -0.081486749
Vermont         2.77325613 -1.38819435 -0.83280797 -0.143433697
Virginia        0.09536670 -0.19772785 -0.01159482  0.209246429
Washington      0.21472339  0.96037394 -0.61859067 -0.218628161
West Virginia   2.08739306 -1.41052627 -0.10372163  0.130583080
Wisconsin       2.05881199  0.60512507  0.13746933  0.182253407
Wyoming         0.62310061 -0.31778662  0.23824049 -0.164976866

Although prcomp() could be used, svd() is used here for illustration
Randomly remove 20 entries from the 50 × 4 data matrix
Procedure:
- Select 20 rows (states) at random
- For each selected row, randomly remove one of the four variables
Ensures each row still has at least three observed values

nomit <- 20
set.seed(15)
ina <- sample(seq(50), nomit)
inb <- sample(1:4, nomit, replace = TRUE)
Xna <- X
index.na <- cbind(ina, inb)
Xna[index.na] <- NA

ina: indices of rows (states) with missing values
inb: indices of columns (features) with missing values
index.na: matrix combining row and column indices for missing entries
Demonstrates indexing a matrix using a matrix of indices
Implement Algorithm 12.1 for matrix completion
Define a function to approximate the matrix using svd()
This function is used in Step 2 of the algorithm
svd() is used instead of prcomp() for illustration purposes

fit.svd <- function(X, M = 1) {
   svdob <- svd(X)
   with(svdob,
       u[, 1:M, drop = FALSE] %*%
       (d[1:M] * t(v[, 1:M, drop = FALSE]))
     )
}

Explicit return() is not needed in R; last computed value is returned automatically
with() is used to simplify access to elements of svdob
Alternative: directly reference components (e.g., svdob$u, svdob$d, svdob$v) inside the function

svdob$u[, 1:M, drop = FALSE] %*%
  (svdob$d[1:M] * t(svdob$v[, 1:M, drop = FALSE]))

Step 1: initialize Xhat (≈ \tilde{\mathbf{X}})
Replace missing values with column means (computed from non-missing entries)

Xhat <- Xna
xbar <- colMeans(Xna, na.rm = TRUE)
Xhat[index.na] <- xbar[inb]

Prepare variables to track progress of the iterative algorithm
Initialize error measures and convergence threshold
Set up structures to monitor changes across iterations

thresh <- 1e-7
rel_err <- 1
iter <- 0
ismiss <- is.na(Xna)
mssold <- mean((scale(Xna, xbar, FALSE)[!ismiss])^2)
mss0 <- mean(Xna[!ismiss]^2)

ismiss: logical matrix indicating missing entries (TRUE = missing)
Allows separation of missing vs non-missing values
Error tracking:
- mss0: mean squared value of non-missing entries
- mssold: previous mean squared error
- mss: current mean squared error
Convergence criterion:
- Relative error = (mssold - mss) / mss0
- Stop when relative error < 1e-7
- Scaling ensures convergence is independent of data magnitude
Step 2 (iterative loop):
- 1. Approximate matrix using fit.svd() → Xapp
- 1. Update missing values in Xhat using Xapp
- 1. Compute new error and relative error
Implemented using a while() loop

while(rel_err > thresh) {
    iter <- iter + 1
    # Step 2(a)
    Xapp <- fit.svd(Xhat, M = 1)
    # Step 2(b)
    Xhat[ismiss] <- Xapp[ismiss]
    # Step 2(c)
    mss <- mean(((Xna - Xapp)[!ismiss])^2)
    rel_err <- (mssold - mss) / mss0
    mssold <- mss
    cat("Iter:", iter, "MSS:", mss,
      "Rel. Err:", rel_err, "\n")
    }

Iter: 1 MSS: 0.3821695 Rel. Err: 0.6194004 
Iter: 2 MSS: 0.3705046 Rel. Err: 0.01161265 
Iter: 3 MSS: 0.3692779 Rel. Err: 0.001221144 
Iter: 4 MSS: 0.3691229 Rel. Err: 0.0001543015 
Iter: 5 MSS: 0.3691008 Rel. Err: 2.199233e-05 
Iter: 6 MSS: 0.3690974 Rel. Err: 3.376005e-06 
Iter: 7 MSS: 0.3690969 Rel. Err: 5.465067e-07 
Iter: 8 MSS: 0.3690968 Rel. Err: 9.253082e-08

Algorithm converges after ~8 iterations
Relative error falls below threshold (1e-7)
Final mean squared error ≈ 0.369 (on non-missing entries)
Evaluate performance:
- Compute correlation between imputed values and true values

cor(Xapp[ismiss], X[ismiss])

[1] 0.6535043

Algorithm 12.1 implemented manually for learning purposes
For real applications, use softImpute package (CRAN)
Provides efficient and scalable matrix completion methods