Lab 5

Introduction to Statistical Learning - PISE

Author
Affiliation

Aldo Solari

Ca’ Foscari University of Venice

Unsupervised Learning

Principal Components Analysis

  • PCA is performed on the USArrests dataset (included in base R)
  • Rows represent the 50 U.S. states
  • States are listed in alphabetical order
states <- row.names(USArrests)
states
 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
 [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
 [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
[13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
[17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
[21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
[25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
[33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
[41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
[45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
[49] "Wisconsin"      "Wyoming"       
  • Columns of the dataset correspond to the four variables
names(USArrests)
[1] "Murder"   "Assault"  "UrbanPop" "Rape"    
  • Initial data exploration is performed
  • Variables have very different mean values
apply(USArrests, 2, mean)
  Murder  Assault UrbanPop     Rape 
   7.788  170.760   65.540   21.232 
  • apply() applies a function (e.g., mean()) to rows or columns of a dataset
  • Second argument specifies dimension (1 = rows, 2 = columns)
  • Data shows large scale differences: rapes ≈ 3× murders, assaults ≈ 8× rapes
  • apply() can also be used to compute variances of variables
apply(USArrests, 2, var)
    Murder    Assault   UrbanPop       Rape 
  18.97047 6945.16571  209.51878   87.72916 
  • Variables have very different variances
  • UrbanPop (percentage) is not comparable to crime rates (per 100,000 people)
  • Without scaling, PCA would be dominated by Assault (largest mean and variance)
  • Therefore, variables must be standardized (mean = 0, sd = 1) before PCA
  • PCA is performed using the prcomp() function in R
pr.out <- prcomp(USArrests, scale = TRUE)
  • prcomp() centers variables to mean 0 by default
  • scale = TRUE standardizes variables to standard deviation 1
  • Output of prcomp() includes several useful components
names(pr.out)
[1] "sdev"     "rotation" "center"   "scale"    "x"       
  • center: means of variables used for centering
  • scale: standard deviations used for scaling before PCA
pr.out$center
  Murder  Assault UrbanPop     Rape 
   7.788  170.760   65.540   21.232 
pr.out$scale
   Murder   Assault  UrbanPop      Rape 
 4.355510 83.337661 14.474763  9.366385 
  • rotation matrix contains the principal component loadings
  • Each column represents a principal component loading vector
  • Multiplying the data matrix X by rotation gives the principal component scores
  • Called “rotation” because it represents a rotation of the original coordinate system
pr.out$rotation
                PC1        PC2        PC3         PC4
Murder   -0.5358995 -0.4181809  0.3412327  0.64922780
Assault  -0.5831836 -0.1879856  0.2681484 -0.74340748
UrbanPop -0.2781909  0.8728062  0.3780158  0.13387773
Rape     -0.5434321  0.1673186 -0.8177779  0.08902432
  • There are 4 principal components
  • In general, number of informative PCs = min(n − 1, p)
  • prcomp() directly provides principal component scores
  • pr.out$x is a 50 × 4 matrix of scores
  • Each column corresponds to a principal component score vector
dim(pr.out$x)
[1] 50  4
  • The first two principal components can be visualized using a plot
biplot(pr.out, scale = 0)

  • In biplot(), scale = 0 scales arrows to represent loadings
  • Different scale values change interpretation of the biplot
  • PCA results are only unique up to a sign change
  • Plots may appear as mirror images due to sign ambiguity
  • Sign flipping can reproduce equivalent visualizations
pr.out$rotation = -pr.out$rotation
pr.out$x = -pr.out$x
biplot(pr.out, scale = 0)

  • prcomp() outputs the standard deviation of each principal component
  • These can be accessed via pr.out$sdev
pr.out$sdev
[1] 1.5748783 0.9948694 0.5971291 0.4164494
  • Variance explained by each principal component is obtained by squaring the standard deviations
  • Computed as: pr.var <- pr.out$sdev^2
pr.var <- pr.out$sdev^2
pr.var
[1] 2.4802416 0.9897652 0.3565632 0.1734301
  • Proportion of variance explained (PVE) is computed by dividing each component’s variance by total variance
  • Computed as: pve <- pr.var / sum(pr.var)
pve <- pr.var / sum(pr.var)
pve
[1] 0.62006039 0.24744129 0.08914080 0.04335752
  • First principal component explains ~62.0% of the variance
  • Second principal component explains ~24.7%
  • Remaining components explain smaller proportions
  • PVE and cumulative PVE can be visualized with plots
par(mfrow = c(1, 2))
plot(pve, xlab = "Principal Component",
    ylab = "Proportion of Variance Explained", ylim = c(0, 1),
    type = "b")
plot(cumsum(pve), xlab = "Principal Component",
    ylab = "Cumulative Proportion of Variance Explained",
    ylim = c(0, 1), type = "b")

  • Results are visualized in a plot (Figure 12.3)
  • cumsum() computes the cumulative sum of a numeric vector
  • Used to calculate cumulative proportion of variance explained
a <- c(1, 2, 8, -3)
cumsum(a)
[1]  1  3 11  8

Matrix Completion

  • Recreate analysis on USArrests dataset
  • Convert data frame to matrix
  • Center and scale each column (mean = 0, variance = 1)
X <- data.matrix(scale(USArrests))
pcob <- prcomp(X)
summary(pcob)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
  • First principal component explains ~62% of the variance
  • PCA can be formulated as an optimization problem on centered data
  • Computing the first M principal components solves this problem
  • Singular Value Decomposition (SVD) is a general method to compute PCA
sX <- svd(X)
names(sX)
[1] "d" "u" "v"
round(sX$v, 3)
       [,1]   [,2]   [,3]   [,4]
[1,] -0.536 -0.418  0.341  0.649
[2,] -0.583 -0.188  0.268 -0.743
[3,] -0.278  0.873  0.378  0.134
[4,] -0.543  0.167 -0.818  0.089
  • svd() returns three components: u, d, and v
  • Matrix v corresponds to PCA loadings (up to a sign flip)
pcob$rotation
                PC1        PC2        PC3         PC4
Murder   -0.5358995 -0.4181809  0.3412327  0.64922780
Assault  -0.5831836 -0.1879856  0.2681484 -0.74340748
UrbanPop -0.2781909  0.8728062  0.3780158  0.13387773
Rape     -0.5434321  0.1673186 -0.8177779  0.08902432
  • Matrix u contains standardized principal component scores
  • Vector d contains singular values (related to standard deviations)
  • PCA scores can be recovered from SVD output
  • These scores are identical to those from prcomp()
t(sX$d * t(sX$u))
             [,1]        [,2]        [,3]         [,4]
 [1,] -0.97566045 -1.12200121  0.43980366  0.154696581
 [2,] -1.93053788 -1.06242692 -2.01950027 -0.434175454
 [3,] -1.74544285  0.73845954 -0.05423025 -0.826264240
 [4,]  0.13999894 -1.10854226 -0.11342217 -0.180973554
 [5,] -2.49861285  1.52742672 -0.59254100 -0.338559240
 [6,] -1.49934074  0.97762966 -1.08400162  0.001450164
 [7,]  1.34499236  1.07798362  0.63679250 -0.117278736
 [8,] -0.04722981  0.32208890  0.71141032 -0.873113315
 [9,] -2.98275967 -0.03883425  0.57103206 -0.095317042
[10,] -1.62280742 -1.26608838  0.33901818  1.065974459
[11,]  0.90348448  1.55467609 -0.05027151  0.893733198
[12,]  1.62331903 -0.20885253 -0.25719021 -0.494087852
[13,] -1.36505197  0.67498834  0.67068647 -0.120794916
[14,]  0.50038122  0.15003926 -0.22576277  0.420397595
[15,]  2.23099579  0.10300828 -0.16291036  0.017379470
[16,]  0.78887206  0.26744941 -0.02529648  0.204421034
[17,]  0.74331256 -0.94880748  0.02808429  0.663817237
[18,] -1.54909076 -0.86230011  0.77560598  0.450157791
[19,]  2.37274014 -0.37260865  0.06502225 -0.327138529
[20,] -1.74564663 -0.42335704  0.15566968 -0.553450589
[21,]  0.48128007  1.45967706  0.60337172 -0.177793902
[22,] -2.08725025  0.15383500 -0.38100046  0.101343128
[23,]  1.67566951  0.62590670 -0.15153200  0.066640316
[24,] -0.98647919 -2.36973712  0.73336290  0.213342049
[25,] -0.68978426  0.26070794 -0.37365033  0.223554811
[26,]  1.17353751 -0.53147851 -0.24440796  0.122498555
[27,]  1.25291625  0.19200440 -0.17380930  0.015733156
[28,] -2.84550542  0.76780502 -1.15168793  0.311354436
[29,]  2.35995585  0.01790055 -0.03648498 -0.032804291
[30,] -0.17974128  1.43493745  0.75677041  0.240936580
[31,] -1.96012351 -0.14141308 -0.18184598 -0.336121113
[32,] -1.66566662  0.81491072  0.63661186 -0.013348844
[33,] -1.11208808 -2.20561081  0.85489245 -0.944789648
[34,]  2.96215223 -0.59309738 -0.29824930 -0.251434626
[35,]  0.22369436  0.73477837  0.03082616  0.469152817
[36,]  0.30864928  0.28496113  0.01515592  0.010228476
[37,] -0.05852787  0.53596999 -0.93038718 -0.235390872
[38,]  0.87948680  0.56536050  0.39660218  0.355452378
[39,]  0.85509072  1.47698328  1.35617705 -0.607402746
[40,] -1.30744986 -1.91397297  0.29751723 -0.130145378
[41,]  1.96779669 -0.81506822 -0.38538073 -0.108470512
[42,] -0.98969377 -0.85160534 -0.18619262  0.646302674
[43,] -1.34151838  0.40833518  0.48712332  0.636731051
[44,]  0.54503180  1.45671524 -0.29077592 -0.081486749
[45,]  2.77325613 -1.38819435 -0.83280797 -0.143433697
[46,]  0.09536670 -0.19772785 -0.01159482  0.209246429
[47,]  0.21472339  0.96037394 -0.61859067 -0.218628161
[48,]  2.08739306 -1.41052627 -0.10372163  0.130583080
[49,]  2.05881199  0.60512507  0.13746933  0.182253407
[50,]  0.62310061 -0.31778662  0.23824049 -0.164976866
pcob$x
                       PC1         PC2         PC3          PC4
Alabama        -0.97566045 -1.12200121  0.43980366  0.154696581
Alaska         -1.93053788 -1.06242692 -2.01950027 -0.434175454
Arizona        -1.74544285  0.73845954 -0.05423025 -0.826264240
Arkansas        0.13999894 -1.10854226 -0.11342217 -0.180973554
California     -2.49861285  1.52742672 -0.59254100 -0.338559240
Colorado       -1.49934074  0.97762966 -1.08400162  0.001450164
Connecticut     1.34499236  1.07798362  0.63679250 -0.117278736
Delaware       -0.04722981  0.32208890  0.71141032 -0.873113315
Florida        -2.98275967 -0.03883425  0.57103206 -0.095317042
Georgia        -1.62280742 -1.26608838  0.33901818  1.065974459
Hawaii          0.90348448  1.55467609 -0.05027151  0.893733198
Idaho           1.62331903 -0.20885253 -0.25719021 -0.494087852
Illinois       -1.36505197  0.67498834  0.67068647 -0.120794916
Indiana         0.50038122  0.15003926 -0.22576277  0.420397595
Iowa            2.23099579  0.10300828 -0.16291036  0.017379470
Kansas          0.78887206  0.26744941 -0.02529648  0.204421034
Kentucky        0.74331256 -0.94880748  0.02808429  0.663817237
Louisiana      -1.54909076 -0.86230011  0.77560598  0.450157791
Maine           2.37274014 -0.37260865  0.06502225 -0.327138529
Maryland       -1.74564663 -0.42335704  0.15566968 -0.553450589
Massachusetts   0.48128007  1.45967706  0.60337172 -0.177793902
Michigan       -2.08725025  0.15383500 -0.38100046  0.101343128
Minnesota       1.67566951  0.62590670 -0.15153200  0.066640316
Mississippi    -0.98647919 -2.36973712  0.73336290  0.213342049
Missouri       -0.68978426  0.26070794 -0.37365033  0.223554811
Montana         1.17353751 -0.53147851 -0.24440796  0.122498555
Nebraska        1.25291625  0.19200440 -0.17380930  0.015733156
Nevada         -2.84550542  0.76780502 -1.15168793  0.311354436
New Hampshire   2.35995585  0.01790055 -0.03648498 -0.032804291
New Jersey     -0.17974128  1.43493745  0.75677041  0.240936580
New Mexico     -1.96012351 -0.14141308 -0.18184598 -0.336121113
New York       -1.66566662  0.81491072  0.63661186 -0.013348844
North Carolina -1.11208808 -2.20561081  0.85489245 -0.944789648
North Dakota    2.96215223 -0.59309738 -0.29824930 -0.251434626
Ohio            0.22369436  0.73477837  0.03082616  0.469152817
Oklahoma        0.30864928  0.28496113  0.01515592  0.010228476
Oregon         -0.05852787  0.53596999 -0.93038718 -0.235390872
Pennsylvania    0.87948680  0.56536050  0.39660218  0.355452378
Rhode Island    0.85509072  1.47698328  1.35617705 -0.607402746
South Carolina -1.30744986 -1.91397297  0.29751723 -0.130145378
South Dakota    1.96779669 -0.81506822 -0.38538073 -0.108470512
Tennessee      -0.98969377 -0.85160534 -0.18619262  0.646302674
Texas          -1.34151838  0.40833518  0.48712332  0.636731051
Utah            0.54503180  1.45671524 -0.29077592 -0.081486749
Vermont         2.77325613 -1.38819435 -0.83280797 -0.143433697
Virginia        0.09536670 -0.19772785 -0.01159482  0.209246429
Washington      0.21472339  0.96037394 -0.61859067 -0.218628161
West Virginia   2.08739306 -1.41052627 -0.10372163  0.130583080
Wisconsin       2.05881199  0.60512507  0.13746933  0.182253407
Wyoming         0.62310061 -0.31778662  0.23824049 -0.164976866
  • Although prcomp() could be used, svd() is used here for illustration
  • Randomly remove 20 entries from the 50 × 4 data matrix
  • Procedure:
    • Select 20 rows (states) at random
    • For each selected row, randomly remove one of the four variables
  • Ensures each row still has at least three observed values
nomit <- 20
set.seed(15)
ina <- sample(seq(50), nomit)
inb <- sample(1:4, nomit, replace = TRUE)
Xna <- X
index.na <- cbind(ina, inb)
Xna[index.na] <- NA
  • ina: indices of rows (states) with missing values

  • inb: indices of columns (features) with missing values

  • index.na: matrix combining row and column indices for missing entries

  • Demonstrates indexing a matrix using a matrix of indices

  • Implement Algorithm 12.1 for matrix completion

  • Define a function to approximate the matrix using svd()

  • This function is used in Step 2 of the algorithm

  • svd() is used instead of prcomp() for illustration purposes

fit.svd <- function(X, M = 1) {
   svdob <- svd(X)
   with(svdob,
       u[, 1:M, drop = FALSE] %*%
       (d[1:M] * t(v[, 1:M, drop = FALSE]))
     )
}
  • Explicit return() is not needed in R; last computed value is returned automatically
  • with() is used to simplify access to elements of svdob
  • Alternative: directly reference components (e.g., svdob$u, svdob$d, svdob$v) inside the function
svdob$u[, 1:M, drop = FALSE] %*%
  (svdob$d[1:M] * t(svdob$v[, 1:M, drop = FALSE]))
  • Step 1: initialize Xhat (≈ \tilde{\mathbf{X}})
  • Replace missing values with column means (computed from non-missing entries)
Xhat <- Xna
xbar <- colMeans(Xna, na.rm = TRUE)
Xhat[index.na] <- xbar[inb]
  • Prepare variables to track progress of the iterative algorithm
  • Initialize error measures and convergence threshold
  • Set up structures to monitor changes across iterations
thresh <- 1e-7
rel_err <- 1
iter <- 0
ismiss <- is.na(Xna)
mssold <- mean((scale(Xna, xbar, FALSE)[!ismiss])^2)
mss0 <- mean(Xna[!ismiss]^2)
  • ismiss: logical matrix indicating missing entries (TRUE = missing)

  • Allows separation of missing vs non-missing values

  • Error tracking:

    • mss0: mean squared value of non-missing entries
    • mssold: previous mean squared error
    • mss: current mean squared error
  • Convergence criterion:

    • Relative error = (mssold - mss) / mss0
    • Stop when relative error < 1e-7
    • Scaling ensures convergence is independent of data magnitude
  • Step 2 (iterative loop):

      1. Approximate matrix using fit.svd()Xapp
      1. Update missing values in Xhat using Xapp
      1. Compute new error and relative error
  • Implemented using a while() loop

while(rel_err > thresh) {
    iter <- iter + 1
    # Step 2(a)
    Xapp <- fit.svd(Xhat, M = 1)
    # Step 2(b)
    Xhat[ismiss] <- Xapp[ismiss]
    # Step 2(c)
    mss <- mean(((Xna - Xapp)[!ismiss])^2)
    rel_err <- (mssold - mss) / mss0
    mssold <- mss
    cat("Iter:", iter, "MSS:", mss,
      "Rel. Err:", rel_err, "\n")
    }
Iter: 1 MSS: 0.3821695 Rel. Err: 0.6194004 
Iter: 2 MSS: 0.3705046 Rel. Err: 0.01161265 
Iter: 3 MSS: 0.3692779 Rel. Err: 0.001221144 
Iter: 4 MSS: 0.3691229 Rel. Err: 0.0001543015 
Iter: 5 MSS: 0.3691008 Rel. Err: 2.199233e-05 
Iter: 6 MSS: 0.3690974 Rel. Err: 3.376005e-06 
Iter: 7 MSS: 0.3690969 Rel. Err: 5.465067e-07 
Iter: 8 MSS: 0.3690968 Rel. Err: 9.253082e-08 
  • Algorithm converges after ~8 iterations

  • Relative error falls below threshold (1e-7)

  • Final mean squared error ≈ 0.369 (on non-missing entries)

  • Evaluate performance:

    • Compute correlation between imputed values and true values
cor(Xapp[ismiss], X[ismiss])
[1] 0.6535043
  • Algorithm 12.1 implemented manually for learning purposes
  • For real applications, use softImpute package (CRAN)
  • Provides efficient and scalable matrix completion methods