Lab 1

Introduction to Statistical Learning - PISE

Author: Aldo Solari
Affiliation: Ca’ Foscari University of Venice

Introduction to R

  • In this lab we learn basic R commands by running them.
  • The best way to learn R is to try the commands yourself.
  • Download R: http://cran.r-project.org/
  • Recommended IDE: RStudio (desktop or cloud version)
    • http://rstudio.com/

Basic Commands

  • R uses functions to perform operations
  • General syntax: funcname(input1, input2, ...)
  • Inputs are called arguments
  • A function can have any number of arguments
  • Example: c() (concatenate) creates a vector
  • Values inside c() are joined into one vector
  • Assign the result to an object (e.g., x)
  • Typing x prints the vector
x <- c(1, 3, 2, 5)
x
[1] 1 3 2 5
  • > is just the R prompt (not part of the command)
  • It indicates that R is ready for input
  • You can also assign values using = instead of <-
x = c(1, 6, 2)
x
[1] 1 6 2
y = c(1, 4, 3)
  • Use the up arrow to recall and edit previous commands

  • Type ?funcname to open the help page for a function

  • R adds vectors element-wise

  • Vectors should have the same length; otherwise R recycles the shorter one

  • Check length using length()

length(x)
[1] 3
length(y)
[1] 3
x + y
[1]  2 10  5
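The element-wise rule can be checked directly on small vectors. The sketch below uses made-up values and also shows what recycling does when the lengths differ.

```r
# Element-wise addition of two vectors of equal length
c(1, 6, 2) + c(1, 4, 3)        # 2 10 5

# Unequal lengths: the shorter vector is recycled to match the longer one
c(1, 2, 3, 4) + c(10, 20)      # 11 22 13 24
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually a sign of a mistake.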
  • ls() lists all objects currently in memory
  • rm() removes objects you no longer need
ls()
[1] "x" "y"
rm(x, y)
ls()
character(0)
  • Remove all objects at once with rm(list = ls())
rm(list = ls())
  • matrix() creates a matrix of numbers
  • Use ?matrix to read the documentation
?matrix
  • matrix() has several arguments
  • Main ones: data, nrow, ncol
  • Create a simple matrix using these three inputs
x <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
x
     [,1] [,2]
[1,]    1    3
[2,]    2    4
  • Argument names (data=, nrow=, ncol=) can be omitted
  • Order matters if names are not specified
x <- matrix(c(1, 2, 3, 4), 2, 2)
  • Omitting argument names gives the same result (if order is correct)
  • Specifying names improves clarity and avoids mistakes
  • By default, matrix() fills entries by column
  • Use byrow = TRUE to fill entries by row
matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
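A quick way to see the two fill orders side by side: filling by row gives the transpose of filling by column. The object names by_col and by_row are chosen only for this sketch.

```r
# Column-wise fill (the default) and row-wise fill of the same data
by_col <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
by_row <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)

# Transposing one gives the other
identical(by_row, t(by_col))   # TRUE
```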
  • If not assigned, the matrix is printed but not saved
  • sqrt() applies element-wise to vectors or matrices
  • x^2 raises each element of x to the power 2
  • Any power is allowed (including fractional or negative)
sqrt(x)
         [,1]     [,2]
[1,] 1.000000 1.732051
[2,] 1.414214 2.000000
x^2
     [,1] [,2]
[1,]    1    9
[2,]    4   16
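Fractional and negative powers work element by element in the same way. A minimal sketch, using a separate matrix m so that x is left untouched:

```r
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)

# A fractional power: m^0.5 matches sqrt(m) element by element
all.equal(m^0.5, sqrt(m))      # TRUE

# A negative power: m^-1 is the element-wise reciprocal (not the matrix inverse)
m^-1
all.equal(m^-1, 1 / m)         # TRUE
```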
  • rnorm(n) generates n random normal variables
  • Each call produces different values
  • Create two related vectors x and y
  • Use cor() to compute their correlation
x <- rnorm(50)
y <- x + rnorm(50, mean = 50, sd = .1)
cor(x, y)
[1] 0.9920188
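Why is the correlation so close to 1 even though the noise has mean 50? The sketch below, with an arbitrary seed and the hypothetical names x0 and noise, suggests that a constant shift leaves the correlation unchanged, while larger noise weakens it.

```r
set.seed(1)                         # arbitrary seed, used only in this sketch
x0    <- rnorm(50)
noise <- rnorm(50, mean = 0, sd = 0.1)

# Shifting y by a constant does not change the correlation
cor(x0, x0 + noise)
cor(x0, x0 + noise + 50)

# Ten times more noise gives a clearly weaker linear relationship
cor(x0, x0 + 10 * noise)
```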
  • By default, rnorm() generates standard normal variables
    • Mean = 0
    • Standard deviation = 1
  • Modify using mean and sd arguments
  • Use set.seed(integer) for reproducibility
  • The seed ensures the same random numbers are generated
set.seed(1303)
rnorm(50)
 [1] -1.1439763145  1.3421293656  2.1853904757  0.5363925179  0.0631929665
 [6]  0.5022344825 -0.0004167247  0.5658198405 -0.5725226890 -1.1102250073
[11] -0.0486871234 -0.6956562176  0.8289174803  0.2066528551 -0.2356745091
[16] -0.5563104914 -0.3647543571  0.8623550343 -0.6307715354  0.3136021252
[21] -0.9314953177  0.8238676185  0.5233707021  0.7069214120  0.4202043256
[26] -0.2690521547 -1.5103172999 -0.6902124766 -0.1434719524 -1.0135274099
[31]  1.5732737361  0.0127465055  0.8726470499  0.4220661905 -0.0188157917
[36]  2.6157489689 -0.6931401748 -0.2663217810 -0.7206364412  1.3677342065
[41]  0.2640073322  0.6321868074 -1.3306509858  0.0268888182  1.0406363208
[46]  1.3120237985 -0.0300020767 -0.2500257125  0.0234144857  1.6598706557
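The effect of the seed can be verified directly. In this sketch (the object names are illustrative), resetting the seed reproduces the same draws, whereas drawing again without a reset continues the sequence.

```r
set.seed(1303)
first <- rnorm(5)

set.seed(1303)        # resetting the seed restarts the same sequence
second <- rnorm(5)

third <- rnorm(5)     # no reset: the sequence simply continues

identical(first, second)   # TRUE
identical(first, third)    # FALSE
```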
  • Even with the same seed, different R versions may produce small differences

  • mean() computes the sample mean

  • var() computes the sample variance (denominator n - 1)

  • sqrt(var(x)) gives the standard deviation

  • Alternatively, use sd() directly

set.seed(3)
y <- rnorm(100)
mean(y)
[1] 0.01103557
var(y)
[1] 0.7328675
sqrt(var(y))
[1] 0.8560768
sd(y)
[1] 0.8560768
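To make the link between var() and sd() concrete, the sample variance can be written out by hand. This is a sketch with a separate vector z; var_by_hand is a name introduced here for illustration.

```r
set.seed(3)
z <- rnorm(100)
n <- length(z)

# Sample variance with the n - 1 denominator, written out by hand
var_by_hand <- sum((z - mean(z))^2) / (n - 1)

all.equal(var_by_hand, var(z))          # TRUE
all.equal(sqrt(var_by_hand), sd(z))     # TRUE
```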

Graphics

  • plot() is the main function for basic graphics in R
  • plot(x, y) creates a scatterplot
  • Additional arguments customize the plot
    • Example: xlab sets the x-axis label
  • Use ?plot to see all available options
x <- rnorm(100)
y <- rnorm(100)
plot(x, y)

plot(x, y, xlab = "this is the x-axis",
    ylab = "this is the y-axis",
    main = "Plot of X vs Y")

  • Save plots by opening a graphics device
  • Use pdf() to create a PDF file
  • Use jpeg() to create a JPEG file
  • The function depends on the desired file format
pdf("Figure.pdf")
plot(x, y, col = "green")
dev.off()
quartz_off_screen 
                2 
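The open, plot, close pattern can be tried without cluttering the working directory. The sketch below writes to a temporary file whose path is generated by tempfile(); out_file is a name assumed for this example.

```r
# Write a plot to a temporary PDF file (the path is generated, not fixed)
out_file <- tempfile(fileext = ".pdf")

pdf(out_file)                        # open the device: plotting now goes to the file
plot(1:10, (1:10)^2, col = "green")
dev.off()                            # close the device so the file is completed

file.exists(out_file)                # TRUE
```

The text printed by dev.off() names the device that becomes active afterwards, so it can differ between operating systems.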
  • dev.off() closes the graphics device, which finishes writing the file

  • Alternatively, copy the plot window and paste into another document

  • seq() generates sequences of numbers

  • seq(a, b) creates integers from a to b

  • seq(0, 1, length.out = 10) creates 10 equally spaced values (length = 10 also works, through partial argument matching)

  • 3:11 is shorthand for seq(3, 11)

x <- seq(1, 10)
x
 [1]  1  2  3  4  5  6  7  8  9 10
x <- 1:10
x
 [1]  1  2  3  4  5  6  7  8  9 10
x <- seq(-pi, pi, length = 50)
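The different ways of calling seq() can be compared in a few lines. The names s1 and s2 are used here so that x keeps its value.

```r
# Ten equally spaced values from 0 to 1 (length.out is the full argument name)
s1 <- seq(0, 1, length.out = 10)
length(s1)            # 10
diff(s1)              # constant step of 1/9

# Fixing the step size instead of the number of values
s2 <- seq(0, 1, by = 0.25)
s2                    # 0.00 0.25 0.50 0.75 1.00

# The colon operator and seq() agree for integer ranges
identical(3:11, seq(3, 11))   # TRUE
```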
  • contour() creates a contour plot, a two-dimensional representation of three-dimensional data

  • Similar to a topographical map

  • Required inputs:

    • Vector of x values
    • Vector of y values
    • Matrix of z values (for each (x, y) pair)
  • Additional arguments allow customization

  • Use ?contour for documentation

y <- x
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))
contour(x, y, f)
contour(x, y, f, nlevels = 45, add = TRUE)

fa <- (f - t(f)) / 2
contour(x, y, fa, nlevels = 15)
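The z matrix is built with outer(), which evaluates the function at every (x, y) combination. A small sketch on tiny grids (u, v, g, sq and sq_a are illustrative names, not part of the lab) shows how the entries are laid out and why fa is antisymmetric.

```r
u <- c(0, 1, 2)                    # small illustrative grids
v <- c(0, pi)

# Entry [i, j] of the result is the function applied to u[i] and v[j]
g <- outer(u, v, function(a, b) cos(b) / (1 + a^2))
dim(g)                             # 3 2
g[2, 1]                            # cos(0) / (1 + 1^2) = 0.5

# For a square matrix, (sq - t(sq)) / 2 equals minus its own transpose
sq   <- outer(u, u, function(a, b) cos(b) / (1 + a^2))
sq_a <- (sq - t(sq)) / 2
all.equal(sq_a, -t(sq_a))          # TRUE
```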

  • image() produces a color-coded plot (heatmap)

  • Colors represent the z values

  • Often used for temperature-style maps

  • persp() creates a 3D surface plot

  • theta and phi control the viewing angles

image(x, y, fa)

persp(x, y, fa)

persp(x, y, fa, theta = 30)

persp(x, y, fa, theta = 30, phi = 20)

persp(x, y, fa, theta = 30, phi = 70)

persp(x, y, fa, theta = 30, phi = 40)

Indexing Data

  • Often we need to examine part of a data set
  • Suppose the data are stored in a matrix A
A <- matrix(1:16, 4, 4)
A
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
  • A[2, 3] selects the element in row 2, column 3
  • First index = row
  • Second index = column
A[2, 3]
[1] 10
  • Use vectors or ranges to select multiple rows and/or columns
A[c(1, 3), c(2, 4)]
     [,1] [,2]
[1,]    5   13
[2,]    7   15
A[1:3, 2:4]
     [,1] [,2] [,3]
[1,]    5    9   13
[2,]    6   10   14
[3,]    7   11   15
A[1:2, ]
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
A[, 1:2]
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8
  • Leaving one index empty selects all rows or all columns
  • A[1:2, ] → all columns
  • A[, 1:2] → all rows
  • A single row or column is treated as a vector
A[1, ]
[1]  1  5  9 13
  • Use a negative index (-) to exclude rows or columns
  • A[-1, ] removes row 1
  • A[, -1] removes column 1
A[-c(1, 3), ]
     [,1] [,2] [,3] [,4]
[1,]    2    6   10   14
[2,]    4    8   12   16
A[-c(1, 3), -c(1, 3, 4)]
[1] 6 8
  • dim() returns the dimensions of a matrix
  • Output format: (number of rows, number of columns)
dim(A)
[1] 4 4
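Because a single row or column is returned as a vector, its dimensions are lost. If a matrix is needed, the drop = FALSE argument of the indexing operator keeps them. A minimal sketch on a copy of the matrix called M:

```r
M <- matrix(1:16, 4, 4)

row1_vec <- M[1, ]                  # dimensions are dropped: a plain vector
row1_mat <- M[1, , drop = FALSE]    # dimensions are kept: a 1 x 4 matrix

dim(row1_vec)                       # NULL
dim(row1_mat)                       # 1 4

# Negative indices and dim() together: dropping one row leaves a 3 x 4 matrix
dim(M[-1, ])                        # 3 4
```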

Loading Data

  • First step in analysis: import the data

  • Use read.table() to load text files

  • Use write.table() to export data

  • Check the working directory before loading files

  • Load the Auto data set from Auto.data

  • Data are stored as a data frame

  • Use View() to inspect the data

  • Use head() to display the first rows

Auto <- read.table("../data/Auto.data")
View(Auto)
head(Auto)
    V1        V2           V3         V4     V5           V6   V7     V8
1  mpg cylinders displacement horsepower weight acceleration year origin
2 18.0         8        307.0      130.0  3504.         12.0   70      1
3 15.0         8        350.0      165.0  3693.         11.5   70      1
4 18.0         8        318.0      150.0  3436.         11.0   70      1
5 16.0         8        304.0      150.0  3433.         12.0   70      1
6 17.0         8        302.0      140.0  3449.         10.5   70      1
                         V9
1                      name
2 chevrolet chevelle malibu
3         buick skylark 320
4        plymouth satellite
5             amc rebel sst
6               ford torino
  • Auto.data is a plain text file

  • It can be opened with a text editor or Excel before importing

  • Data were loaded incorrectly because:

    • R treated variable names as data
    • Missing values are coded as ?, which R does not treat as missing by default
  • Use header = TRUE to specify that the first row contains variable names

  • Use na.strings = "?" to define missing values

  • Missing values are common in real data sets

Auto <- read.table("../data/Auto.data", header = TRUE,
    na.strings = "?", stringsAsFactors = TRUE)
View(Auto)
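The effect of header and na.strings can be reproduced without the Auto file. The sketch below writes a tiny made-up file to a temporary location (its contents and the names tmp, raw and ok are assumptions for illustration) and reads it both ways.

```r
# A tiny stand-in file: a header line, then two records, one with a missing value
tmp <- tempfile(fileext = ".data")
writeLines(c("mpg horsepower name",
             "18.0 130 chevelle",
             "25.0 ? pinto"), tmp)

raw <- read.table(tmp)                                   # header read as data
ok  <- read.table(tmp, header = TRUE, na.strings = "?",
                  stringsAsFactors = TRUE)

dim(raw)                 # 3 3: the header counts as a row
dim(ok)                  # 2 3
is.na(ok$horsepower)     # FALSE TRUE
class(ok$name)           # "factor"
```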
  • stringsAsFactors = TRUE converts character variables into factors

  • Each distinct string becomes a separate level

  • To import Excel data:

    • Save as CSV (comma-separated values)
    • Use read.csv() in R
Auto <- read.csv("../data/Auto.csv", na.strings = "?", stringsAsFactors = TRUE)
View(Auto)
dim(Auto)
[1] 397   9
Auto[1:4, ]
  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
4  16         8          304        150   3433         12.0   70      1
                       name
1 chevrolet chevelle malibu
2         buick skylark 320
3        plymouth satellite
4             amc rebel sst
  • dim() returns the data dimensions
    • 397 observations (rows)
    • 9 variables (columns)
  • Several methods exist to handle missing data
  • Here, only 5 rows contain missing values
  • Use na.omit() to remove those rows
Auto <- na.omit(Auto)
dim(Auto)
[1] 392   9
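Before dropping rows it can help to count where the missing values are. A sketch on a made-up data frame (cars_small and cars_clean are hypothetical names):

```r
# A made-up data frame with one missing value
cars_small <- data.frame(mpg = c(18, 25, 30),
                         horsepower = c(130, NA, 70))

colSums(is.na(cars_small))        # missing values per column
sum(!complete.cases(cars_small))  # rows with at least one missing value: 1

cars_clean <- na.omit(cars_small)
dim(cars_clean)                   # 2 2
```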
  • Once the data are loaded correctly, use names() to view the variable names
names(Auto)
[1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
[6] "acceleration" "year"         "origin"       "name"        

Additional Graphical and Numerical Summaries

  • Use plot() to create scatterplots of quantitative variables
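  • Typing the variable names alone gives an error: cylinders and mpg are columns of Auto, not objects in the workspace
  • R must be told which data set contains the variables
plot(cylinders, mpg)  # error: object 'cylinders' not found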
  • Access variables using dataset$variable
  • Example: Auto$mpg
  • Alternatively, use attach(Auto) to access variables directly
plot(Auto$cylinders, Auto$mpg)

attach(Auto)
plot(cylinders, mpg)
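A third option, besides $ and attach(), is with(), which evaluates a single expression inside a data frame and leaves the search path unchanged. The sketch uses the built-in mtcars data as a stand-in for Auto; r_dollar and r_with are names assumed here.

```r
# mtcars ships with R and stands in for Auto in this sketch

# $ access: explicit about where each variable comes from
r_dollar <- cor(mtcars$cyl, mtcars$mpg)

# with(): the expression is evaluated using the columns of the data frame
r_with <- with(mtcars, cor(cyl, mpg))

all.equal(r_dollar, r_with)   # TRUE
```

The same pattern should work for plotting, for instance with(Auto, plot(cylinders, mpg)).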

  • cylinders is stored as numeric → treated as quantitative
  • Since it has few distinct values, it can be treated as qualitative
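  • Use as.factor() to convert numeric to categorical; assigning the result to cylinders creates a new object in the workspace and leaves Auto$cylinders unchanged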
cylinders <- as.factor(cylinders)
  • If the x-axis variable is qualitative, plot() produces a boxplot
  • Boxplots summarize distributions by category
  • Additional arguments can customize the plot
plot(cylinders, mpg)

plot(cylinders, mpg, col = "red")

plot(cylinders, mpg, col = "red", varwidth = TRUE)

plot(cylinders, mpg, col = "red", varwidth = TRUE,
    horizontal = TRUE)

plot(cylinders, mpg, col = "red", varwidth = TRUE,
    xlab = "cylinders", ylab = "MPG")
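What the conversion changes, and what the boxplots summarize, can be inspected numerically. In this sketch the built-in mtcars data stand in for Auto, and cyl_num and cyl_fac are illustrative names.

```r
# mtcars$cyl plays the role of cylinders here (an assumption for illustration)
cyl_num <- mtcars$cyl
cyl_fac <- as.factor(cyl_num)

class(cyl_num)            # "numeric"
class(cyl_fac)            # "factor"
levels(cyl_fac)           # "4" "6" "8"
table(cyl_fac)            # number of cars in each category

# The thick line inside each box of a boxplot is the group median
tapply(mtcars$mpg, cyl_fac, median)
```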

  • Use hist() to create a histogram
  • col = 2 selects the second colour of the current palette, a shade of red (identical to col = "red" only in R versions before 4.0)
hist(mpg)

hist(mpg, col = 2)

hist(mpg, col = 2, breaks = 15)
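When breaks is a single number, hist() treats it as a suggestion and picks convenient bin boundaries, so the number of bars need not be exactly 15. The bins can be inspected without drawing by setting plot = FALSE. The sketch uses mtcars$mpg as a stand-in for the Auto variable.

```r
# mtcars$mpg stands in for Auto's mpg (an assumption for illustration)
h <- hist(mtcars$mpg, breaks = 15, plot = FALSE)   # compute bins without drawing

h$breaks                 # bin boundaries actually used
h$counts                 # number of observations per bin
length(h$counts)         # number of bins; not necessarily 15

sum(h$counts) == length(mtcars$mpg)   # TRUE: every observation falls in one bin
```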

  • pairs() creates a scatterplot matrix
  • Shows scatterplots for all pairs of variables
  • Can specify a subset of variables
pairs(Auto)

pairs(
    ~ mpg + displacement + horsepower + weight + acceleration,
    data = Auto
  )

  • identify() works with plot() for interactive labeling
  • Arguments: x-variable, y-variable, label variable
  • Click points on the plot to display labels
  • Press Esc to stop
  • The numbers printed by identify() are the row indices of the selected points; integer(0) means that no point was selected
plot(horsepower, mpg)
identify(horsepower, mpg, name)

integer(0)
  • summary() provides a numerical summary of each variable
  • Output depends on the variable type (numeric or factor)
summary(Auto)
      mpg          cylinders      displacement     horsepower        weight    
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
                                                                               
  acceleration        year           origin                      name    
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
 1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
 Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
 Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
 3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
 Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
                                                 (Other)           :365  
  • For qualitative variables, summary() shows counts per category
  • You can also summarize a single variable (e.g., summary(mpg))
summary(mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9.00   17.00   22.75   23.45   29.00   46.60 
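The two kinds of output can be compared on a small example. Here the built-in mtcars data stand in for Auto, and num_var and fac_var are names chosen for this sketch.

```r
num_var <- mtcars$mpg                 # numeric variable (stand-in data)
fac_var <- as.factor(mtcars$cyl)      # qualitative variable

summary(num_var)     # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
summary(fac_var)     # counts per level

# The same quantities are available individually
quantile(num_var, c(0.25, 0.5, 0.75))
mean(num_var)
```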
  • Use q() to quit R
  • On exit, R offers to save the current workspace
  • Save command history with savehistory()
  • Reload history with loadhistory()