Lab 1

Introduction to Statistical Learning - PISE

Author: Aldo Solari

Affiliation: Ca’ Foscari University of Venice

Introduction to R

In this lab we learn basic R commands by running them.
The best way to learn R is to try the commands yourself.
Download R: http://cran.r-project.org/
Recommended IDE: RStudio (desktop or cloud version)
- http://rstudio.com/

Basic Commands

R uses functions to perform operations
General syntax: funcname(input1, input2, ...)
Inputs are called arguments
A function can have any number of arguments
Example: c() (concatenate) creates a vector
Values inside c() are joined into one vector
Assign the result to an object (e.g., x)
Typing x prints the vector

x <- c(1, 3, 2, 5)
x

[1] 1 3 2 5

> is just the R prompt (not part of the command)
It indicates that R is ready for input
You can also assign values using = instead of <-

x = c(1, 6, 2)
x

[1] 1 6 2

y = c(1, 4, 3)

Use the up arrow to recall and edit previous commands
Type ?funcname to open the help page for a function
R adds vectors element-wise, so the vectors should have the same length (a shorter vector is recycled to match the longer one)
Check length using length()

length(x)

[1] 3

length(y)

[1] 3

x + y

[1]  2 10  5
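
As a small illustration of element-wise addition, the sketch below uses two hypothetical vectors a and b that are not part of the lab. It also shows what happens when the lengths differ: R recycles the shorter vector, which is rarely intended, so checking length() first is a sensible habit.

```r
# Hypothetical vectors, used only for illustration
a <- c(1, 6, 2)
b <- c(1, 4, 3)

# Same length: elements are added position by position
same_len_sum <- a + b                        # 2 10 5

# Different lengths: the shorter vector is recycled as 10, 20, 10, 20
recycled_sum <- c(1, 2, 3, 4) + c(10, 20)    # 11 22 13 24
```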

ls() lists all objects currently in memory
rm() removes objects you no longer need

ls()

[1] "x" "y"

rm(x, y)
ls()

character(0)

Remove all objects at once with rm(list = ls())

rm(list = ls())

matrix() creates a matrix of numbers
Use ?matrix to read the documentation

?matrix

matrix() has several arguments
Main ones: data, nrow, ncol
Create a simple matrix using these three inputs

x <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
x

     [,1] [,2]
[1,]    1    3
[2,]    2    4

Argument names (data=, nrow=, ncol=) can be omitted
Order matters if names are not specified

x <- matrix(c(1, 2, 3, 4), 2, 2)

Omitting argument names gives the same result (if order is correct)
Specifying names improves clarity and avoids mistakes
By default, matrix() fills entries by column
Use byrow = TRUE to fill entries by row

matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)

     [,1] [,2]
[1,]    1    2
[2,]    3    4
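
The two fill orders can be compared directly. The sketch below (object names by_col, by_row and reordered are illustrative only) checks that, for this square example, filling by row gives the transpose of filling by column, and that named arguments may be supplied in any order.

```r
# Fill by column (default) and by row
by_col <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
by_row <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)

# For this 2 x 2 example, filling by row is the transpose of filling by column
same_as_transpose <- identical(by_row, t(by_col))

# With names given, the order of the arguments does not matter
reordered <- matrix(ncol = 2, data = c(1, 2, 3, 4), nrow = 2)
same_as_default <- identical(reordered, by_col)
```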

If not assigned, the matrix is printed but not saved
sqrt() applies element-wise to vectors or matrices
x^2 raises each element of x to the power 2
Any power is allowed (including fractional or negative)

sqrt(x)

         [,1]     [,2]
[1,] 1.000000 1.732051
[2,] 1.414214 2.000000

x^2

     [,1] [,2]
[1,]    1    9
[2,]    4   16
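
A short sketch of fractional and negative powers, using a hypothetical matrix m with the same entries as x above. Note that m^-1 is the element-wise reciprocal, not the matrix inverse (which solve() computes).

```r
m <- matrix(c(1, 2, 3, 4), 2, 2)

half_power <- m^0.5    # element-wise, same values as sqrt(m)
reciprocal <- m^-1     # element-wise 1 / m
inverse    <- solve(m) # matrix inverse, a different object
```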

rnorm(n) generates n random normal variables
Each call produces different values
Create two related vectors x and y
Use cor() to compute their correlation

x <- rnorm(50)
y <- x + rnorm(50, mean = 50, sd = .1)
cor(x, y)

[1] 0.9960593
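
To see what cor() computes, the sketch below rebuilds the sample correlation from cov() and sd(). The seed and the names u and v are arbitrary choices for illustration, so the value differs from the one printed above.

```r
set.seed(1)                                  # arbitrary seed, for illustration only
u <- rnorm(50)
v <- u + rnorm(50, mean = 50, sd = 0.1)

# Sample correlation = covariance divided by the product of standard deviations
manual_cor  <- cov(u, v) / (sd(u) * sd(v))
builtin_cor <- cor(u, v)
```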

By default, rnorm() generates standard normal variables
- Mean = 0
- Standard deviation = 1
Modify using mean and sd arguments
Use set.seed(integer) for reproducibility
The seed ensures the same random numbers are generated

set.seed(1303)
rnorm(50)

 [1] -1.1439763145  1.3421293656  2.1853904757  0.5363925179  0.0631929665
 [6]  0.5022344825 -0.0004167247  0.5658198405 -0.5725226890 -1.1102250073
[11] -0.0486871234 -0.6956562176  0.8289174803  0.2066528551 -0.2356745091
[16] -0.5563104914 -0.3647543571  0.8623550343 -0.6307715354  0.3136021252
[21] -0.9314953177  0.8238676185  0.5233707021  0.7069214120  0.4202043256
[26] -0.2690521547 -1.5103172999 -0.6902124766 -0.1434719524 -1.0135274099
[31]  1.5732737361  0.0127465055  0.8726470499  0.4220661905 -0.0188157917
[36]  2.6157489689 -0.6931401748 -0.2663217810 -0.7206364412  1.3677342065
[41]  0.2640073322  0.6321868074 -1.3306509858  0.0268888182  1.0406363208
[46]  1.3120237985 -0.0300020767 -0.2500257125  0.0234144857  1.6598706557
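
A minimal check of what the seed does: resetting the same seed reproduces the same draws, while continuing without a reset gives new ones. The object names are illustrative.

```r
set.seed(1303)
first_draw <- rnorm(5)

set.seed(1303)              # reset to the same seed
second_draw <- rnorm(5)     # identical to first_draw

third_draw <- rnorm(5)      # no reset: the generator has moved on

same_after_reset   <- identical(first_draw, second_draw)
same_without_reset <- identical(second_draw, third_draw)
```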

Even with the same seed, different R versions may produce slightly different values
mean() computes the sample mean
var() computes the variance
sqrt(var(x)) gives the standard deviation
Alternatively, use sd() directly

set.seed(3)
y <- rnorm(100)
mean(y)

[1] 0.01103557

var(y)

[1] 0.7328675

sqrt(var(y))

[1] 0.8560768

sd(y)

[1] 0.8560768
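
For reference, var() uses the n - 1 denominator. The sketch below recomputes the variance and standard deviation by hand on a hypothetical vector z and can be compared with var(z) and sd(z).

```r
set.seed(3)
z <- rnorm(100)
n <- length(z)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((z - mean(z))^2) / (n - 1)
manual_sd  <- sqrt(manual_var)
```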

Graphics

plot() is the main function for basic graphics in R
plot(x, y) creates a scatterplot
Additional arguments customize the plot
- Example: xlab sets the x-axis label
Use ?plot to see all available options

x <- rnorm(100)
y <- rnorm(100)
plot(x, y)

plot(x, y, xlab = "this is the x-axis",
    ylab = "this is the y-axis",
    main = "Plot of X vs Y")

Save plots by opening a graphics device
Use pdf() to create a PDF file
Use jpeg() to create a JPEG file
The function depends on the desired file format

pdf("Figure.pdf")
plot(x, y, col = "green")
dev.off()

quartz_off_screen 
                2
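
The open, plot, close pattern can be sketched as follows. The file name and the use of tempdir() are assumptions made so that the example does not write into the working directory; the plotted values are arbitrary.

```r
# Hypothetical output path inside the session's temporary directory
out_file <- file.path(tempdir(), "Figure-sketch.pdf")

pdf(out_file)                          # open the PDF graphics device
plot(1:10, (1:10)^2, col = "green")    # draw into the file, not on screen
invisible(dev.off())                   # close the device; the file is now complete

saved <- file.exists(out_file)
```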

dev.off() closes the graphics device and finishes saving the plot
Alternatively, copy the plot window and paste into another document
seq() generates sequences of numbers
seq(a, b) creates the sequence from a to b in steps of 1
seq(0, 1, length = 10) creates 10 equally spaced values between 0 and 1 (length is matched to the full argument name length.out)
3:11 is shorthand for seq(3, 11)

x <- seq(1, 10)
x

 [1]  1  2  3  4  5  6  7  8  9 10

x <- 1:10
x

 [1]  1  2  3  4  5  6  7  8  9 10

x <- seq(-pi, pi, length = 50)
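
A brief sketch of the two ways to control a sequence, by step size or by number of points. The object names are illustrative.

```r
by_step   <- seq(0, 1, by = 0.25)         # fixed step:   0.00 0.25 0.50 0.75 1.00
by_length <- seq(0, 1, length.out = 5)    # fixed length: the same five values

colon_seq <- 3:11                         # shorthand for seq(3, 11)
grid      <- seq(-pi, pi, length.out = 50)
```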

contour() creates a contour plot (a 2D representation of three-dimensional data)
Similar to a topographical map
Required inputs:
- Vector of x values
- Vector of y values
- Matrix of z values (for each (x, y) pair)
Additional arguments allow customization
Use ?contour for documentation

y <- x
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))
contour(x, y, f)
contour(x, y, f, nlevels = 45, add = TRUE)

fa <- (f - t(f)) / 2
contour(x, y, fa, nlevels = 15)
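
The role of outer() can be checked numerically without plotting. The sketch below rebuilds the same objects and verifies that entry [i, j] of f is the function evaluated at (x[i], y[j]), and that fa is antisymmetric (fa equals minus its transpose). The indices 3 and 7 are arbitrary.

```r
x <- seq(-pi, pi, length.out = 50)
y <- x
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))

# outer() evaluates the function on every (x[i], y[j]) pair
entry_matches <- isTRUE(all.equal(f[3, 7], cos(y[7]) / (1 + x[3]^2)))

# (f - t(f)) / 2 is the antisymmetric part of f
fa <- (f - t(f)) / 2
antisymmetric <- isTRUE(all.equal(fa, -t(fa)))
```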

image() produces a color-coded plot (heatmap)
Colors represent the z values
Often used for temperature-style maps
persp() creates a 3D surface plot
theta and phi control the viewing angles

image(x, y, fa)

persp(x, y, fa)

persp(x, y, fa, theta = 30)

persp(x, y, fa, theta = 30, phi = 20)

persp(x, y, fa, theta = 30, phi = 70)

persp(x, y, fa, theta = 30, phi = 40)

Indexing Data

Often we need to examine part of a data set
Suppose the data are stored in a matrix A

A <- matrix(1:16, 4, 4)
A

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

A[2, 3] selects the element in row 2, column 3
First index = row
Second index = column

A[2, 3]

[1] 10

Use vectors or ranges to select multiple rows and/or columns

A[c(1, 3), c(2, 4)]

     [,1] [,2]
[1,]    5   13
[2,]    7   15

A[1:3, 2:4]

     [,1] [,2] [,3]
[1,]    5    9   13
[2,]    6   10   14
[3,]    7   11   15

A[1:2, ]

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14

A[, 1:2]

     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8

Leaving one index empty selects all rows or all columns
A[1:2, ] → all columns
A[, 1:2] → all rows
A single row or column is treated as a vector

A[1, ]

[1]  1  5  9 13

Use a negative index (-) to exclude rows or columns
A[-1, ] removes row 1
A[, -1] removes column 1

A[-c(1, 3), ]

     [,1] [,2] [,3] [,4]
[1,]    2    6   10   14
[2,]    4    8   12   16

A[-c(1, 3), -c(1, 3, 4)]

[1] 6 8
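
When a selection leaves a single row or column, R drops the matrix structure, as in the last output. A minimal sketch of keeping it with drop = FALSE (object names are illustrative):

```r
A <- matrix(1:16, 4, 4)

row_vec <- A[1, ]                                   # plain vector, no dimensions
row_mat <- A[1, , drop = FALSE]                     # stays a 1 x 4 matrix
col_mat <- A[-c(1, 3), -c(1, 3, 4), drop = FALSE]   # stays a 2 x 1 matrix
```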

dim() returns the dimensions of a matrix
Output format: (number of rows, number of columns)

dim(A)

[1] 4 4

Loading Data

First step in analysis: import the data
Use read.table() to load text files
Use write.table() to export data
Check the working directory before loading files
Load the Auto data set from Auto.data
Data are stored as a data frame
Use View() to inspect the data
Use head() to display the first rows

Auto <- read.table("../data/Auto.data")
View(Auto)
head(Auto)

    V1        V2           V3         V4     V5           V6   V7     V8
1  mpg cylinders displacement horsepower weight acceleration year origin
2 18.0         8        307.0      130.0  3504.         12.0   70      1
3 15.0         8        350.0      165.0  3693.         11.5   70      1
4 18.0         8        318.0      150.0  3436.         11.0   70      1
5 16.0         8        304.0      150.0  3433.         12.0   70      1
6 17.0         8        302.0      140.0  3449.         10.5   70      1
                         V9
1                      name
2 chevrolet chevelle malibu
3         buick skylark 320
4        plymouth satellite
5             amc rebel sst
6               ford torino

Auto.data is a plain text file
It can be opened with a text editor or Excel before importing
Data were loaded incorrectly because:
- R treated variable names as data
- Missing values are coded as ?
Use header = TRUE to specify that the first row contains variable names
Use na.strings = "?" to define missing values
Missing values are common in real data sets

Auto <- read.table("../data/Auto.data", header = TRUE, na.strings = "?",
    stringsAsFactors = TRUE)
View(Auto)

stringsAsFactors = TRUE converts character variables into factors
Each distinct string becomes a separate level
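
The effect of these arguments can be reproduced on a tiny made-up file. The sketch below writes three hypothetical rows to a temporary file (the values and car names are invented, not taken from Auto.data) and reads them back with and without the options.

```r
# Write a small whitespace-separated file with a header row and one "?" entry
tmp <- tempfile(fileext = ".data")
writeLines(c('mpg horsepower name',
             '18.0 130.0 "car a"',
             '25.0 ? "car b"',
             '16.0 150.0 "car c"'), tmp)

# Defaults: the header row is read as data and "?" stays a string
bad <- read.table(tmp)

# With the options: names come from the first row, "?" becomes NA,
# and the character column becomes a factor
good <- read.table(tmp, header = TRUE, na.strings = "?",
                   stringsAsFactors = TRUE)
```

Here bad has four rows and columns V1 to V3, while good has three rows, a numeric horsepower column with one NA, and a factor name column.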
To import Excel data:
- Save as CSV (comma-separated values)
- Use read.csv() in R

Auto <- read.csv("../data/Auto.csv", na.strings = "?", stringsAsFactors = TRUE)
View(Auto)
dim(Auto)

[1] 397   9

Auto[1:4, ]

  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
4  16         8          304        150   3433         12.0   70      1
                       name
1 chevrolet chevelle malibu
2         buick skylark 320
3        plymouth satellite
4             amc rebel sst

dim() returns the data dimensions
- 397 observations (rows)
- 9 variables (columns)
Several methods exist to handle missing data
Here, only 5 rows contain missing values
Use na.omit() to remove those rows

Auto <- na.omit(Auto)
dim(Auto)

[1] 392   9
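
Before dropping rows it can help to count how many are incomplete. A minimal sketch on a made-up data frame (toy is hypothetical, not the Auto data), using complete.cases():

```r
toy <- data.frame(mpg        = c(18, 15, NA, 16),
                  horsepower = c(130, NA, 150, 150))

# complete.cases() is TRUE for rows without any missing value
n_incomplete <- sum(!complete.cases(toy))   # rows with at least one NA

toy_clean <- na.omit(toy)                   # keeps only the complete rows
```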

Once the data are loaded correctly, use names() to view the variable names

names(Auto)

[1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
[6] "acceleration" "year"         "origin"       "name"

Additional Graphical and Numerical Summaries

Use plot() to create scatterplots of quantitative variables
Using variable names on their own gives an error, because R must be told which data set contains the variables
The call below fails with an object-not-found error

plot(cylinders, mpg)

Access variables using dataset$variable
Example: Auto$mpg
Alternatively, use attach(Auto) to access variables directly

plot(Auto$cylinders, Auto$mpg)

attach(Auto)
plot(cylinders, mpg)

cylinders is stored as numeric → treated as quantitative
Since it has few distinct values, it can be treated as qualitative
Use as.factor() to convert numeric to categorical

cylinders <- as.factor(cylinders)
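
One point worth knowing about attach(): assigning to an attached name creates a new object in the workspace and leaves the data frame itself unchanged. The sketch below shows this on a made-up data frame (toy, cyl and eff are hypothetical names chosen so that nothing from the lab is overwritten), together with the alternative of modifying the column directly.

```r
toy <- data.frame(cyl = c(4, 4, 6, 8), eff = c(30, 28, 20, 15))

attach(toy)
cyl <- as.factor(cyl)     # new object in the workspace; toy is not modified
detach(toy)

copy_is_factor   <- is.factor(cyl)
column_is_factor <- is.factor(toy$cyl)

# To change the data frame itself, assign to the column
toy$cyl <- as.factor(toy$cyl)
column_now_factor <- is.factor(toy$cyl)
```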

If the x-axis variable is qualitative, plot() produces a boxplot
Boxplots summarize distributions by category
Additional arguments can customize the plot

plot(cylinders, mpg)

plot(cylinders, mpg, col = "red")

plot(cylinders, mpg, col = "red", varwidth = TRUE)

plot(cylinders, mpg, col = "red", varwidth = TRUE,
    horizontal = TRUE)

plot(cylinders, mpg, col = "red", varwidth = TRUE,
    xlab = "cylinders", ylab = "MPG")

Use hist() to create a histogram
col = 2 selects the second color of the current palette (exactly "red" in R before 4.0, a similar red shade in later versions)

hist(mpg)

hist(mpg, col = 2)

hist(mpg, col = 2, breaks = 15)
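
When breaks is a single number, hist() treats it as a suggestion and chooses rounded break points, so the number of bars need not be exactly 15. The sketch below inspects the result without drawing, on simulated values (v is hypothetical, not the mpg variable).

```r
set.seed(1)                                # arbitrary seed
v <- rnorm(200, mean = 23, sd = 8)         # made-up values for illustration

h <- hist(v, breaks = 15, plot = FALSE)    # compute the histogram without plotting

n_bins <- length(h$counts)                 # number of bars actually used
h$breaks                                   # the break points chosen by hist()
```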

pairs() creates a scatterplot matrix
Shows scatterplots for all pairs of variables
Can specify a subset of variables

pairs(Auto)

pairs(
    ~ mpg + displacement + horsepower + weight + acceleration,
    data = Auto
)

identify() works with plot() for interactive labeling
Arguments: x-variable, y-variable, label variable
Click points on the plot to display labels
Press Esc to stop
Printed numbers correspond to row indices

plot(horsepower, mpg)
identify(horsepower, mpg, name)

integer(0)

summary() provides a numerical summary of each variable
Output depends on the variable type (numeric or factor)

summary(Auto)

      mpg          cylinders      displacement     horsepower        weight    
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
                                                                               
  acceleration        year           origin                      name    
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
 1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
 Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
 Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
 3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
 Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
                                                 (Other)           :365

For qualitative variables, summary() shows counts per category
You can also summarize a single variable (e.g., summary(mpg))

summary(mpg)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9.00   17.00   22.75   23.45   29.00   46.60
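
The pieces of a summary() can also be obtained individually. A minimal sketch on made-up data (v and grp are hypothetical): quantile() gives the quartiles of a numeric vector, and for a factor the summary is a table of counts per level.

```r
v <- c(9, 17, 22, 29, 46)                              # made-up numeric values
quartiles <- quantile(v, probs = c(0, 0.25, 0.5, 0.75, 1))
num_summary <- summary(v)                              # Min, quartiles, Mean, Max

grp <- factor(c("a", "b", "a", "c", "a"))              # made-up categories
counts <- summary(grp)                                 # counts per level
```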

Use q() to quit R
On exit, R asks whether to save the current workspace
Save command history with savehistory()
Reload history with loadhistory()