Introduction

Introduction to Statistical Learning - PISE

Aldo Solari

Ca’ Foscari University of Venice

Statistical Learning Problems

  • Predicting future values

  • Recommender Systems

  • Dimension Reduction

Predicting future values

Predicting the food delivery time

  • Machine learning models are mathematical equations that take inputs, called predictors, and try to estimate some future output value, called the outcome.

\underset{outcome}{Y} \leftarrow f(\underset{predictors}{X_1,\ldots,X_p})

  • For example, we want to predict how long it takes to deliver food ordered from a restaurant.

  • The outcome is the time from the initial order (in minutes).

  • There are multiple predictors, including:

    • the distance from the restaurant to the delivery location,
    • the date/time of the order,
    • which items were included in the order.

Food Delivery Time Data

  • The data are tabular, where the 31 variables (1 outcome + 30 predictors) are arranged in columns and the n=10012 observations in rows:
  time_to_delivery   hour day distance item_01 item_02 item_03 item_04 item_27
1          16.1106 11.899 Thu 5.069421       0       0       2       0       0
2          22.9466 19.230 Tue 5.938465       0       0       0       0       0
3          30.2882 18.374 Fri 3.315240       0       0       0       0       0
4          33.4266 15.836 Thu 9.607760       0       0       0       0       1
5          27.2255 19.619 Fri 4.055537       0       0       0       1       1
6          19.6459 12.952 Sat 5.391289       1       0       0       1       0
  • Note that the predictor values are known. For future data, the outcome is unknown; it is a machine learning model’s job to predict unknown outcome values.

Outcome Y

Predictor X_1

Regression function

  • A machine learning model has a defined mathematical prediction equation, called the regression function f(\cdot), defining exactly how the predictors X_1,\ldots,X_p relate to the outcome Y: Y \approx f(X_1,\ldots,X_p)

  • Here is a simple example of a regression function: the linear model with a single predictor (the distance X_1) and two unknown parameters \beta_0 and \beta_1 that have been estimated:

\begin{aligned} Y &\approx \hat{\beta}_0 + \hat{\beta}_1 X_1\\ \\ delivery\,\,time &\approx 17.557 + 1.781\,\times \,distance \end{aligned}
  • We could use this equation for new orders:

    • If we had placed an order at the restaurant (i.e., a zero distance), we predict that it would take about 17.6 minutes (the intercept).

    • If we were seven kilometers away, the predicted delivery time is 17.557 + 7\times 1.781 \approx 30 minutes.
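
A minimal sketch of how such a fit and prediction could be obtained in R, assuming the data are stored in a data frame called deliveries with the columns shown earlier:

# Fit a linear model of delivery time on distance
fit <- lm(time_to_delivery ~ distance, data = deliveries)
coef(fit)   # estimated intercept (beta_0) and slope (beta_1)

# Predicted delivery time for a new order seven kilometers away
predict(fit, newdata = data.frame(distance = 7))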

Predictor X_2

3D scatter plot

Regression plane

With two predictors, say X_1 and X_2, the linear regression function becomes a plane in the three-dimensional space: Y \approx \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2.

The fitted regression function is delivery\,\,time \approx -11.17 + 1.76 \times distance + 1.77 \times order\,\,time

If an order were placed at 12:00 with a distance of 7 km, the predicted delivery time would be -11.17 + 1.76 \times 7 + 1.77 \times 12 \approx 22.4 minutes.
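
A corresponding two-predictor sketch in R, again assuming a data frame called deliveries, with the hour column playing the role of the order time X_2:

# Linear model with two predictors: distance and order time (hour of the day)
fit2 <- lm(time_to_delivery ~ distance + hour, data = deliveries)

# Predicted delivery time for an order placed at 12:00 from 7 km away
predict(fit2, newdata = data.frame(distance = 7, hour = 12))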

Regression spline (non-parametric)

Flexibility versus Interpretability

Figure 2.7 (ISL). Representation of the tradeoff between flexibility and interpretability across statistical learning methods. In general, as flexibility increases, interpretability decreases.

Regression tree

A different regression function

\begin{align} delivery \,\, time \approx \:&17\times\, I\left(\,order\,\,time < 13 \text{ hours } \right) + \notag \\ \:&22\times\, I\left(\,13\leq \, order\,\,time < 15 \text{ hours } \right) + \notag \\ \:&28\times\, I\left(\,order\,\,time \geq 15 \text{ hours and }distance < 6.4 \text{ kilometers }\right) + \notag \\ \:&36\times\, I\left(\,order\,\,time \geq 15 \text{ hours and }distance \geq 6.4 \text{ kilometers }\right)\notag \end{align}

  • The indicator function I(\cdot) is one if the logical statement is true and zero otherwise.

  • Two predictors (distance X_1 and order time X_2) were used in this case.
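
The rule above can be written directly as an R function, with each indicator expressed as a logical comparison (the cutoffs of 13 and 15 hours and 6.4 kilometers are those in the formula; the function itself is only an illustrative sketch):

# Piecewise-constant regression function corresponding to the tree
predict_time <- function(distance, hour) {
  17 * (hour < 13) +
  22 * (hour >= 13 & hour < 15) +
  28 * (hour >= 15 & distance < 6.4) +
  36 * (hour >= 15 & distance >= 6.4)
}

predict_time(distance = 7, hour = 16)   # last region: 36 minutes

In practice such a rule is estimated from the data, for example with rpart::rpart(time_to_delivery ~ distance + hour, data = deliveries).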

Partition of the predictors space (X_1,X_2)

Predictor X_3

Figure 2.2 in Kuhn, M. and Johnson, K. (2023), Applied Machine Learning for Tabular Data. https://aml4td.org/

Recommender Systems

The Netflix Prize

  • Competition started in October 2006. The data are ratings for 18,000 movies by 400,000 Netflix customers, each rating between 1 and 5.

  • The data are very sparse: about 98% of the ratings are missing.

  • The objective is to predict the ratings for a set of 1 million customer-movie pairs that are missing from the data.

  • Netflix’s original algorithm achieved a Root Mean Squared Error (RMSE) of 0.953. The first team to achieve a 10% improvement wins one million dollars.
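
The competition metric is straightforward to compute; a minimal sketch in R, using hypothetical vectors actual and pred of observed and predicted ratings:

# Root Mean Squared Error between observed and predicted ratings
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

rmse(actual = c(4, 3, 5, 2), pred = c(3.8, 3.4, 4.6, 2.5))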

Recommender Systems

  • Digital streaming services like Netflix and Amazon use data about the content that a customer has viewed in the past, as well as data from other customers, to suggest other content for the customer.

  • In order to suggest a movie that a particular customer might like, Netflix needed a way to impute the missing values of the customer-movie data matrix.

  • Principal Component Analysis (PCA) is at the heart of many recommender systems. Principal components can be used to impute the missing values, through a process known as matrix completion.
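
A rough sketch of the idea in R: fill the missing entries with a starting value, then repeatedly replace them with the values of a rank-1 (first principal component) approximation obtained from the singular value decomposition. The toy matrix and the number of iterations are made up for illustration; ISLR Chapter 12 develops this algorithm in detail.

# Toy customer-by-movie ratings matrix with missing entries (NA)
X <- matrix(c(5,  4, NA,  1,
              4, NA,  2,  1,
             NA,  5,  1, NA), nrow = 3, byrow = TRUE)

Xhat <- X
Xhat[is.na(X)] <- mean(X, na.rm = TRUE)     # start from the overall mean rating

for (i in 1:20) {
  s <- svd(Xhat)
  approx1 <- s$d[1] * s$u[, 1] %*% t(s$v[, 1])   # rank-1 approximation
  Xhat[is.na(X)] <- approx1[is.na(X)]            # update only the missing cells
}

Xhat[is.na(X)]   # imputed ratings for the missing customer-movie pairs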

Dimension Reduction

Heptathlon data

  • 100m hurdles

  • high jump

  • shot put

  • 200m race

  • long jump

  • javelin throw

  • 800m race

Results of the women’s heptathlon at the 1988 Olympics in Seoul are given in the next table (timed events are in seconds, distances in metres).

hurdles highjump shot run200m longjump javelin run800m score
Joyner-Kersee (USA) 12.69 1.86 15.80 22.56 7.27 45.66 128.51 7291
John (GDR) 12.85 1.80 16.23 23.65 6.71 42.56 126.12 6897
Behmer (GDR) 13.20 1.83 14.20 23.10 6.68 44.54 124.20 6858
Sablovskaite (URS) 13.61 1.80 15.23 23.92 6.25 42.78 132.24 6540
Choubenkova (URS) 13.51 1.74 14.76 23.93 6.32 47.46 127.90 6540
Schulz (GDR) 13.75 1.83 13.50 24.65 6.33 42.82 125.79 6411
Fleming (AUS) 13.38 1.80 12.88 23.59 6.37 40.28 132.54 6351
Greiner (USA) 13.55 1.80 14.13 24.48 6.47 38.00 133.65 6297
Lajbnerova (CZE) 13.63 1.83 14.28 24.86 6.11 42.20 136.05 6252
Bouraga (URS) 13.25 1.77 12.62 23.59 6.28 39.06 134.74 6252
Wijnsma (HOL) 13.75 1.86 13.01 25.03 6.34 37.86 131.49 6205
Dimitrova (BUL) 13.24 1.80 12.88 23.59 6.37 40.28 132.54 6171
Scheider (SWI) 13.85 1.86 11.58 24.87 6.05 47.50 134.93 6137
Braun (FRG) 13.71 1.83 13.16 24.78 6.12 44.58 142.82 6109
Ruotsalainen (FIN) 13.79 1.80 12.32 24.61 6.08 45.44 137.06 6101
Yuping (CHN) 13.93 1.86 14.21 25.00 6.40 38.60 146.67 6087
Hagger (GB) 13.47 1.80 12.75 25.47 6.34 35.76 138.48 5975
Brown (USA) 14.07 1.83 12.69 24.83 6.13 44.34 146.43 5972
Mulliner (GB) 14.39 1.71 12.68 24.92 6.10 37.76 138.02 5746
Hautenauve (BEL) 14.04 1.77 11.81 25.61 5.99 35.68 133.90 5734
Kytola (FIN) 14.31 1.77 11.66 25.69 5.75 39.48 133.35 5686
Geremias (BRA) 14.23 1.71 12.95 25.50 5.50 39.64 144.02 5508
Hui-Ing (TAI) 14.85 1.68 10.00 25.23 5.47 39.14 137.30 5290
Jeong-Mi (KOR) 14.53 1.71 10.83 26.61 5.50 39.26 139.17 5289
Launa (PNG) 16.42 1.50 11.78 26.16 4.88 46.38 163.43 4566

Goal

Determine a score to assign to each athlete that summarizes the performances across the seven events in order to obtain the final ranking, that is, to reduce the dimensionality from 7 to 1.

\underset{25 \times 7}{X} \mapsto \color{red}{\underset{25 \times 1}{y}}
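
A hedged sketch of using the first principal component as such a one-dimensional score in R, assuming the table above is available as a data frame called heptathlon (a dataset of this form ships, for example, with the HSAUR package):

# Standardize the seven event results and extract the principal components
events <- heptathlon[, c("hurdles", "highjump", "shot", "run200m",
                         "longjump", "javelin", "run800m")]
pca <- prcomp(events, scale. = TRUE)

# One-dimensional summary of the athletes: scores on the first component
pc1 <- pca$x[, 1]
cor(pc1, heptathlon$score)   # how closely PC1 tracks the official score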

\underset{243 \times 220}{X}

Image = data

  • An image (in black and white) can be represented as a data matrix (n rows \times p columns): \underset{n \times p}{X} where the grayscale intensity of each pixel is represented in the corresponding cell of the matrix.

  • Lighter colors are associated with higher values, while darker colors are associated with lower values (in the range [0,1]).

             V1        V2        V3        V4        V5        V6        V7
 [1,] 0.5098039 0.5098039 0.5098039 0.5098039 0.5098039 0.5098039 0.5098039
 [2,] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
 [3,] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
 [4,] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
 [5,] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
 [6,] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
 [7,] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
 [8,] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
 [9,] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[10,] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
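
Assuming this matrix is stored as an R matrix X, a minimal sketch of rendering it back as a grayscale picture:

# Display the matrix as a grayscale image
# (rows reversed and the matrix transposed so the picture appears upright)
image(t(X[nrow(X):1, ]), col = gray((0:255) / 255), axes = FALSE)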

Image compression

  • The original image consists of 53460 numbers (243 \times 220 pixels).

  • The compressed image consists of only 4850 numbers.
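
One way such a compression could be obtained is a truncated singular value decomposition. A sketch in R, assuming the 243 \times 220 image is stored in the matrix X; the number of components actually used for the compressed image is not stated, so k = 10 below is only an illustrative choice:

k <- 10
s <- svd(X)

# Rank-k approximation of the image
Xk <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

# Numbers that need to be stored: k columns of u, k columns of v, k singular values
k * nrow(X) + k * ncol(X) + k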

Supervised Versus Unsupervised

The Supervised Learning Problem

  • Outcome measurement Y (also called dependent variable, response, target).

  • Vector of p predictor measurements X=(X_1,X_2,\ldots,X_p) (also called inputs, regressors, covariates, features, independent variables).

  • In the regression problem, Y is quantitative (e.g., price, blood pressure).

  • In the classification problem, Y takes values in a finite, unordered set (survived/died, digit 0-9, cancer class of tissue sample).

  • We have training data (x_1, y_1), \ldots , (x_N , y_N ). These are observations (examples, instances) of these measurements.

Objectives

On the basis of the training data we would like to:

  • Accurately predict unseen test cases.

  • Understand which inputs affect the outcome, and how.

  • Assess the quality of our predictions.

Unsupervised learning

  • No outcome variable, just a set of predictors (features) measured on a set of samples.

  • Objective is more fuzzy — find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation.

  • Difficult to know how well you are doing.

  • Different from supervised learning, but can be useful as a pre-processing step for supervised learning.

Statistical Learning versus Machine Learning

  • Machine learning arose as a subfield of Artificial Intelligence.

  • Statistical learning arose as a subfield of Statistics.

  • There is much overlap - both fields focus on supervised and unsupervised problems:

  • Machine learning has a greater emphasis on large scale applications and prediction accuracy.

  • Statistical learning emphasizes models and their interpretability, and precision and uncertainty.

Course text

  • The course will cover some of the material in this Springer book (ISLR) published in 2021 (Second Edition).

  • Each chapter ends with an R lab, in which examples are developed.

  • An electronic version of this book is available from https://www.statlearning.com/

Required readings from the textbook and course materials

  • Chapter 1: Introduction

  • Chapter 2: Statistical Learning

    • 2.1 What Is Statistical Learning?
      • 2.1.1 Why Estimate f?
      • 2.1.2 How Do We Estimate f?
      • 2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability
      • 2.1.4 Supervised Versus Unsupervised Learning
      • 2.1.5 Regression Versus Classification Problems

Video SL 1.1 Opening Remarks - 18:19
Video SL 1.2 Examples and Framework - 12:13
Video SL 2.1 Introduction to Regression Models - 11:42
Video SL 2.2 Dimensionality and Structured Models - 11:41