Exercises Unit B - Applied

Bachelor’s Degree Programme in Philosophy, International and Economic Studies, Ca’ Foscari University of Venice.

Author: Aldo Solari

Affiliation: Department of Economics, Ca’ Foscari University of Venice

This question involves the use of simple linear regression on the Auto data set.

Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example:
1. Is there a relationship between the predictor and the response?
2. How strong is the relationship between the predictor and the response?
3. Is the relationship between the predictor and the response positive or negative?
4. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
Plot the response and the predictor. Use the abline() function to display the least squares regression line.
Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
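
The workflow above can be sketched as follows. Auto ships with the ISLR (or ISLR2) package, which may not be installed, so this sketch assumes the built-in mtcars data as a stand-in, with hp playing the role of horsepower. The numbers it produces are therefore not the answers for Auto; replace the data and variable names to do the exercise itself.

```r
# Stand-in data (assumption): mtcars is built into R.
# For the exercise, use Auto and horsepower instead of mtcars and hp.
dat <- mtcars

# Simple linear regression of mpg on hp, then the usual summary table.
fit <- lm(mpg ~ hp, data = dat)
print(summary(fit))

# Predicted mpg at hp = 98, with 95% confidence and prediction intervals.
new_obs  <- data.frame(hp = 98)
conf_int <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)
print(conf_int)
print(pred_int)

# Scatterplot of response against predictor with the least squares line.
plot(dat$hp, dat$mpg, xlab = "hp", ylab = "mpg")
abline(fit, col = "red")

# The four standard diagnostic plots, arranged in a 2 x 2 grid.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)
```

The confidence interval describes uncertainty about the mean response at the given predictor value, while the prediction interval also accounts for the error of a single new observation, so it is always the wider of the two.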

Recall that the coefficient estimate \hat{\beta} for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
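
One possible way to approach this exercise is sketched below. It assumes that (3.38) is the no-intercept estimate, the sum of x_i y_i divided by the sum of x_i^2, and checks that formula against lm(). The variable names (y_diff, y_same) and the choice of a random permutation for the second example are illustrative assumptions, not the only valid construction.

```r
set.seed(1)
x <- rnorm(100)

# Case 1: y is a noisy multiple of x, so sum(x^2) and sum(y^2) differ.
y_diff <- 2 * x + rnorm(100)
b_yx <- unname(coef(lm(y_diff ~ x + 0)))   # regression of Y onto X, no intercept
b_xy <- unname(coef(lm(x ~ y_diff + 0)))   # regression of X onto Y, no intercept
print(c(b_yx, b_xy))

# The lm() estimate agrees with the closed form sum(x * y) / sum(x^2).
print(all.equal(b_yx, sum(x * y_diff) / sum(x^2)))

# Case 2: y is a permutation of x, so sum(x^2) equals sum(y^2)
# and both directions share the numerator sum(x * y).
y_same <- sample(x)
c_yx <- unname(coef(lm(y_same ~ x + 0)))
c_xy <- unname(coef(lm(x ~ y_same + 0)))
print(c(c_yx, c_xy))
```

Both estimates share the numerator, so they can only differ through the denominators, which is the fact the two cases above exploit.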

In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.

Using the rnorm() function, create a vector x, containing 100 observations drawn from a N(0, 1) distribution. This represents a feature, X.
Using the rnorm() function, create a vector eps, containing 100 observations drawn from a N(0, 0.25) distribution, a normal distribution with mean zero and variance 0.25.
Using x and eps, generate a vector y according to the model

Y = -1 + 0.5X + \epsilon \qquad (3.39)

What is the length of the vector y? What are the values of \beta_0 and \beta_1 in this linear model?

Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \hat{\beta}_0 and \hat{\beta}_1 compare to \beta_0 and \beta_1?
Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.
Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit? Explain your answer.
Repeat (a)–(f) after modifying the data generation process in such a way that there is less noise in the data. The model (3.39) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term \epsilon in (b). Describe your results.
Repeat (a)–(f) after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term \epsilon in (b). Describe your results.
What are the confidence intervals for \beta_0 and \beta_1 based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.
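
The steps of this exercise can be sketched as below for the original noise level. One detail worth noting is that the second distribution argument of rnorm() is the standard deviation, not the variance, so a variance of 0.25 corresponds to sd = 0.5. The colors, legend labels, and object names are illustrative choices.

```r
set.seed(1)

# Feature and error term; variance 0.25 means sd = sqrt(0.25) = 0.5.
x   <- rnorm(100)
eps <- rnorm(100, mean = 0, sd = sqrt(0.25))

# Response generated from the model Y = -1 + 0.5 X + epsilon.
y <- -1 + 0.5 * x + eps

# Least squares fit and its coefficient table.
fit <- lm(y ~ x)
print(summary(fit))

# Scatterplot with the fitted line and the population line.
plot(x, y)
abline(fit, col = "red")
abline(a = -1, b = 0.5, col = "blue")
legend("topleft",
       legend = c("Least squares line", "Population line"),
       col = c("red", "blue"), lty = 1)

# Quadratic fit, compared with the linear fit by an F-test.
fit_quad <- lm(y ~ x + I(x^2))
quad_test <- anova(fit, fit_quad)
print(quad_test)

# 95% confidence intervals for the intercept and slope.
ci <- confint(fit)
print(ci)
```

To repeat the exercise with less or more noise, only the sd argument in the line defining eps needs to change; keeping the same seed makes the three data sets directly comparable.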

We will now perform cross-validation on a simulated data set.

set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)

In this data set, what is n and what is p? Write out the model used to generate the data in equation form.

Create a scatterplot of X against Y. Comment on what you find.
Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:
1. Y = \beta_0 + \beta_1 X + \epsilon
2. Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon
3. Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon
4. Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \epsilon

Note: You may find it helpful to use the data.frame() function to create a single data set containing both X and Y.

Repeat (c) using another random seed, and report your results. Are your results the same as what you got in (c)? Why?
Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer.
Comment on the statistical significance of the coefficient estimates that result from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?
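
A minimal sketch of the LOOCV computation follows, written as an explicit leave-one-out loop in base R so that each step is visible. The helper name loocv_error and the list of formulas are assumptions introduced for illustration; the cv.glm() function from the boot package is a common alternative.

```r
set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)

# A single data frame holding both variables, as suggested in the note.
dat <- data.frame(x = x, y = y)

# The four candidate models, from linear up to quartic.
formulas <- list(
  y ~ x,
  y ~ x + I(x^2),
  y ~ x + I(x^2) + I(x^3),
  y ~ x + I(x^2) + I(x^3) + I(x^4)
)

# Hypothetical helper: leave each observation out in turn, refit,
# predict the held-out point, and average the squared errors.
loocv_error <- function(formula, data) {
  n <- nrow(data)
  sq_err <- numeric(n)
  for (i in seq_len(n)) {
    fit_i <- lm(formula, data = data[-i, ])
    pred_i <- predict(fit_i, newdata = data[i, , drop = FALSE])
    sq_err[i] <- (data$y[i] - pred_i)^2
  }
  mean(sq_err)
}

cv_errors <- sapply(formulas, loocv_error, data = dat)
print(cv_errors)

# Coefficient tables for the significance question.
for (f in formulas) {
  print(coef(summary(lm(f, data = dat))))
}
```

For least squares fits the same quantity can be obtained without refitting, as the mean of the squared residuals each divided by (1 - h_i)^2, where h_i is the leverage returned by hatvalues(); the loop is kept here because it mirrors the definition of LOOCV directly.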