Skills Lab 10: Linear model (3)

Author

JM

Today’s Session

  • The linear model with multiple predictors

  • One-predictor model (recap)

  • Two-predictor model

    • Fitting the model

    • Comparing models

    • Interpreting the better model

    • Comparing predictors

  • Plotting???

  • Kahoot!

Tip

Kahoot quizzes are NOT assessed but are great practice for the assessed Canvas quizzes…and for the exam 😱

Materials

  • Try out your own solutions in this document

  • A solutions file will be uploaded to the website after the session.

Study

We have new (real) data today from the STAR project, which assessed the effect of class size on test scores.

Packages

library(broom)
library(ggplot2)
library(rempsyc)
library(AER)

Data

## Load the data from the AER package
data(STAR)

## Rename to tidy conventions
star_tib <- STAR

Codebook

This is a dataset that comes with a package, so we can use its name to search in the help documentation! Run the following in the Console.

?STAR

Research Question

What predicts a child’s reading score at a young age?

Outcome: Reading score

Possible predictors: Maths score, years of teacher experience

The Model with Multiple Predictors

The linear model with a single predictor should be very familiar by now:

\[y_i = b_0 + b_1x_{1i} + \epsilon\]

To include more predictors, we simply add them to the model. Each new predictor gets its own b, representing the relationship between that predictor and the outcome:

\[y_i = b_0 + b_1x_{1i} + b_2x_{2i} + b_3x_{3i} \text{...} + b_nx_{ni} + \epsilon\]
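We can see "each predictor gets its own b" in action with a quick sketch using simulated data (the variable names and true values here are made up for illustration, not from the STAR data):

```r
## Simulate an outcome built from two predictors with known bs
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 - 1.2 * x2 + rnorm(n, sd = 0.1)

## lm() recovers one b per predictor, plus the intercept b0
fit <- lm(y ~ x1 + x2)
coef(fit)  ## b0 close to 2, b1 close to 0.5, b2 close to -1.2
```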

One-Predictor Model

To begin, let’s look briefly at a single-predictor model containing only maths score as a predictor.

\[\text{Reading Score}_i = b_0 + b_1\text{Maths Score}_i + \epsilon\]

read_lm <- lm(read1 ~ math1, data = star_tib)
summary(read_lm)

Call:
lm(formula = read1 ~ math1, data = star_tib)

Residuals:
     Min       1Q   Median       3Q      Max 
-136.318  -26.391   -4.187   22.657  187.004 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 25.96108    5.89020   4.408 1.06e-05 ***
math1        0.93199    0.01106  84.295  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 37.94 on 6377 degrees of freedom
  (5219 observations deleted due to missingness)
Multiple R-squared:  0.527, Adjusted R-squared:  0.5269 
F-statistic:  7106 on 1 and 6377 DF,  p-value: < 2.2e-16
Pop Quiz!

Is maths score a significant predictor of reading score?

How much variance in reading score does the model explain?

Two-Predictor Model

Fitting the Model

To fit the model with two predictors, we literally add the new predictor into the formula!

read_full_lm <- lm(read1 ~ math1 + experience1, data = star_tib)

Comparing Models

Before we review the models in depth, we should choose the better of the two. This will be the model that best fits the data.

First, let’s compare each model to the null model.

broom::glance(read_lm)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
1     0.527         0.527  37.9     7106.       0     1 -32245. 64497. 64517.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(read_full_lm)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
1     0.529         0.529  37.9     3579.       0     2 -32233. 64473. 64500.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Pop Quiz!

How much variance does each model explain?

How much MORE variance does model 2 explain than model 1?

Is each model a significant improvement on the null model?

Of course we can just look at the numbers or type them out by hand, but we can also extract them using what we know about subsetting:

round(broom::glance(read_lm)$r.squared*100, 2)
[1] 52.7
round(broom::glance(read_full_lm)$r.squared*100, 2)
[1] 52.89
round((broom::glance(read_full_lm)$r.squared - broom::glance(read_lm)$r.squared)*100, 2)
[1] 0.19
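The same subsetting trick works for any fitted model. A quick sketch using a built-in dataset (mtcars, nothing to do with STAR) and base R's summary(), which stores R-squared under the same name:

```r
## Percentage of variance in fuel efficiency explained by car weight
fit <- lm(mpg ~ wt, data = mtcars)
round(summary(fit)$r.squared * 100, 2)
```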

To figure out which model is the better model, we can next compare the two models to each other using anova(). This is like using the first model as the “null” model and testing whether the second model is a significant improvement on it.

anova(read_lm, read_full_lm)
Analysis of Variance Table

Model 1: read1 ~ math1
Model 2: read1 ~ math1 + experience1
  Res.Df     RSS Df Sum of Sq     F    Pr(>F)    
1   6377 9181517                                 
2   6376 9145157  1     36360 25.35 4.912e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
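The F statistic in this table can be recomputed by hand: it is the change in residual sum of squares per degree of freedom spent, divided by model 2's residual variance. A sketch using the numbers printed above:

```r
## Recompute the F statistic from the anova() table
rss_1 <- 9181517           # residual SS, model 1 (6377 df)
rss_2 <- 9145157           # residual SS, model 2 (6376 df)
df_change <- 6377 - 6376   # one extra parameter in model 2

f_change <- ((rss_1 - rss_2) / df_change) / (rss_2 / 6376)
round(f_change, 2)         # matches the F of 25.35 in the table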
Pop Quiz!

Which is the better model: model 1 (with one predictor) or model 2 (with two predictors)?

Interpreting the Better Model

Now that we’ve chosen the better model, we’ll carry on with that model only.

Let’s look at the better model in more depth:

summary(read_full_lm)

Call:
lm(formula = read1 ~ math1 + experience1, data = star_tib)

Residuals:
     Min       1Q   Median       3Q      Max 
-133.978  -26.363   -4.008   22.523  182.076 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 23.07830    5.90681   3.907 9.44e-05 ***
math1        0.93158    0.01104  84.416  < 2e-16 ***
experience1  0.26681    0.05299   5.035 4.91e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 37.87 on 6376 degrees of freedom
  (5219 observations deleted due to missingness)
Multiple R-squared:  0.5289,    Adjusted R-squared:  0.5287 
F-statistic:  3579 on 2 and 6376 DF,  p-value: < 2.2e-16
Pop Quiz!

Which predictor(s) are significant in this model?

How can you interpret the bs?
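One way to make the bs concrete is to plug values into the fitted equation. A hedged sketch using the coefficients from the summary above (the pupil's maths score and the teacher's experience here are made-up values, purely for illustration):

```r
## Predicted reading score for a hypothetical pupil with a maths score
## of 500 and a teacher with 10 years of experience
b0     <- 23.07830
b_math <- 0.93158
b_exp  <- 0.26681

b0 + b_math * 500 + b_exp * 10
```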

Comparing Predictors

Finally, we might want to know which predictor has the strongest impact on the outcome. Because the predictors are measured in different units, we compare standardised bs, which put all predictors on the same scale.

parameters::model_parameters(read_full_lm, standardize = "refit")
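"Refit" standardisation simply z-scores the outcome and the predictors and refits the model, so the resulting bs are standardised bs. A minimal base-R sketch with simulated data (not the STAR data) showing the idea:

```r
## Simulate two predictors on very different scales
set.seed(1)
x1 <- rnorm(100, mean = 50, sd = 10)
x2 <- rnorm(100, mean = 5,  sd = 2)
y  <- 10 + 0.8 * x1 + 3 * x2 + rnorm(100, sd = 5)

## Standardised bs: z-score everything, then refit
fit_z <- lm(scale(y) ~ scale(x1) + scale(x2))
coef(fit_z)  # intercept is 0; slopes are now comparable across predictors
```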
Pop Quiz!

Which predictor(s) has the strongest impact on the outcome?

Plotting???

Warning

Note: 3D plots really should be used with caution, and this bit is just for fun - feel free to skip if you’re not inclined. If you’re keen to have a go, the code below gets you an interactive, useable 3D scatterplot in no time at all, but think carefully about whether this is really the best way to represent your data.

library(rgl)

rgl::plot3d(read_full_lm,
            col = "darkcyan", 
            plane.col = "darkmagenta",
            xlab = "Grade 1 Maths Score", 
            ylab = "Grade 1 Teacher Experience", 
            zlab = "Grade 1 Reading Score")

Kahoot!