library(broom)
library(ggplot2)
library(rempsyc)
library(AER)
Skills Lab 10: Linear model (3)
Today’s Session
The linear model with multiple predictors
One-predictor model (recap)
Two-predictor model
Fitting the model
Comparing models
Interpreting the better model
Comparing predictors
Plotting???
Kahoot!
Kahoot quizzes are NOT assessed but are great practice for the assessed Canvas quizzes…and for the exam 😱
Materials
Try out your own solutions in this document
A solutions file will be uploaded to the website after the session.
Study
We have new (real) data today from the STAR project, which assessed the effect of class size on test scores.
Packages
Data
## Load the data from the AER package
data(STAR)
## Rename to tidy conventions
star_tib <- STAR
Codebook
This is a dataset that comes with a package, so we can use its name to search in the help documentation! Run the following in the Console.
?STAR
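If you'd like a quick feel for the variables we'll be using today (not part of the original materials, just a handy check), you can summarise only those columns:
## Quick look at the outcome and predictors we'll use today
summary(star_tib[, c("read1", "math1", "experience1")])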
Research Question
What predicts a child’s reading score at a young age?
Outcome: Reading score
Possible predictors: Maths score, years of teacher experience
The Model with Multiple Predictors
The linear model with a single predictor should be very familiar by now:
\[y_i = b_0 + b_1x_{1i} + \epsilon\]
To include more predictors, we simply add them in to the model. Each new predictor will also have its own b, representing the relationship between that predictor and the outcome:
\[y_i = b_0 + b_1x_{1i} + b_2x_{2i} + b_3x_{3i} \text{...} + b_nx_{ni} + \epsilon\]
One-Predictor Model
To begin, let’s look briefly at a single-predictor model containing only maths score as a predictor.
\[\text{Reading Score}_i = b_0 + b_1\text{Maths Score}_i + \epsilon\]
read_lm <- lm(read1 ~ math1, data = star_tib)
summary(read_lm)
Call:
lm(formula = read1 ~ math1, data = star_tib)
Residuals:
Min 1Q Median 3Q Max
-136.318 -26.391 -4.187 22.657 187.004
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.96108 5.89020 4.408 1.06e-05 ***
math1 0.93199 0.01106 84.295 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 37.94 on 6377 degrees of freedom
(5219 observations deleted due to missingness)
Multiple R-squared: 0.527, Adjusted R-squared: 0.5269
F-statistic: 7106 on 1 and 6377 DF, p-value: < 2.2e-16
Is maths score a significant predictor of reading score?
How much variance in reading score does the model explain?
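If you want to check your answer for the significance question, one option (a small sketch using broom, which we've already loaded) is to tidy the coefficient table: the math1 row gives the estimate, t value, and p value.
## Tidy coefficient table for the one-predictor model
broom::tidy(read_lm)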
Two-Predictor Model
Fitting the Model
To fit the model with two predictors, we literally add in the new predictor to the formula!
read_full_lm <- lm(read1 ~ math1 + experience1, data = star_tib)
Comparing Models
Before we review the models in depth, we should choose the better of the two models. This will be the model that is the best fit to the data.
First, let’s compare each model to the null model.
broom::glance(read_lm)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.527 0.527 37.9 7106. 0 1 -32245. 64497. 64517.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(read_full_lm)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.529 0.529 37.9 3579. 0 2 -32233. 64473. 64500.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
How much variance does each model explain?
How much MORE variance does model 2 explain than model 1?
Is each model a significant improvement on the null model?
Of course we can just look at the numbers or type them out by hand, but we can also extract them using what we know about subsetting:
round(broom::glance(read_lm)$r.squared*100, 2)
[1] 52.7
round(broom::glance(read_full_lm)$r.squared*100, 2)
[1] 52.89
round((broom::glance(read_full_lm)$r.squared - broom::glance(read_lm)$r.squared)*100, 2)
[1] 0.19
To figure out which model is the better model, we can next compare the two models directly using anova(). This is as if we treated the first model as the “null” model and tested whether the second model is a significant improvement on it.
anova(read_lm, read_full_lm)
Analysis of Variance Table
Model 1: read1 ~ math1
Model 2: read1 ~ math1 + experience1
Res.Df RSS Df Sum of Sq F Pr(>F)
1 6377 9181517
2 6376 9145157 1 36360 25.35 4.912e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Which is the better model: model 1 (with one predictor) or model 2 (with two predictors)?
Interpreting the Better Model
Now that we’ve chosen the better model, we’ll carry on with that model only.
Let’s look at the better model in more depth:
summary(read_full_lm)
Call:
lm(formula = read1 ~ math1 + experience1, data = star_tib)
Residuals:
Min 1Q Median 3Q Max
-133.978 -26.363 -4.008 22.523 182.076
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.07830 5.90681 3.907 9.44e-05 ***
math1 0.93158 0.01104 84.416 < 2e-16 ***
experience1 0.26681 0.05299 5.035 4.91e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 37.87 on 6376 degrees of freedom
(5219 observations deleted due to missingness)
Multiple R-squared: 0.5289, Adjusted R-squared: 0.5287
F-statistic: 3579 on 2 and 6376 DF, p-value: < 2.2e-16
Which predictor(s) are significant in this model?
How can you interpret the bs?
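To get a feel for what the bs mean in practice, here's a small sketch (not part of the original exercise; the predictor values below are made up purely for illustration): we pull out the fitted coefficients and ask the model for a prediction.
## The fitted coefficients: b0 (intercept), b1 (math1), b2 (experience1)
coef(read_full_lm)
## Predicted reading score for a hypothetical child with a maths score of 500
## whose teacher has 10 years of experience (illustrative values only)
predict(read_full_lm, newdata = data.frame(math1 = 500, experience1 = 10))
Each b tells us how much the predicted reading score changes for a one-unit increase in that predictor, holding the other predictor constant.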
Comparing Predictors
Finally, we might want to know which predictor has the strongest impact on the outcome. To do this, we can use standardised bs.
parameters::model_parameters(read_full_lm, standardize = "refit")
Parameter | Coefficient | SE | 95% CI | t(6376) | p
--------------------------------------------------------------------
(Intercept) | 23.08 | 5.91 | [11.50, 34.66] | 3.91 | < .001
math1 | 0.93 | 0.01 | [ 0.91, 0.95] | 84.42 | < .001
experience1 | 0.27 | 0.05 | [ 0.16, 0.37] | 5.03 | < .001
Which predictor(s) have the strongest impact on the outcome?
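If you're curious what standardisation actually does, here's one way to get standardised bs by hand (a sketch; z-scoring the variables with scale() and refitting is the same idea as the "refit" method):
## Refit the model on z-scored variables: the slopes are now standardised betas,
## so they are directly comparable across predictors
read_std_lm <- lm(scale(read1) ~ scale(math1) + scale(experience1), data = star_tib)
broom::tidy(read_std_lm)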
Plotting???
Note: 3D plots really should be used with caution, and this bit is just for fun - feel free to skip if you're not inclined. If you're keen to have a go, the code below gets you an interactive, usable 3D scatterplot in no time at all, but think carefully about whether this is really the best way to represent your data.
library(rgl)
rgl::plot3d(read_full_lm,
            col = "darkcyan",
            plane.col = "darkmagenta",
            xlab = "Grade 1 Math Score",
            ylab = "Grade 1 Reading Score",
            zlab = "Grade 1 Teacher Experience")
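If the 3D plot isn't for you, a more conventional option (a sketch, not from the original materials) is a 2D scatterplot with the second predictor mapped onto colour, using ggplot2, which we loaded at the start:
## 2D alternative: reading vs maths score, with teacher experience as colour
ggplot(star_tib, aes(x = math1, y = read1, colour = experience1)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", colour = "darkmagenta") +
  labs(x = "Grade 1 Maths Score",
       y = "Grade 1 Reading Score",
       colour = "Teacher Experience (years)")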