# read data
gensex <- readr::read_csv("https://and.netlify.app/datasets/gensex_2022.csv")
Tutorial 10: Linear model (3)
Multiple predictors
Introduction
In the previous tutorial, we practised fitting a linear model with a single predictor to see if we can say anything about sexual attraction based on the level of romantic attraction someone experiences. This time, we’ll expand on that model by adding another predictor, and look at how to evaluate the models and predictors in comparison to each other. We will therefore be building more sophisticated models than those in the previous couple of tutorials.
Warm-up Quiz
Before we start, why not get in the mood by answering a few quiz questions!
As we saw this week, the general equation for the linear model with 2 predictors is
\[y_i=b_0 + b_1x_{1i} + b_2 x_{2i} + e_i\]
Or, if you prefer to explicitly show the multiplication sign, you could write:
\[y_i=b_0 + b_1\times x_{1i} + b_2 \times x_{2i} + e_i\]
In other words, it’s identical to the equation with a single predictor, with one additional term, \(b_2 x_{2i}\), which is the b value associated with the second predictor, \(x_{2}\).
So, we can expand the equation of the line by adding as many additional predictors as we like¹, just adding an additional b for each predictor.
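If it helps to see the equation in action, here is a small toy simulation (nothing to do with the gensex data; all the numbers and names are made up) showing how each y value is built from an intercept, two slopes, and an error term:

```r
# Toy simulation of y_i = b0 + b1*x1_i + b2*x2_i + e_i (made-up values)
set.seed(123)

x1 <- rnorm(100)   # first predictor
x2 <- rnorm(100)   # second predictor
e  <- rnorm(100)   # residual error

b0 <- 1            # intercept
b1 <- 0.5          # slope for the first predictor
b2 <- 2            # slope for the second predictor

y <- b0 + b1 * x1 + b2 * x2 + e

# fitting a linear model to the simulated data should recover bs
# close to the ones we put in
lm(y ~ x1 + x2)
```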
Setting Up
In this tutorial, we’ll return to the same model we fit last week and add another predictor, to see what we can learn from a bigger model.
Task 1
Load packages and data.
As last week, we’ll need tidyverse and broom.
- Load the necessary packages.
- Read in the gensex.csv dataset using the code chunk below.
- Explore the dataset to remind yourself of the variables it contains.
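In case you want something to check your code against, here is one way the setup could look (just a sketch; the read-in line is the same one shown at the top of this tutorial):

```r
library(tidyverse)   # loads readr, dplyr, ggplot2, etc.
library(broom)       # for tidy model summaries later on

# read in the data
gensex <- readr::read_csv("https://and.netlify.app/datasets/gensex_2022.csv")

# explore the dataset to remind yourself of its variables
dplyr::glimpse(gensex)
```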
One Predictor
Task 2
Now that we have our data ready, let’s again fit the same single-predictor model as last week, predicting the frequency of sexual attraction based on the frequency of romantic attraction.
- Fit the model using lm().
- Save its output into attract_lm.
- Print the result of attract_lm and note down the value of the intercept and the slope.
- Extract a summary of this model by piping the model into the summary() function.
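Here is a minimal sketch of what this could look like. The variable names sexual_freq and romantic_freq are placeholders, not the real column names - swap in the actual variables from the gensex dataset:

```r
# fit the single-predictor model (placeholder column names!)
attract_lm <- lm(sexual_freq ~ romantic_freq, data = gensex)

# print the model to see the intercept (b0) and the slope (b1)
attract_lm

# pipe the model into summary() for coefficients, t-tests, R-squared and F
attract_lm |> summary()
```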
This is the exact same model as last week, so we won’t look at it in depth here; have a look at last week’s tutorial for a reminder if you’re not sure how to interpret this output.
Now that you’ve refreshed your memory on how to build a single-predictor model, let’s extend it by adding another predictor.
Two Predictors
We might suspect that frequency of romantic attraction isn’t the only thing that predicts the frequency of sexual attraction. We may also, for instance, hypothesise that people who experience sexual attraction more strongly also experience it more frequently than people who experience it less strongly (for example, asexual or demisexual people, who experience sexual attraction seldom, weakly, or not at all).
Task 3
- Fit the new, two-predictor model using lm().
- Save its output into attract_full_lm.
- Produce summaries of attract_full_lm.
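As before, a sketch with placeholder variable names (sexual_freq, romantic_freq, sexual_strength) - replace them with the actual columns in the gensex data:

```r
# fit the two-predictor model (placeholder column names!)
attract_full_lm <- lm(sexual_freq ~ romantic_freq + sexual_strength,
                      data = gensex)

# summaries of the model
attract_full_lm |> summary()
attract_full_lm |> broom::tidy()     # tidy table of coefficients
attract_full_lm |> broom::glance()   # one-row model summary incl. R-squared
```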
All of this output for the two-predictor model will look almost exactly like the one-predictor model, with one major difference: there is an additional row in the summary() and broom::tidy() outputs for the new predictor. Have a quick look at the new output.
As you have a look at this output, remember that the interpretation of the slopes in a multi-predictor model is different from that in a single-predictor model.
In our particular model, the b coefficient for strength of sexual attraction represents the increase (or decrease) in frequency of sexual attraction while also accounting for the frequency of romantic attraction. Conversely, the b coefficient for frequency of romantic attraction represents the increase (or decrease) in frequency of sexual attraction while also accounting for strength of sexual attraction.
Comparing Models
We now have two models: one with one predictor, and one with two. How do we know which one is “right”, or “best”, or “true”? It might be tempting to think that the model with more predictors is always better, but this isn’t necessarily the case. What we’d like to produce is a parsimonious model: essentially, we want a model that explains as much as possible, with as few predictors as possible. In other words, we don’t want a model that’s dragging around tons of extra predictors that aren’t actually helping.
So, how can we find out whether the second model, with two predictors, is an improvement on the first model with one? Of course, we can do some statistics about it!
R² Change
The first thing we can investigate is the change in R² between the two models. Remember that R² is the proportion of variance in the outcome accounted for by the model. If the second model is better than the first, then it should explain more of the variance. So, let’s have a look.
Task 4
Using whatever output you like, obtain the R² values for the first model, attract_lm, and the second model, attract_full_lm. Make sure you convert these decimal proportions into percentages, then round to two decimal places.
Then, answer the questions below.
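One way to do this, assuming you saved the models as attract_lm and attract_full_lm, is to pull the r.squared column out of broom::glance():

```r
# R-squared for each model
r2_single <- broom::glance(attract_lm)$r.squared
r2_full   <- broom::glance(attract_full_lm)$r.squared

# as percentages, rounded to two decimal places
round(r2_single * 100, 2)
round(r2_full * 100, 2)

# the change in R-squared between the two models, as a percentage
round((r2_full - r2_single) * 100, 2)
```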
This change in R² represents the increased amount of explained variance between the second model and the first. This improvement is a lot - that’s great! But is it enough to think that the second model is better than the first? In other words, how do we know how much R² change is enough?
It might be tempting to think that if we want to know the total proportion of variance explained by the two predictors, we can simply sum the R² from model 1 and from model 2. This would not be quite right. Remember that R² can go from 0 to 1, or, in percentage terms, from 0 to 100. Imagine a situation where a single-predictor model with frequency of romantic attraction explains a whopping 53% of variance. At the same time, a single-predictor model with strength of sexual attraction explains an even more whopping 61% of variance (unlikely, but it can happen). If we were to sum these, we’d end up with a value of 114%, explaining variance that isn’t even there!
When we’re considering multiple predictors, some of the variance accounted for by these predictors will be shared between them. We therefore need to consider the R² from the full model, not from the individual models.
As we’ve seen before, in statistics we often aren’t satisfied with only the number itself; we want to know whether this number is big or small in the wider context. For that, we’ll need a test.
F-change
Last week we encountered the F-statistic for the first time. This is the test of the null hypothesis that the model we’ve created is no better at explaining the outcome than a model with no predictors at all. If it is significant, we conclude that adding whatever predictors we added to create that model was “worth the effort”, so to speak. If it is not significant, we conclude that adding those predictors was not worth the effort - in other words, adding in the predictors didn’t get us a better model than we would have had with no predictors at all!
This time around, we’re going to do something very similar. However, instead of comparing each model to the null model with no predictors, we will be comparing our models to each other. The output and information we get from this F-change test will look quite a bit like the ones from last week, but with a slightly different interpretation:
- If the model comparison F-test is significant, we conclude that adding whatever predictors we added to create the second model results in a significantly better model than the model without them. In other words, the second model with more predictors is better than the first model with fewer.
- In this case, we should retain the second model as the better model.
- If the model comparison F-test is not significant, we conclude that adding those predictors to create the second model did not result in a better model. In other words, the second model is no better than the first.
- In this case, we should retain the first model as the better model.
This process can only be done with models that are hierarchical, or nested. This means all of the predictors that were in the first model must also be in the second. If this is not the case, you will get an error!
Task 5
Use the anova() function to compare the two models, attract_lm and attract_full_lm, and print out the output.
Then, answer the questions below.
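The comparison itself is a single line - the simpler (nested) model goes first:

```r
# compare the nested models; the F and Pr(>F) columns tell us whether the
# two-predictor model is a significant improvement over the one-predictor model
anova(attract_lm, attract_full_lm)
```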
Comparing Predictors
So far we’ve seen that we can add more predictors to a linear model and then test whether the bigger, more complex model with more predictors is a better model than the one with fewer predictors. In our analysis so far, we’ve seen that the second model, attract_full_lm, with two predictors, is a better model than the first, so we’ll focus on it from here on out.
The last key piece of information we want to obtain before we wrap up and report our analysis involves comparing predictors. We already had a look at which predictors in the second model were significant, but we might be particularly interested to know which of the two of them has a more substantial impact on the outcome.
As discussed in the lectures, we can’t use the normal, unstandardised bs that we have been using and reporting so far for this. Instead, we need standardised Bs, measured in standard deviation units. Essentially, we can use standardised Bs to directly compare the impact of predictors to each other, simply by identifying which predictor has the largest absolute value (i.e. ignoring whether the numbers are positive or negative).
Task 6
- Put the model into the parameters::model_parameters() function.
- Add the argument standardize = "refit" to output standardised Bs.
Which of the two predictors has the strongest effect on the outcome?
Note that you must spell standardize the American way with a “z”, otherwise the argument will not do its job!
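Putting those two steps together, the call could look something like this (again assuming the two-predictor model is saved as attract_full_lm):

```r
# standardised Bs for the two-predictor model
attract_full_lm |>
  parameters::model_parameters(standardize = "refit")
```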
Reporting
To conclude, we’re going to write up the results of the analysis we’ve done. Use the output you have produced to fill in the blanks below, and make sure to include a copy of the final report in your Quarto document.
Very well done with all your hard work - you made it to the end!
If you’re still not ready to be done, we have one last ChallengR for you to finish the term in style.
ChallengR
This task is a ChallengR. ChallengRs are always optional and will never be assessed - they’re only there to inspire you to try new things! If you solve this task successfully, you can earn a bonus 2500 Kahoot! Points. You can use those points to earn bragging rights and, more importantly, shiny stickers. (See the Games and Awards page on Canvas.)
There are no solutions in this document for this ChallengR task. If you get stuck, ask us for help in your practicals or at the Help Desk, and we’ll be happy to point you in the right direction.
This last ChallengR is a doozy. You’ll need to draw on skills from across the module to get through it. Enter if you dare…
To begin the ChallengR, open the Great Escape in a new tab, or use the code chunks in your tutorial workbook to open it. Read the rules, then begin the challenge by clicking the “I am ready!” button at the bottom of the page.
Note that you have a maximum of 1.5 hours to complete the Great Escape, so make sure you have enough time set aside!
To earn your Kahoot! Points, you must complete the escape challenge in the time available and add your Kahoot! username to the leaderboard. If you also want to include your real name, you could do this with your Kahoot! username as your surname. The leaderboard is publicly visible on the Internet, though, so be careful if you want to remain anonymous. Remember that you must enter your Kahoot! username exactly in order to get your points!
If you want to get the ChallengR points but don’t want to add your name to the leaderboard, or if your Kahoot name contains any special characters like emojis, enter whatever name you would like on the leaderboard and then email analysingdata.psychology@sussex.ac.uk to tell us which name is yours.
Good luck…
Footnotes
In practice, this isn’t the best idea - or, rather, the predictor(s) you add should be theoretically valuable and/or interesting, and/or supported by evidence. We’ll come back to this in second year!↩︎