# read data
gensex <- readr::read_csv("https://and.netlify.app/datasets/gensex_2022.csv")
Tutorial 10: Linear model (3)
Multiple predictors
Introduction
In the previous tutorial, we practised fitting a linear model with a single predictor to see if we can say anything about sexual attraction based on the level of romantic attraction someone experiences. This time, we’ll expand on that model by adding another predictor, and look at how to evaluate the models and predictors in comparison to each other. We will therefore be building more sophisticated models than those in the previous couple of tutorials.
Warm-up Quiz
Before we start, why not get in the mood by answering a few quiz questions!
As we saw this week, the general equation for the linear model with 2 predictors is
\[y_i=b_0 + b_1x_{1i} + b_2 x_{2i} + e_i\]
Or, if you prefer to explicitly show the multiplication sign, you could write:
\[y_i=b_0 + b_1\times x_{1i} + b_2 \times x_{2i} + e_i\]
In other words, it’s identical to the equation with a single predictor, with one additional term, \(b_2 x_{2i}\), which is the b value associated with the second predictor, \(x_{2}\).
So, we can expand the equation of the line by adding as many additional predictors as we like¹, just adding an additional b for each predictor.
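If it helps to see the equation in action, here is a small toy simulation (nothing to do with the gensex data; all the numbers and names are made up) showing how each y value is built from an intercept, two slopes, and an error term:

```r
# Toy simulation of y_i = b0 + b1*x1_i + b2*x2_i + e_i (made-up values)
set.seed(123)

x1 <- rnorm(100)   # first predictor
x2 <- rnorm(100)   # second predictor
e  <- rnorm(100)   # residual error

b0 <- 1            # intercept
b1 <- 0.5          # slope for the first predictor
b2 <- 2            # slope for the second predictor

y <- b0 + b1 * x1 + b2 * x2 + e

# fitting a linear model to the simulated data should recover bs
# close to the ones we put in
lm(y ~ x1 + x2)
```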
Setting Up
In this tutorial, we’ll return to the same model we fit last week and add another predictor, to see what we can learn from a bigger model.
Task 1
Load packages and data.
As last week, we’ll need tidyverse and broom.
- Load the necessary packages.
- Read in the gensex.csv dataset using the code chunk below.
- Explore the dataset to remind yourself of the variables it contains.
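In case you want something to check your code against, here is one way the setup could look (just a sketch; the read-in line is the same one shown at the top of this tutorial):

```r
library(tidyverse)   # loads readr, dplyr, ggplot2, etc.
library(broom)       # for tidy model summaries later on

# read in the data
gensex <- readr::read_csv("https://and.netlify.app/datasets/gensex_2022.csv")

# explore the dataset to remind yourself of its variables
dplyr::glimpse(gensex)
```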
One Predictor
Task 2
Now that we have our data ready, let’s again fit the same single-predictor model as last week, predicting the frequency of sexual attraction based on the frequency of romantic attraction.
- Fit the model using lm().
- Save its output into attract_lm.
- Print the result of attract_lm and note down the value of the intercept and the slope.
- Extract a summary of this model by piping the model into the summary() function.
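Here is a minimal sketch of what this could look like. The variable names sexual_freq and romantic_freq are placeholders, not the real column names - swap in the actual variables from the gensex dataset:

```r
# fit the single-predictor model (placeholder column names!)
attract_lm <- lm(sexual_freq ~ romantic_freq, data = gensex)

# print the model to see the intercept (b0) and the slope (b1)
attract_lm

# pipe the model into summary() for coefficients, t-tests, R-squared and F
attract_lm |> summary()
```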
This is the exact same model as last week, so we won’t look at it in depth here; have a look at last week’s tutorial for a reminder if you’re not sure how to interpret this output.
Now that you’ve refreshed your memory on how to build a single-predictor model, let’s extend it by adding another predictor.
Two Predictors
We might suspect that frequency of romantic attraction isn’t the only thing that predicts the frequency of sexual attraction. We may also, for instance, hypothesise that people who experience sexual attraction more strongly also experience it more frequently than people who experience it less strongly (for example, asexual or demisexual people, who experience sexual attraction seldom, weakly, or not at all).
Task 3
- Fit the new, two-predictor model using lm().
- Save its output into attract_full_lm.
- Produce summaries of attract_full_lm.
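As before, a sketch with placeholder variable names (sexual_freq, romantic_freq, sexual_strength) - replace them with the actual columns in the gensex data:

```r
# fit the two-predictor model (placeholder column names!)
attract_full_lm <- lm(sexual_freq ~ romantic_freq + sexual_strength,
                      data = gensex)

# summaries of the model
attract_full_lm |> summary()
attract_full_lm |> broom::tidy()     # tidy table of coefficients
attract_full_lm |> broom::glance()   # one-row model summary incl. R-squared
```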
All of this output for the two-predictor model will look almost exactly like the one-predictor model, with one major difference: there is an additional row in the summary() and broom::tidy() outputs for the new predictor. Have a quick look at the new output.
As you have a look at this output, remember that the interpretation of the slopes in a multi-predictor model is different from that in a single-predictor model.
In our particular model, the b coefficient for strength of sexual attraction represents the increase (or decrease) in frequency of sexual attraction while also accounting for the frequency of romantic attraction. Conversely, the b coefficient for frequency of romantic attraction represents the increase (or decrease) in frequency of sexual attraction while also accounting for strength of sexual attraction.
Comparing Models
We now have two models: one with one predictor, and one with two. How do we know which one is “right”, or “best”, or “true”? It might be tempting to think that the model with more predictors is always better, but this isn’t necessarily the case. What we’d like to produce is a parsimonious model: essentially, we want a model that explains as much as possible, with as few predictors as possible. In other words, we don’t want a model that’s dragging around tons of extra predictors that aren’t actually helping.
So, how can we find out whether the second model, with two predictors, is an improvement on the first model with one? Of course, we can do some statistics about it!
R² Change
The first thing we can investigate is the change in R² between the two models. Remember that R² is the proportion of variance in the outcome accounted for by the model. If the second model is better than the first, then it should explain more of the variance. So, let’s have a look.
Task 4
Using whatever output you like, obtain the R² values for the first model, attract_lm, and the second model, attract_full_lm. Make sure you convert these decimal proportions into percentages, then round to two decimal places.
Then, answer the questions below.
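One way to do this, assuming you saved the models as attract_lm and attract_full_lm, is to pull the r.squared column out of broom::glance():

```r
# R-squared for each model
r2_single <- broom::glance(attract_lm)$r.squared
r2_full   <- broom::glance(attract_full_lm)$r.squared

# as percentages, rounded to two decimal places
round(r2_single * 100, 2)
round(r2_full * 100, 2)

# the change in R-squared between the two models, as a percentage
round((r2_full - r2_single) * 100, 2)
```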
This change in R² represents the increased amount of explained variance between the second model and the first. This improvement is a lot - that’s great! But is it enough to think that the second model is better than the first? In other words, how do we know how much R² change is enough?
It might be tempting to think that if we want to know the total proportion of variance explained by the two predictors, we can simply sum the R² from model 1 and from model 2. This would not be quite right. Remember that R² can go from 0 to 1, or, in percentage terms, from 0 to 100. Imagine a situation where a single-predictor model with frequency of romantic attraction explains a whopping 53% of variance. At the same time, a single-predictor model with strength of sexual attraction explains an even more whopping 61% of variance (unlikely, but it can happen). If we were to sum these, we’d end up with a value of 114%, explaining variance that isn’t even there!
When we’re considering multiple predictors, some of the variance accounted for by these predictors will be shared between them. We therefore need to consider the R² from the full model, not from the individual models.
As we’ve seen before, in statistics we often aren’t satisfied with only the number itself; we want to know whether this number is big or small in the wider context. For that, we’ll need a test.
F-change
Last week we encountered the F-statistic for the first time. This is the test of the null hypothesis that the model we’ve created is no better at explaining the outcome than a model with no predictors at all. If it is significant, we conclude that adding whatever predictors we added to create that model was “worth the effort”, so to speak. If it is not significant, we conclude that adding those predictors was not worth the effort - in other words, adding in the predictors didn’t get us a better model than we would have had with no predictors at all!
This time around, we’re going to do something very similar. However, instead of comparing each model to the null model with no predictors, we will be comparing our models to each other. The output and information we get from this F-change test will look quite a bit like the ones from last week, but with a slightly different interpretation:
- If the model comparison F-test is significant, we conclude that adding whatever predictors we added to create the second model results in a significantly better model than the model without them. In other words, the second model with more predictors is better than the first model with fewer.
- In this case, we should retain the second model as the better model.
- If the model comparison F-test is not significant, we conclude that adding those predictors to create the second model did not result in a better model. In other words, the second model is no better than the first.
- In this case, we should retain the first model as the better model.
This process can only be done with models that are hierarchical, or nested. This means all of the predictors that were in the first model must also be in the second. If this is not the case, you will get an error!
Task 5
Use the anova() function to compare the two models, attract_lm and attract_full_lm, and print out the output.
Then, answer the questions below.
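The comparison itself is a single line - the simpler (nested) model goes first:

```r
# compare the nested models; the F and Pr(>F) columns tell us whether the
# two-predictor model is a significant improvement over the one-predictor model
anova(attract_lm, attract_full_lm)
```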
Comparing Predictors
So far we’ve seen that we can add more predictors to a linear model and then test whether the bigger, more complex model with more predictors is a better model than the one with fewer predictors. In our analysis so far, we’ve seen that the second model, attract_full_lm, with two predictors, is a better model than the first, so we’ll focus on it from here on out.
The last key piece of information we want to obtain before we wrap up and report our analysis involves comparing predictors. We already had a look at which predictors in the second model were significant, but we might be particularly interested to know which of the two of them has a more substantial impact on the outcome.
As discussed in the lectures, we can’t use the normal, unstandardised bs that we have been using and reporting so far for this. Instead, we need standardised Bs, measured in standard deviation units. Essentially, we can use standardised Bs to directly compare the impact of predictors to each other, simply by identifying which predictor has the largest absolute value (i.e. ignoring whether the numbers are positive or negative).
Task 6
- Put the model into the parameters::model_parameters() function.
- Add the argument standardize = "refit" to output standardised Bs.
Which of the two predictors has the strongest effect on the outcome?
Note that you must spell standardize the American way with a “z”, otherwise the argument will not do its job!
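Putting those two steps together, the call could look something like this (again assuming the two-predictor model is saved as attract_full_lm):

```r
# standardised Bs for the two-predictor model
attract_full_lm |>
  parameters::model_parameters(standardize = "refit")
```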
Reporting
To conclude, we’re going to write up the results of the analysis we’ve done. Use the output you have produced to fill in the blanks below, and make sure to include a copy of the final report in your Quarto document.
Very well done with all your hard work - you made it to the end!
If you’re still not ready to be done, we have one last ChallengR for you to finish the term in style.
ChallengR
This task is a ChallengR. ChallengRs are always optional and will never be assessed - they’re only there to inspire you to try new things! If you solve this task successfully, you can earn a bonus 2500 Kahoot! Points. You can use those points to earn bragging rights and, more importantly, shiny stickers. (See the Games and Awards page on Canvas.)
There are no solutions in this document for this ChallengR task. If you get stuck, ask us for help in your practicals or at the Help Desk, and we’ll be happy to point you in the right direction.
This last ChallengR is a doozy. You’ll need to draw on skills from across the module to get through it. Enter if you dare…
To begin the ChallengR, open the Great Escape in a new tab, or use the code chunks in your tutorial workbook to open it. Read the rules, then begin the challenge by clicking the “I am ready!” button at the bottom of the page.
Note that you have a maximum of 1.5 hours to complete the Great Escape, so make sure you have enough time set aside!
To earn your Kahoot! Points, you must complete the escape challenge in the time available and add your Kahoot! username to the leaderboard. If you also want to include your real name, you could do this with your Kahoot! username as your surname. The leaderboard is publicly visible on the Internet, though, so be careful if you want to remain anonymous. Remember that you must enter your Kahoot! username exactly in order to get your points!
If you want to get the ChallengR points but don’t want to add your name to the leaderboard, or if your Kahoot name contains any special characters like emojis, enter whatever name you would like on the leaderboard and then email analysingdata.psychology@sussex.ac.uk to tell us which name is yours.
Good luck…
Footnotes
In practice, this isn’t the best idea - or, rather, the predictor(s) you add should be theoretically valuable and/or interesting, and/or supported by evidence. We’ll come back to this in second year!↩︎