Skills Lab 02: Data cleaning
Setup
Packages
library(tidyverse)
Revision: Logical Assertions
We’ll be making use again of logical assertions. These are statements that can be either TRUE or FALSE - they return logical data.
Get R to return TRUE and FALSE however you like.
## Any assertions will do!
20 > -20
[1] TRUE
"black" == "white"
[1] FALSE
## Or be cheeky about it...
TRUE
[1] TRUE
FALSE
[1] FALSE
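Assertions also work element-wise on whole vectors, which is what makes them so useful for data cleaning. Here’s a small sketch using a made-up vector of ages (not from the lab data), including dplyr::between(), which we’ll meet again at the end of this lab:

```r
## Assertions are vectorised: you get one TRUE/FALSE per element
ages <- c(17, 19, 23, 102)

ages >= 18
#> [1] FALSE  TRUE  TRUE  TRUE

dplyr::between(ages, 18, 99)  # TRUE if the value falls in the range (inclusive)
#> [1] FALSE  TRUE  TRUE FALSE
```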
Data
Run this code chunk to read the data into your Environment.
anx_data <- readr::read_csv("data/anx_data.csv")
Codebook
Run this code chunk to open the Codebook in the Viewer tab on Posit Cloud, or open the Codebook in a browser here.
ricomisc::rstudio_viewer("skills_lab_02_codebook.html", "data")
Data Inspection
Inspect your dataset, and compare it to the Codebook. Identify any discrepancies between the two, using the Codebook as a guide for what your dataset should look like, how it needs to be cleaned, and what variables it should contain.
Then, make a list below of the steps you must take to clean or change your dataset so that it matches the Codebook. Consider questions like:
- Are the variables the right type?
- Do they contain the right information?
- Do any variables need to be created or recoded?
- Does the codebook mention any responses that should be excluded?
Note: For now, you can treat character data as categorical. (We’ll get into this more in second year!)
View the Dataset
Open the dataset in a new tab (recommended).
View(anx_data)
Check a quick summary of each variable.
summary(anx_data)
id distribution consent gender
Length:465 Length:465 Length:465 Min. :1.000
Class :character Class :character Class :character 1st Qu.:1.000
Mode :character Mode :character Mode :character Median :1.000
Mean :1.647
3rd Qu.:2.000
Max. :4.000
age test_anxiety_score part_anxiety_score sticsa_trait_score
Min. : 16.00 Min. :1.125 Min. :1.000 Min. :1.000
1st Qu.: 18.00 1st Qu.:2.875 1st Qu.:2.000 1st Qu.:1.714
Median : 19.00 Median :3.500 Median :2.545 Median :2.190
Mean : 22.72 Mean :3.421 Mean :2.603 Mean :2.208
3rd Qu.: 20.00 3rd Qu.:4.000 3rd Qu.:3.182 3rd Qu.:2.619
Max. :230.00 Max. :5.000 Max. :5.000 Max. :3.905
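Note that summary() only reports the length, class, and mode of character columns. As an optional extra check (a suggestion on top of the lab instructions, not part of them), you could also tabulate those columns, plus the coded gender column, to compare the actual values against the Codebook:

```r
## Count the distinct values of a few columns to spot unexpected codes
dplyr::count(anx_data, distribution)
dplyr::count(anx_data, consent)
dplyr::count(anx_data, gender)
```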
To-Do List
- Calculate a new variable: ac_anxiety_score is described in the Codebook but does not exist in the dataset. It should be created by taking the mean of the other two academic anxiety score variables.
- Recode a variable: sticsa_trait_cat is described in the Codebook but does not exist in the dataset. It should be created by splitting the STICSA trait score into “clinical” and “non-clinical” categories using a cutoff from the literature.
- Relabel a variable: gender is not formatted correctly (should be character, not numeric).
- Keep correct cases: remove incorrect or ethically inadmissible cases.
In the session, I mentioned “keeping a record of your code”, or “documenting code”. What does this mean?
There are two different things this can refer to:
First, although it seems obvious, writing down all of your code in full and in order. Especially when you are trying to solve a thorny problem or debug a baffling error, it’s easy and common to try lots of different solutions and fixes before landing on a good resolution. It’s really essential to clean up your code at this point, remove any redundant code and dead ends, and ensure the full and correct solution is included in your document so that it works seamlessly with the rest of your code. If you don’t do this, you can end up in the situation I described: I’d somehow introduced a mistake in my data that I couldn’t track down, because I hadn’t recorded the code I used. I simply had to throw out all the work I’d done, go back to before the problem, and work through all the steps again. Learn from my mistakes!
Second, annotating your code with comments. This is a massive favour to Future You as well as anyone else who might want to understand or use your code in the future. Your commenting can be as detailed as you like, but keep in mind that even though you may understand your code now, it won’t be so obvious even a day or week later. Personally, I typically comment each code section, so I know the overall purpose and result of each major bit of code; and I also often put comments on particular lines that are difficult to read or understand quickly, so that I don’t have to puzzle through what they mean later.
Here’s a genuine example, with real comments, from some code I wrote recently to create pages on Canvas with R. (Don’t worry about understanding what the code does - this is only about the comments!) The first comment ## Figure out which pages... tells me what the subsequent commands (are meant to) do, and the in-line comments tell me about individual lines of code and why I included them. This is often because I had to make changes to the code I wrote at first to solve unexpected problems, and I want to remember why that was necessary to make things easier in the future.
```{r}
## Figure out which pages should be published/generated
pages <- cnvs::get_pages(module_id)
these_pages <- term_times |>
  dplyr::rowwise() |>
  ## Only generate pages that should be published AND don't already exist
  dplyr::filter(publish_page == TRUE &
                  !any(grepl(page_url, gsub("-", "_", pages$url)))) ## gsub because Canvas uses hyphens
```
Tasks
Calculate a New Variable
The variable ac_anxiety_score needs to be calculated from other variables in the dataset. Specifically, it is the mean of the test and participation anxiety scores. Since the mean is calculated as the sum of all values divided by the number of values, we can write this out as an expression using mutate().
dplyr::mutate(
  anx_data,
  ac_anxiety_score = (test_anxiety_score + part_anxiety_score)/2
)
# A tibble: 465 × 9
id distribution consent gender age test_anxiety_score part_anxiety_score
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1U3uo preview Yes 1 19 3.62 2.45
2 o0nLV preview Yes 1 19 1.38 1.09
3 KQSrs preview <NA> 2 21 4.12 3.27
4 GwzPE preview Yes 3 19 4.25 3.36
5 VAzZ4 preview Yes 1 19 3.12 2.18
6 UJPCB preview Yes 1 18 4.88 4.18
7 nRG2C preview <NA> 1 20 2.25 2.55
8 UfHKg preview Yes 4 18 4.88 3.36
9 D09VK preview Yes 1 20 4.38 2
10 ulrFC preview Yes 1 18 4 2.55
# ℹ 455 more rows
# ℹ 2 more variables: sticsa_trait_score <dbl>, ac_anxiety_score <dbl>
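Before assigning anything, it’s a good habit to spot-check a row or two by hand. For instance, for the first participant printed above (a quick check I’m adding here, not part of the original lab code; the printed scores are rounded, so the stored value may differ very slightly):

```r
## First row: test_anxiety_score is about 3.62, part_anxiety_score about 2.45
(3.62 + 2.45) / 2
#> [1] 3.035
```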
Once we’re happy that the calculation has been done correctly, we should assign our changes to the dataset. To do this, we add anx_data <- at the very beginning of our command. This will overwrite the existing anx_data dataset with the new version we created that includes the new ac_anxiety_score variable.
anx_data <- dplyr::mutate(
  anx_data,
  ac_anxiety_score = (test_anxiety_score + part_anxiety_score)/2
)
This last assigning step is essential, but easy to forget! If you don’t assign your changes to anything, they won’t take effect in your dataset.
For every data wrangling task like this one:
- Identify what needs doing
- Write code to do it
- Check the output carefully to confirm your code does exactly and only what you wanted
- Assign the output to “save” your changes.
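As a sketch of the checking step for the variable we just created (these particular checks are only suggestions; any inspection that convinces you the calculation is right will do):

```r
## Confirm the new variable exists and its values are in a sensible range
summary(anx_data$ac_anxiety_score)

## ...and eyeball it alongside the two variables it was calculated from
dplyr::select(anx_data, test_anxiety_score, part_anxiety_score, ac_anxiety_score)
```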
Recode a Variable
One of the variables that this dataset contains is called sticsa_trait_score, which is a measure of trait anxiety from the State-Trait Inventory for Cognitive and Somatic Anxiety (Ree et al., 2008). The Codebook also contains a categorical version of this variable, indicating whether or not participants display clinical levels of anxiety. We can recode the sticsa_trait_score variable into a new sticsa_trait_cat variable with two categories, “clinical” and “non-clinical”, using 2.047619 as the cutoff indicating clinical levels of anxiety (Van Dam et al., 2013).
For this task, we want to apply different operations for different rows, so we use dplyr::case_when() to decide how to do this.
anxiety_cutoff <- 2.047619                                  # <1>

anx_data <- dplyr::mutate(                                  # <2>
  anx_data,                                                 # <3>
  sticsa_trait_cat = dplyr::case_when(                      # <4>
    sticsa_trait_score >= anxiety_cutoff ~ "clinical",      # <5>
    sticsa_trait_score < anxiety_cutoff ~ "non-clinical",   # <6>
    .default = NA                                           # <7>
  )
)
1. Optionally, create a new object, anxiety_cutoff, containing the cutoff value for separating clinical from non-clinical anxiety. (See the “MoRe About” box for more on this!)
2. Overwrite the anx_data object by making a change…
3. …to the anx_data dataset by…
4. Creating a new variable, sticsa_trait_cat, by applying the following rules:
5. For cases where the value of sticsa_trait_score is greater than or equal to anxiety_cutoff, assign the value “clinical” to sticsa_trait_cat.
6. For cases where the value of sticsa_trait_score is less than anxiety_cutoff, assign the value “non-clinical” to sticsa_trait_cat.
7. For cases that don’t match any of the preceding criteria, assign NA to sticsa_trait_cat.
You may have written something like this, which (in this case) will also work the same way:
anx_data <- dplyr::mutate(
  anx_data,
  sticsa_trait_cat = dplyr::case_when(
    sticsa_trait_score >= 2.047619 ~ "clinical",
    .default = "non-clinical"
  )
)
The issue with this code is that it’s slightly “unsafe”. What does that mean? Primarily, it means that the code might do something you don’t want without you realising it. This code will assign the value “clinical” for any value greater than or equal to the cutoff, which is fine, but will assign “non-clinical” to any other value. If a participant, for example, was so anxious that they hadn’t answered the questions, so they had a missing value (NA) for sticsa_trait_score, they would still be assigned “non-clinical” - which is probably not what we want!
The first solution above is “safer” because it instead explicitly defines the score range for the two groups, then assigns NA as the default, so if anyone refuses to answer or otherwise has a strange or missing value, they’ll be assigned a missing value instead of one of the two categories. This is a safeguard to prevent us introducing mistakes into our data.
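If you want to reassure yourself that the safer version behaved as intended (an optional check, not part of the original lab code), counting the categories will show how many cases landed in each group, with any NA rows on their own line:

```r
## How many cases in each category? NA rows (if any) appear separately
dplyr::count(anx_data, sticsa_trait_cat)
```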
In the case_when() code above, the cutoff value is stored in a new object, anxiety_cutoff, which is then used in the subsequent case_when() conditions. Why take this extra step?
This is a matter of style, since the output of this code would be entirely identical if I wrote the cutoff value into the case_when() assertions directly (e.g. sticsa_trait_score >= 2.047619). I have done it this way for a few reasons:
- The cutoff value is easy to find in the code, in case I need to remind myself which one I used, and it’s clearly named, so I know what it represents.
- The cutoff value only needs to be typed in once, rather than copy/pasted or typed out multiple times, which decreases the risk of typos or errors.
- Most importantly, it’s easy to change, in case I need to update it later. (Say, for example, a new paper is published that introduces a different cutoff value, and I want to use the new value instead.) I would only have to change the value in the anxiety_cutoff object once, at the beginning of the code chunk, and all of the subsequent code using that object would be similarly updated. Otherwise, I would have to find and replace all the places in my code where the number 2.047619 appears - which is a great way to miss something!
In short, using a clearly named object to store this value makes the code easier to navigate, more resilient to later updates, and more transparent in its meaning.
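To make that last point concrete, here’s a hedged sketch: suppose a (purely hypothetical) newer paper recommended a cutoff of 2.5. Only the one assignment at the top of the chunk would need to change before re-running the recoding code above:

```r
## Hypothetical update: a newer paper recommends a different cutoff
anxiety_cutoff <- 2.5

## Re-running the mutate()/case_when() chunk above now uses 2.5 everywhere
## anxiety_cutoff appears - no find-and-replace needed
```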
Relabel a Variable
The gender variable currently contains number codes. Let’s recode them into labels so that when we create models or visualisations, they’ll be easy to read.
For this task, we again want to apply different operations for different rows, so we use dplyr::case_when() to decide how to do this.
anx_data <- dplyr::mutate(
  anx_data,
  gender = dplyr::case_when(
    gender == 1 ~ "female",
    gender == 2 ~ "male",
    gender == 3 ~ "non-binary",
    .default = "other/pnts"
  )
)
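After running this, it’s worth confirming the relabelling worked (an extra check I’m suggesting, not part of the original lab instructions); the recoded column should now contain only the four expected labels:

```r
## Check the relabelled variable - only the four labels should appear
dplyr::count(anx_data, gender)
```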
Keep Correct Cases
We didn’t get to this in the live session, but it’s also review from last week. The values we want to filter on are from the Codebook. Make sure you understand why this command works, and let us know in a drop-in or practical if you’d like any help.
Finally, we need to keep only real cases that can ethically participate. These are cases where:
- distribution is “anonymous” (not “preview”)
- consent is “Yes” (not any other value)
- age is between 18 and a reasonable upper limit (e.g. 99)
anx_data_final <- dplyr::filter(
  anx_data,
  distribution == "anonymous" & consent == "Yes" & dplyr::between(age, 18, 99)
)
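As a final sanity check (again, a suggestion on top of the lab code rather than part of it), you can compare the number of rows before and after filtering and confirm that only admissible cases remain:

```r
## How many cases were removed?
nrow(anx_data)
nrow(anx_data_final)

## All remaining cases should be anonymous, consenting, and aged 18-99
dplyr::count(anx_data_final, distribution, consent)
summary(anx_data_final$age)
```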