library(tidyverse)
Skills Lab 02: Data cleaning
You can access the Skills Lab project for Week 2 on Posit Cloud.
Check the Analysing Data Panopto page for recordings or the main Posit Cloud page for other materials.
Join the Google Doc during the live session here: https://rebrand.ly/and_skills_lab_02
Setup
Packages
Revision: Logical Assertions
We’ll be making use again of logical assertions. These are statements that can be either TRUE
or FALSE
- they return logical data.
Get R to return TRUE
and FALSE
however you like.
## Any assertions will do!
20 > -20
[1] TRUE
"black" == "white"
[1] FALSE
## Or be cheeky about it...
TRUE
[1] TRUE
FALSE
[1] FALSE
Data
Run this code chunk to read in the data into your Environment.
<- readr::read_csv("data/anx_data.csv") anx_data
Codebook
Run this code chunk to open the Codebook in the Viewer tab on Posit Cloud, or open the Codebook in a browser here.
::rstudio_viewer("skills_lab_02_codebook.html", "data") ricomisc
Data Inspection
Inspect your dataset, and compare it to the Codebook. Identify any discrepancies between the two, using the Codebook as a guide for what your dataset should look like, how it needs to be cleaned, and what variables it should contain.
Then, make a list below of the steps you must take to clean or change your dataset so that it matches the Codebook. Consider questions like:
- Are the variables the right type?
- Do they contain the right information?
- Do any variables need to be created or recoded?
- Does the codebook mention any responses that should be excluded?
Note: For now, you can treat character data as categorical. (We’ll get into this more in second year!)
View the Dataset
Open the dataset in a new tab (recommended).
View(anx_data)
Check a quick summary of each variable.
summary(anx_data)
id distribution consent gender
Length:465 Length:465 Length:465 Min. :1.000
Class :character Class :character Class :character 1st Qu.:1.000
Mode :character Mode :character Mode :character Median :1.000
Mean :1.647
3rd Qu.:2.000
Max. :4.000
age test_anxiety_score part_anxiety_score sticsa_trait_score
Min. : 16.00 Min. :1.125 Min. :1.000 Min. :1.000
1st Qu.: 18.00 1st Qu.:2.875 1st Qu.:2.000 1st Qu.:1.714
Median : 19.00 Median :3.500 Median :2.545 Median :2.190
Mean : 22.72 Mean :3.421 Mean :2.603 Mean :2.208
3rd Qu.: 20.00 3rd Qu.:4.000 3rd Qu.:3.182 3rd Qu.:2.619
Max. :230.00 Max. :5.000 Max. :5.000 Max. :3.905
To-Do List
- Calculate a new variable:
ac_anxiety_score
is described in the Codebook but does not exist in the dataset. It should be created by taking the mean of the other two academic anxiety score variables. - Recode a variable:
sticsa_trait_cat
is described in the Codebook but does not exist in the dataset. It should be created by splitting the STICSA trait score into “clinical” and “non-clinical” categories using a cutoff from the literature. - Relabel a variable:
gender
is not formatted correctly (should be character, not numeric). - Keep correct cases: remove incorrect or ethically inadmissable cases.
Tasks
Calculate a New Variable
The variable ac_anxiety_score
needs to be calculated from other variables in the dataset. Specifically, it is the mean of the test and participation anxiety scores. Since the mean is calculated as the sum of all values divided by the number of values, we can write this out as an expression using mutate()
.
::mutate(
dplyr
anx_data,ac_anxiety_score = (test_anxiety_score + part_anxiety_score)/2
)
# A tibble: 465 × 9
id distribution consent gender age test_anxiety_score part_anxiety_score
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1U3uo preview Yes 1 19 3.62 2.45
2 o0nLV preview Yes 1 19 1.38 1.09
3 KQSrs preview <NA> 2 21 4.12 3.27
4 GwzPE preview Yes 3 19 4.25 3.36
5 VAzZ4 preview Yes 1 19 3.12 2.18
6 UJPCB preview Yes 1 18 4.88 4.18
7 nRG2C preview <NA> 1 20 2.25 2.55
8 UfHKg preview Yes 4 18 4.88 3.36
9 D09VK preview Yes 1 20 4.38 2
10 ulrFC preview Yes 1 18 4 2.55
# ℹ 455 more rows
# ℹ 2 more variables: sticsa_trait_score <dbl>, ac_anxiety_score <dbl>
Once we’re happy that the calculation has been done correctly, we should assign our changes to the dataset.
<- dplyr::mutate(
anx_data
anx_data,ac_anxiety_score = (test_anxiety_score + part_anxiety_score)/2
)
Recode a Variable
One of the variables that this dataset contains is called sticsa_trait_score
, which is a measure of trait anxiety from the State-Trait Inventory for Cognitive and Somatic Anxiety (Ree et al., 2008). The Codebook also contains a categorical version of this variable, indicating whether or not participants display clinical levels of anxiety. We can recode the sticsa_trait_score
variable into a new sticsa_trait_cat
variable with two categories, “clinical” and “non-clinical”, using 2.047619 as the cutoff indicating clinical levels of anxiety (Van Dam et al., 2013).
For this task, we want to apply different operations for different rows, so we use dplyr::case_when()
to decide how to do this.
1<- 2.047619
anxiety_cutoff
2<- dplyr::mutate(
anx_data 3
anx_data,4sticsa_trait_cat = dplyr::case_when(
5>= anxiety_cutoff ~ "clinical",
sticsa_trait_score 6< anxiety_cutoff ~ "non-clinical",
sticsa_trait_score 7.default = NA
) )
- 1
-
Optionally, create a new object,
anxiety_cutoff
, containing the cutoff value for separating clinical from non-clinical anxiety. (See the “MoRe About” box for more on this!) - 2
-
Overwrite the
anx_data
object by making a change.. - 3
-
…To the
anx_data
dataset by… - 4
-
Creating a new variable,
sticsa_trait_cat
, by applying the following rules: - 5
-
For cases where the value of
sticsa_trait_score
is greater than or equal toanxiety_cutoff
, assign the value “clinical” tosticsa_trait_cat
- 6
-
For cases where the value of
sticsa_trait_score
is less thananxiety_cutoff
, assign the value “non-clinical” tosticsa_trait_cat
- 7
-
For cases that don’t match any of the preceding criteria, assign
NA
tosticsa_trait_cat
You may have written something like this, which (in this case) will also work the same way:
<- dplyr::mutate(
anx_data
anx_data,sticsa_trait_cat = dplyr::case_when(
>= 2.047619 ~ "clinical",
sticsa_trait_score .default = "non-clinical"
) )
The issue with this code is that it’s slightly “unsafe”. What does that mean? Primarily, it means that the code might do something you don’t want without you realising it. This code will assign the value “clinical” for any value greater than or equal to the cutoff, which is fine, but will assign “non-clinical” to any other value. If a participant, for example, was so anxious that they hadn’t answered the questions, so they had a missing value (NA
) for sticsa_trait_score
, they would still be assigned “non-clinical” - which is probably not what we want!
The first solution above is “safer” because it instead explicitly defines the score range for the two groups, then assigns NA
as the default, so if anyone refuses to answer or otherwise has a strange or missing value, they’ll be assigned a missing value instead of one of the two categories. This is a safeguard to prevent us introducing mistakes into our data.
In the code above, the cutoff value is stored in a new object, anxiety_cutoff
, which is then used in the subsequent case_when()
conditions. Why take this extra step?
This is a matter of style, since the output of this code would be entirely identical if I wrote the cutoff value into the case_when()
assertions directly (e.g. sticsa_trait_score >= 2.047619
). I have done it this way for a few reasons:
- The cutoff value is easy to find in the code, in case I need to remind myself which one I used, and it’s clearly named, so I know what it represents.
- The cutoff value only needs to be typed in once, rather than copy/pasted or typed out multiple times, which decreases the risk of typos or errors.
- Most importantly, it’s easy to change, in case I need to update it later. I would only have to change the value in the
anxiety_cutoff
object once, at the beginning of the code chunk, and all of the subsequent code using that object would be similarly updated.
In short, it makes the code easier to navigate, more resilient to later updates, and more transparent in its meaning.
Relabel a Variable
The gender
variable currently contains number codes. Let’s recode them into labels so that when we create models or visualisations, they’ll be easy to read.
For this task, we again want to apply different operations for different rows, so we use dplyr::case_when()
to decide how to do this.
::mutate(
dplyr
anx_data,gender = dplyr::case_when(
== 1 ~ "female",
gender == 2 ~ "male",
gender == 3 ~ "non-binary",
gender .default = "other/pnts"
) )
# A tibble: 465 × 10
id distribution consent gender age test_anxiety_score part_anxiety_score
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1U3uo preview Yes female 19 3.62 2.45
2 o0nLV preview Yes female 19 1.38 1.09
3 KQSrs preview <NA> male 21 4.12 3.27
4 GwzPE preview Yes non-b… 19 4.25 3.36
5 VAzZ4 preview Yes female 19 3.12 2.18
6 UJPCB preview Yes female 18 4.88 4.18
7 nRG2C preview <NA> female 20 2.25 2.55
8 UfHKg preview Yes other… 18 4.88 3.36
9 D09VK preview Yes female 20 4.38 2
10 ulrFC preview Yes female 18 4 2.55
# ℹ 455 more rows
# ℹ 3 more variables: sticsa_trait_score <dbl>, ac_anxiety_score <dbl>,
# sticsa_trait_cat <chr>
Keep Correct Cases
Finally, we need to keep only real cases that can ethically participate. These are cases that:
distribution
is “anonymous” (not “preview”)consent
is “Yes” (not any other value)age
is between 18 and a reasonable upper limit (e.g. 99)
<- dplyr::filter(
anx_data_final
anx_data,== "anonymous" & consent == "Yes" & dplyr::between(age, 18, 99)
distribution )