Skills Lab 02: Data cleaning

Author

JM

You can access the Skills Lab project for Week 2 on Posit Cloud.

Check the Analysing Data Panopto page for recordings or the main Posit Cloud page for other materials.

Join the Google Doc during the live session here: https://rebrand.ly/and_skills_lab_02

Setup

Packages

library(tidyverse)

Revision: Logical Assertions

We’ll be making use again of logical assertions. These are statements that can be either TRUE or FALSE - they return logical data.

Get R to return TRUE and FALSE however you like.

## Any assertions will do!
20 > -20
[1] TRUE
"black" == "white"
[1] FALSE
## Or be cheeky about it...
TRUE
[1] TRUE
FALSE
[1] FALSE

Data

Run this code chunk to read in the data into your Environment.

anx_data <- readr::read_csv("data/anx_data.csv")

Codebook

Run this code chunk to open the Codebook in the Viewer tab on Posit Cloud, or open the Codebook in a browser here.

ricomisc::rstudio_viewer("skills_lab_02_codebook.html", "data")

Data Inspection

Inspect your dataset, and compare it to the Codebook. Identify any discrepancies between the two, using the Codebook as a guide for what your dataset should look like, how it needs to be cleaned, and what variables it should contain.

Then, make a list below of the steps you must take to clean or change your dataset so that it matches the Codebook. Consider questions like:

  • Are the variables the right type?
  • Do they contain the right information?
  • Do any variables need to be created or recoded?
  • Does the codebook mention any responses that should be excluded?

Note: For now, you can treat character data as categorical. (We’ll get into this more in second year!)

View the Dataset

Open the dataset in a new tab (recommended).

View(anx_data)

Check a quick summary of each variable.

summary(anx_data)
      id            distribution         consent              gender     
 Length:465         Length:465         Length:465         Min.   :1.000  
 Class :character   Class :character   Class :character   1st Qu.:1.000  
 Mode  :character   Mode  :character   Mode  :character   Median :1.000  
                                                          Mean   :1.647  
                                                          3rd Qu.:2.000  
                                                          Max.   :4.000  
      age         test_anxiety_score part_anxiety_score sticsa_trait_score
 Min.   : 16.00   Min.   :1.125      Min.   :1.000      Min.   :1.000     
 1st Qu.: 18.00   1st Qu.:2.875      1st Qu.:2.000      1st Qu.:1.714     
 Median : 19.00   Median :3.500      Median :2.545      Median :2.190     
 Mean   : 22.72   Mean   :3.421      Mean   :2.603      Mean   :2.208     
 3rd Qu.: 20.00   3rd Qu.:4.000      3rd Qu.:3.182      3rd Qu.:2.619     
 Max.   :230.00   Max.   :5.000      Max.   :5.000      Max.   :3.905     

To-Do List

  • Calculate a new variable: ac_anxiety_score is described in the Codebook but does not exist in the dataset. It should be created by taking the mean of the other two academic anxiety score variables.
  • Recode a variable: sticsa_trait_cat is described in the Codebook but does not exist in the dataset. It should be created by splitting the STICSA trait score into “clinical” and “non-clinical” categories using a cutoff from the literature.
  • Relabel a variable: gender is not formatted correctly (should be character, not numeric).
  • Keep correct cases: remove incorrect or ethically inadmissable cases.

Tasks

Calculate a New Variable

The variable ac_anxiety_score needs to be calculated from other variables in the dataset. Specifically, it is the mean of the test and participation anxiety scores. Since the mean is calculated as the sum of all values divided by the number of values, we can write this out as an expression using mutate().

dplyr::mutate(
  anx_data,
  ac_anxiety_score = (test_anxiety_score + part_anxiety_score)/2
)
# A tibble: 465 × 9
   id    distribution consent gender   age test_anxiety_score part_anxiety_score
   <chr> <chr>        <chr>    <dbl> <dbl>              <dbl>              <dbl>
 1 1U3uo preview      Yes          1    19               3.62               2.45
 2 o0nLV preview      Yes          1    19               1.38               1.09
 3 KQSrs preview      <NA>         2    21               4.12               3.27
 4 GwzPE preview      Yes          3    19               4.25               3.36
 5 VAzZ4 preview      Yes          1    19               3.12               2.18
 6 UJPCB preview      Yes          1    18               4.88               4.18
 7 nRG2C preview      <NA>         1    20               2.25               2.55
 8 UfHKg preview      Yes          4    18               4.88               3.36
 9 D09VK preview      Yes          1    20               4.38               2   
10 ulrFC preview      Yes          1    18               4                  2.55
# ℹ 455 more rows
# ℹ 2 more variables: sticsa_trait_score <dbl>, ac_anxiety_score <dbl>

Once we’re happy that the calculation has been done correctly, we should assign our changes to the dataset.

anx_data <- dplyr::mutate(
  anx_data,
  ac_anxiety_score = (test_anxiety_score + part_anxiety_score)/2
)

Recode a Variable

One of the variables that this dataset contains is called sticsa_trait_score, which is a measure of trait anxiety from the State-Trait Inventory for Cognitive and Somatic Anxiety (Ree et al., 2008). The Codebook also contains a categorical version of this variable, indicating whether or not participants display clinical levels of anxiety. We can recode the sticsa_trait_score variable into a new sticsa_trait_cat variable with two categories, “clinical” and “non-clinical”, using 2.047619 as the cutoff indicating clinical levels of anxiety (Van Dam et al., 2013).

For this task, we want to apply different operations for different rows, so we use dplyr::case_when() to decide how to do this.

1anxiety_cutoff <- 2.047619

2anx_data <- dplyr::mutate(
3  anx_data,
4    sticsa_trait_cat = dplyr::case_when(
5      sticsa_trait_score >= anxiety_cutoff ~ "clinical",
6      sticsa_trait_score < anxiety_cutoff ~ "non-clinical",
7      .default = NA
    )
  )
1
Optionally, create a new object, anxiety_cutoff, containing the cutoff value for separating clinical from non-clinical anxiety. (See the “MoRe About” box for more on this!)
2
Overwrite the anx_data object by making a change..
3
…To the anx_data dataset by…
4
Creating a new variable, sticsa_trait_cat, by applying the following rules:
5
For cases where the value of sticsa_trait_score is greater than or equal to anxiety_cutoff, assign the value “clinical” to sticsa_trait_cat
6
For cases where the value of sticsa_trait_score is less than anxiety_cutoff, assign the value “non-clinical” to sticsa_trait_cat
7
For cases that don’t match any of the preceding criteria, assign NA to sticsa_trait_cat
MoRe About: Splitting Variables

You may have written something like this, which (in this case) will also work the same way:

anx_data <- dplyr::mutate(
  anx_data,
    sticsa_trait_cat = dplyr::case_when(
      sticsa_trait_score >= 2.047619 ~ "clinical",
      .default = "non-clinical"
    )
  )

The issue with this code is that it’s slightly “unsafe”. What does that mean? Primarily, it means that the code might do something you don’t want without you realising it. This code will assign the value “clinical” for any value greater than or equal to the cutoff, which is fine, but will assign “non-clinical” to any other value. If a participant, for example, was so anxious that they hadn’t answered the questions, so they had a missing value (NA) for sticsa_trait_score, they would still be assigned “non-clinical” - which is probably not what we want!

The first solution above is “safer” because it instead explicitly defines the score range for the two groups, then assigns NA as the default, so if anyone refuses to answer or otherwise has a strange or missing value, they’ll be assigned a missing value instead of one of the two categories. This is a safeguard to prevent us introducing mistakes into our data.

In the code above, the cutoff value is stored in a new object, anxiety_cutoff, which is then used in the subsequent case_when() conditions. Why take this extra step?

This is a matter of style, since the output of this code would be entirely identical if I wrote the cutoff value into the case_when() assertions directly (e.g. sticsa_trait_score >= 2.047619). I have done it this way for a few reasons:

  1. The cutoff value is easy to find in the code, in case I need to remind myself which one I used, and it’s clearly named, so I know what it represents.
  2. The cutoff value only needs to be typed in once, rather than copy/pasted or typed out multiple times, which decreases the risk of typos or errors.
  3. Most importantly, it’s easy to change, in case I need to update it later. I would only have to change the value in the anxiety_cutoff object once, at the beginning of the code chunk, and all of the subsequent code using that object would be similarly updated.

In short, it makes the code easier to navigate, more resilient to later updates, and more transparent in its meaning.

Relabel a Variable

The gender variable currently contains number codes. Let’s recode them into labels so that when we create models or visualisations, they’ll be easy to read.

For this task, we again want to apply different operations for different rows, so we use dplyr::case_when() to decide how to do this.

dplyr::mutate(
  anx_data,
  gender = dplyr::case_when(
    gender == 1 ~ "female",
    gender == 2 ~ "male",
    gender == 3 ~ "non-binary",
    .default = "other/pnts"
  )
)
# A tibble: 465 × 10
   id    distribution consent gender   age test_anxiety_score part_anxiety_score
   <chr> <chr>        <chr>   <chr>  <dbl>              <dbl>              <dbl>
 1 1U3uo preview      Yes     female    19               3.62               2.45
 2 o0nLV preview      Yes     female    19               1.38               1.09
 3 KQSrs preview      <NA>    male      21               4.12               3.27
 4 GwzPE preview      Yes     non-b…    19               4.25               3.36
 5 VAzZ4 preview      Yes     female    19               3.12               2.18
 6 UJPCB preview      Yes     female    18               4.88               4.18
 7 nRG2C preview      <NA>    female    20               2.25               2.55
 8 UfHKg preview      Yes     other…    18               4.88               3.36
 9 D09VK preview      Yes     female    20               4.38               2   
10 ulrFC preview      Yes     female    18               4                  2.55
# ℹ 455 more rows
# ℹ 3 more variables: sticsa_trait_score <dbl>, ac_anxiety_score <dbl>,
#   sticsa_trait_cat <chr>

Keep Correct Cases

Finally, we need to keep only real cases that can ethically participate. These are cases that:

  • distribution is “anonymous” (not “preview”)
  • consent is “Yes” (not any other value)
  • age is between 18 and a reasonable upper limit (e.g. 99)
anx_data_final <- dplyr::filter(
  anx_data,
  distribution == "anonymous" & consent == "Yes" & dplyr::between(age, 18, 99)
)

Kahoot! Time