Skills Lab 02: Data cleaning

Author

JM

Setup

Packages

library(tidyverse)

Revision: Logical Assertions

We’ll be making use again of logical assertions. These are statements that can be either TRUE or FALSE - they return logical data.

Get R to return TRUE and FALSE however you like.

## Any assertions will do!
20 > -20
[1] TRUE
"black" == "white"
[1] FALSE
## Or be cheeky about it...
TRUE
[1] TRUE
FALSE
[1] FALSE

Data

Run this code chunk to read the data into your Environment.

anx_data <- readr::read_csv("data/anx_data.csv")

Codebook

Run this code chunk to open the Codebook in the Viewer tab on Posit Cloud, or open the Codebook in a browser here.

ricomisc::rstudio_viewer("skills_lab_02_codebook.html", "data")

Data Inspection

Inspect your dataset, and compare it to the Codebook. Identify any discrepancies between the two, using the Codebook as a guide for what your dataset should look like, how it needs to be cleaned, and what variables it should contain.

Then, make a list below of the steps you must take to clean or change your dataset so that it matches the Codebook. Consider questions like:

  • Are the variables the right type?
  • Do they contain the right information?
  • Do any variables need to be created or recoded?
  • Does the codebook mention any responses that should be excluded?

Note: For now, you can treat character data as categorical. (We’ll get into this more in second year!)

View the Dataset

Open the dataset in a new tab (recommended).

View(anx_data)

Check a quick summary of each variable.

summary(anx_data)
      id            distribution         consent              gender     
 Length:465         Length:465         Length:465         Min.   :1.000  
 Class :character   Class :character   Class :character   1st Qu.:1.000  
 Mode  :character   Mode  :character   Mode  :character   Median :1.000  
                                                          Mean   :1.647  
                                                          3rd Qu.:2.000  
                                                          Max.   :4.000  
      age         test_anxiety_score part_anxiety_score sticsa_trait_score
 Min.   : 16.00   Min.   :1.125      Min.   :1.000      Min.   :1.000     
 1st Qu.: 18.00   1st Qu.:2.875      1st Qu.:2.000      1st Qu.:1.714     
 Median : 19.00   Median :3.500      Median :2.545      Median :2.190     
 Mean   : 22.72   Mean   :3.421      Mean   :2.603      Mean   :2.208     
 3rd Qu.: 20.00   3rd Qu.:4.000      3rd Qu.:3.182      3rd Qu.:2.619     
 Max.   :230.00   Max.   :5.000      Max.   :5.000      Max.   :3.905     
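
Alongside summary(), it can help to tabulate the distinct values of the character variables and compare them against the levels listed in the Codebook. A quick sketch, assuming the columns shown in the summary above:

```{r}
## Tabulate the distinct values of each categorical variable,
## to compare against the levels listed in the Codebook
dplyr::count(anx_data, distribution)
dplyr::count(anx_data, consent)
dplyr::count(anx_data, gender)
```

Unexpected values here (for example, number codes where the Codebook describes labels) are exactly the kind of discrepancy to add to your list.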

To-Do List

  • Calculate a new variable: ac_anxiety_score is described in the Codebook but does not exist in the dataset. It should be created by taking the mean of the other two academic anxiety score variables.
  • Recode a variable: sticsa_trait_cat is described in the Codebook but does not exist in the dataset. It should be created by splitting the STICSA trait score into “clinical” and “non-clinical” categories using a cutoff from the literature.
  • Relabel a variable: gender is not formatted correctly (should be character, not numeric).
  • Keep correct cases: remove incorrect or ethically inadmissible cases.

In the session, I mentioned “keeping a record of your code”, or “documenting code”. What does this mean?

There are two different things this can refer to:

First, although it seems obvious, writing down all of your code in full and in order. Especially when you are trying to solve a thorny problem or debug a baffling error, it’s easy and common to try lots of different solutions and fixes before landing on a good resolution. It’s really essential to clean up your code at this point, remove any redundant code and dead ends, and ensure the full and correct solution is included in your document so that it works seamlessly with the rest of your code. If you don’t do this, you can end up in the situation I described: I’d somehow introduced a mistake in my data that I couldn’t track down, because I hadn’t recorded the code I used. I simply had to throw out all the work I’d done, go back to before the problem, and work through all the steps again. Learn from my mistakes!

Second, annotating your code with comments. This is a massive favour to Future You as well as anyone else who might want to understand or use your code in the future. Your commenting can be as detailed as you like, but keep in mind that even though you may understand your code now, it won’t be so obvious even a day or week later. Personally, I typically comment each code section, so I know the overall purpose and result of each major bit of code; and I also often put comments on particular lines that are difficult to read or understand quickly, so that I don’t have to puzzle through what they mean later.

Here’s a genuine example, with real comments, from some code I wrote recently to create pages on Canvas with R. (Don’t worry about understanding what the code does - this is only about the comments!) The first comment ## Figure out which pages... tells me what the subsequent commands (are meant to) do, and the in-line comments tell me about individual lines of code and why I included them. This is often because I had to make changes to the code I wrote at first to solve unexpected problems, and I want to remember why that was necessary to make things easier in the future.

```{r}
## Figure out which pages should be published/generated
pages <- cnvs::get_pages(module_id)

these_pages <- term_times |>
  dplyr::rowwise() |> 
  ## Only generate pages that should be published AND don't already exist
  dplyr::filter(publish_page == TRUE &
                  !any(grepl(page_url, gsub("-", "_", pages$url)))) ## gsub because Canvas uses hyphens
```

Tasks

Calculate a New Variable

The variable ac_anxiety_score needs to be calculated from other variables in the dataset. Specifically, it is the mean of the test and participation anxiety scores. Since the mean is calculated as the sum of all values divided by the number of values, we can write this out as an expression using mutate().

dplyr::mutate(
  anx_data,
  ac_anxiety_score = (test_anxiety_score + part_anxiety_score)/2
)
# A tibble: 465 × 9
   id    distribution consent gender   age test_anxiety_score part_anxiety_score
   <chr> <chr>        <chr>    <dbl> <dbl>              <dbl>              <dbl>
 1 1U3uo preview      Yes          1    19               3.62               2.45
 2 o0nLV preview      Yes          1    19               1.38               1.09
 3 KQSrs preview      <NA>         2    21               4.12               3.27
 4 GwzPE preview      Yes          3    19               4.25               3.36
 5 VAzZ4 preview      Yes          1    19               3.12               2.18
 6 UJPCB preview      Yes          1    18               4.88               4.18
 7 nRG2C preview      <NA>         1    20               2.25               2.55
 8 UfHKg preview      Yes          4    18               4.88               3.36
 9 D09VK preview      Yes          1    20               4.38               2   
10 ulrFC preview      Yes          1    18               4                  2.55
# ℹ 455 more rows
# ℹ 2 more variables: sticsa_trait_score <dbl>, ac_anxiety_score <dbl>

Once we’re happy that the calculation has been done correctly, we should assign our changes to the dataset. To do this, we add anx_data <- at the very beginning of our command. This will overwrite the existing anx_data dataset with the new version we created that includes the new ac_anxiety_score variable.

anx_data <- dplyr::mutate(
  anx_data,
  ac_anxiety_score = (test_anxiety_score + part_anxiety_score)/2
)
Important

This last assigning step is essential, but easy to forget! If you don’t assign your changes to anything, they won’t take effect in your dataset.

For every data wrangling task like this one:

  • Identify what needs doing
  • Write code to do it
  • Check the output carefully to confirm your code does exactly and only what you wanted
  • Assign the output to “save” your changes.
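
For the checking step, one quick sketch (not the only way!) is to summarise the new variable and eyeball it alongside the variables it was calculated from:

```{r}
## Check the new variable's range is plausible (here, between 1 and 5)
summary(anx_data$ac_anxiety_score)

## Eyeball the component scores alongside the new mean
dplyr::select(anx_data, test_anxiety_score, part_anxiety_score, ac_anxiety_score)
```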

Recode a Variable

One of the variables that this dataset contains is called sticsa_trait_score, which is a measure of trait anxiety from the State-Trait Inventory for Cognitive and Somatic Anxiety (Ree et al., 2008). The Codebook also contains a categorical version of this variable, indicating whether or not participants display clinical levels of anxiety. We can recode the sticsa_trait_score variable into a new sticsa_trait_cat variable with two categories, “clinical” and “non-clinical”, using 2.047619 as the cutoff indicating clinical levels of anxiety (Van Dam et al., 2013).

For this task, we want to apply different operations for different rows, so we use dplyr::case_when() to decide how to do this.

anxiety_cutoff <- 2.047619                                 ## (1)

anx_data <- dplyr::mutate(                                 ## (2)
  anx_data,                                                ## (3)
  sticsa_trait_cat = dplyr::case_when(                     ## (4)
    sticsa_trait_score >= anxiety_cutoff ~ "clinical",     ## (5)
    sticsa_trait_score < anxiety_cutoff ~ "non-clinical",  ## (6)
    .default = NA                                          ## (7)
  )
)

  1. Optionally, create a new object, anxiety_cutoff, containing the cutoff value for separating clinical from non-clinical anxiety. (See the “MoRe About” box for more on this!)
  2. Overwrite the anx_data object by making a change…
  3. …to the anx_data dataset by…
  4. Creating a new variable, sticsa_trait_cat, by applying the following rules:
  5. For cases where the value of sticsa_trait_score is greater than or equal to anxiety_cutoff, assign the value “clinical” to sticsa_trait_cat
  6. For cases where the value of sticsa_trait_score is less than anxiety_cutoff, assign the value “non-clinical” to sticsa_trait_cat
  7. For cases that don’t match any of the preceding criteria, assign NA to sticsa_trait_cat
MoRe About: Splitting Variables

You may have written something like this, which (in this case) will also work the same way:

anx_data <- dplyr::mutate(
  anx_data,
  sticsa_trait_cat = dplyr::case_when(
    sticsa_trait_score >= 2.047619 ~ "clinical",
    .default = "non-clinical"
  )
)

The issue with this code is that it’s slightly “unsafe”. What does that mean? Primarily, it means that the code might do something you don’t want without you realising it. This code will assign the value “clinical” for any value greater than or equal to the cutoff, which is fine, but will assign “non-clinical” to any other value. If a participant, for example, was so anxious that they hadn’t answered the questions, so they had a missing value (NA) for sticsa_trait_score, they would still be assigned “non-clinical” - which is probably not what we want!

The first solution above is “safer” because it instead explicitly defines the score range for the two groups, then assigns NA as the default, so if anyone refuses to answer or otherwise has a strange or missing value, they’ll be assigned a missing value instead of one of the two categories. This is a safeguard to prevent us introducing mistakes into our data.
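
To see the difference concretely, here is a minimal sketch using a made-up scores vector (the values are invented for illustration) containing one missing value:

```{r}
library(dplyr)

anxiety_cutoff <- 2.047619
## A hypothetical toy vector: two valid scores and one missing value
scores <- c(1.5, 2.5, NA)

## "Unsafe" version: the NA falls through to the default,
## so the missing score comes out as "non-clinical"
case_when(
  scores >= anxiety_cutoff ~ "clinical",
  .default = "non-clinical"
)

## "Safer" version: the NA doesn't match either condition,
## so the missing score stays missing
case_when(
  scores >= anxiety_cutoff ~ "clinical",
  scores < anxiety_cutoff ~ "non-clinical",
  .default = NA
)
```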

In the code above, the cutoff value is stored in a new object, anxiety_cutoff, which is then used in the subsequent case_when() conditions. Why take this extra step?

This is a matter of style, since the output of this code would be entirely identical if I wrote the cutoff value into the case_when() assertions directly (e.g. sticsa_trait_score >= 2.047619). I have done it this way for a few reasons:

  1. The cutoff value is easy to find in the code, in case I need to remind myself which one I used, and it’s clearly named, so I know what it represents.
  2. The cutoff value only needs to be typed in once, rather than copy/pasted or typed out multiple times, which decreases the risk of typos or errors.
  3. Most importantly, it’s easy to change, in case I need to update it later. (Say, for example, a new paper is published that introduces a different cutoff value, and I want to use the new value instead.) I would only have to change the value in the anxiety_cutoff object once, at the beginning of the code chunk, and all of the subsequent code using that object would be similarly updated. Otherwise, I would have to find and replace all the places in my code where the number 2.047619 appears - which is a great way to miss something!

In short, using a clearly named object to store this value makes the code easier to navigate, more resilient to later updates, and more transparent in its meaning.

Relabel a Variable

The gender variable currently contains number codes. Let’s recode them into labels so that when we create models or visualisations, they’ll be easy to read.

For this task, we again want to apply different operations for different rows, so we use dplyr::case_when() to decide how to do this.

anx_data <- dplyr::mutate(
  anx_data,
  gender = dplyr::case_when(
    gender == 1 ~ "female",
    gender == 2 ~ "male",
    gender == 3 ~ "non-binary",
    .default = "other/pnts"
  )
)
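
As always, check the output carefully. One quick way to verify the recoding (a sketch, assuming the four codes in the Codebook) is to count the categories:

```{r}
## Check the recoding: each label should appear, with no stray number codes
dplyr::count(anx_data, gender)
```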

Keep Correct Cases

Note

We didn’t get to this in the live session, but it’s also review from last week. The values we want to filter on are from the Codebook. Make sure you understand why this command works, and let us know in a drop-in or practical if you’d like any help.

Finally, we need to keep only real cases that can ethically participate. These are cases that:

  • distribution is “anonymous” (not “preview”)
  • consent is “Yes” (not any other value)
  • age is between 18 and a reasonable upper limit (e.g. 99)

anx_data_final <- dplyr::filter(
  anx_data,
  distribution == "anonymous" & consent == "Yes" & dplyr::between(age, 18, 99)
)
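
Because we assigned the result to a new object, anx_data_final, we can compare it against the original to see how many cases were excluded. A quick sketch:

```{r}
## Check how many cases were excluded by comparing row counts
nrow(anx_data)        ## before filtering
nrow(anx_data_final)  ## after filtering
```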

Kahoot! Time