Tutorial 01: Logical Assertions and Filtering

Overview

In the first part of this module, we’re going to practice in depth a few skills that you’ve already seen in Psychology as a Science, and encourage you to expand those skills a little. In the first week, we’re going to do a deep dive on filtering data. To do that, we’ll work step by step through logical assertions, which are the crucial foundation of filtering successfully and accurately.

Why Filtering?

We’re starting on logical assertions and filtering for a few reasons:

Logical assertions will come up over and over; it’s a core skill for working with data.
Filtering data is an essential element of data cleaning and wrangling.
We’re also practicing some important peripheral skills, like assigning to objects, working with vectors, and learning about logical data.

Logical Data and Assertions

Before we jump into filter()ing, we’re going to take a step back and start by exploring logical assertions.

I’ve been throwing around this term, “logical assertions”, but I haven’t clearly explained what I mean by that phrase. So, here’s the first Vocab box; we’ll use these to highlight words or terms that you might not be familiar with, or that have a different meaning in the context of R or data.

Vocab: Logical Assertions

Logical assertions make a claim that can be either true or false. This claim can be very simple, or very complicated.

When we “evaluate”, or run, the assertion, R will return either TRUE or FALSE, in the same way that it will return the result of a mathematical operation. These two words are special symbols in R - they aren’t ordinary words, they’re values. (The letters T and F work as shorthand for them, but unlike TRUE and FALSE they are not reserved and can be overwritten, so the full words are the safer choice.) These two values constitute logical data in R.

In these tutorials you’ll sometimes see “MoRe About” boxes like the one below. The content in these boxes is optional - it won’t be necessary in any of your assessments. We include this extra info for the curious and keen, as it’ll help you understand the why and how of R better; but it isn’t essential.

MoRe About: Logical Data

It might strike you as a bit odd that the words TRUE and FALSE (and letters T and F) are values in R. It’s worth getting used to this idea, because we’ll be seeing a lot about these values in a moment!

First, let’s check that T and F do in fact mean TRUE and FALSE, by simply running them as code in a code chunk.

TRUE

[1] TRUE

T

[1] TRUE

FALSE

[1] FALSE

F

[1] FALSE

You might notice that when we type them in a code chunk, TRUE/T and FALSE/F turn a different colour to indicate that they’re values in R.

By contrast, other words and single letters are not special values - this is specific to TRUE/T and FALSE/F, so running another letter, like R, just produces an “object not found” error. Since R is case-sensitive, other spellings such as true or True don’t have any special value either. Finally, using “double quotes” turns any of these symbols into a string, and they no longer have their special true/false value.

R

Error in eval(expr, envir, enclos): object 'R' not found

true

Error in eval(expr, envir, enclos): object 'true' not found

True

Error in eval(expr, envir, enclos): object 'True' not found

"FALSE"

[1] "FALSE"

In short, to use logical data, you MUST use TRUE/T and FALSE/F without quotes, in all caps, only.

I said a moment ago that TRUE/T and FALSE/F are colour-coded in R to indicate that they are special values. We can go a step beyond that - not only are they special values, they can also be converted into numbers. Specifically, TRUE/T is 1, and FALSE/F is 0. This even lets us do maths with them. Remember that TRUE is 1, so we can add two TRUEs together:

TRUE + TRUE

[1] 2

If you think there’s some sort of trickery there, we can ask R whether it’s the case that TRUE is exactly equal to 1:

TRUE == 1

[1] TRUE

And if we really push the boundaries, we can do just about any maths we like:

(TRUE + TRUE + TRUE) ^ (TRUE + TRUE)

[1] 9

Now, this last example is a little silly - this isn’t the sort of thing we’ll ever need to do on this course. But it does illustrate that you can force, or coerce, logical data into numeric data. This can be very useful, for example, in counting how many TRUEs you have.
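As a small sketch of that counting idea: because TRUE counts as 1 and FALSE counts as 0, sum() gives the number of TRUEs and mean() gives the proportion of TRUEs. The object names and values here are made up purely for illustration.

```r
# A made-up logical vector (hypothetical data, not from the tutorial dataset)
passed <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# TRUE is treated as 1 and FALSE as 0, so sum() counts the TRUEs
n_passed <- sum(passed)
n_passed
# [1] 3

# mean() gives the proportion of values that are TRUE
prop_passed <- mean(passed)
prop_passed
# [1] 0.6
```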

Feel free to experiment further with the characteristics of logical data. When you encounter a new feature of R, it’s always a good idea to play with it a bit and see what happens in different situations. Don’t be afraid of getting errors - that’s just part of the process. Have fun!

If that’s clear as mud, let’s try producing some logical data to get the hang of assertions.

Task 1

Evaluate the following assertions in R.

34 is greater than 10
3000 is less than or equal to 42
0 does not equal 1
2 equals 4

Hint

These may cause you some trouble if the notation is unfamiliar.

For “less than or equal to”, R won’t recognise the ≤ symbol. Instead, we combine two operators, “less than” < and “equal to” =, in the same order we’d normally read them aloud. The same goes for “greater than or equal to”, >=. (It does have to be this way round; try =< and => to see what happens.)

For “does not equal”, ! is common notation in R for “not”, or the reverse of something. So != can be read as “not-equals”. (See what happens if you run !TRUE in a code chunk.)

For “equals”, if you tried this with a single equals sign, you would have got a strange error:

2 = 4

Error in 2 = 4: invalid (do_set) left-hand side to assignment

The problem is that the single equals sign =, like the comma, has some very specialised syntactic uses, including one equivalent to the assignment operator <-. Single equals = also has an important and specific role to play in function arguments. In short, = is a special operator that doesn’t assert that two things are equal. Instead, “exactly equals” in R is “double-equals” (or “exactly and only”), ==.
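To see the difference side by side, here is a minimal sketch. The object name x is hypothetical and only used for this illustration.

```r
# Single equals ASSIGNS: this stores the value 4 in an object called x
# (x is a made-up name used only for this example)
x = 4

# Double equals ASSERTS: is the value stored in x equal to 4?
x == 4
# [1] TRUE

# ... and is it equal to 2?
x == 2
# [1] FALSE
```

In other words, x = 4 changes what x contains, whereas x == 4 only asks a question about it.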

Solution

Write each assertion on a separate line. R will return a single TRUE or FALSE for each one.

34 > 10

[1] TRUE

3000 <= 42

[1] FALSE

0 != 1

[1] TRUE

2 == 4

[1] FALSE

Vectorised Assertions

What about assertions for more than one number at once? Evaluating individual numbers, as we’ve just done, is fine - and is sometimes very helpful! - but we often have a bunch of numbers, like reaction times on a button-pressing task or ratings on a personality scale, that we might want to evaluate. Doing that one number at a time for 100 or 1000 responses would be really tedious, so instead we’ll take advantage of a feature of R, called vectorisation. To do that, we first need to have a look at one of the basic units of storing data in R: the vector.

Vocab: Vectors

Vectors are collections of elements in R. For example, we can produce a vector of the numbers 1 through 20 using the “through” operator : like this:

1:20

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

We can also collect (or combine, or concatenate) elements of our choice into a vector with c():

c(1:5, 27, 21250, 839:830)

 [1]     1     2     3     4     5    27 21250   839   838   837   836   835
[13]   834   833   832   831   830

All of the elements of a vector must be the same type of data. Different types will be coerced, or forced, into the most general data type. Here, all the elements have been coerced to strings (note the double quotes!).

c(15, "cat", TRUE)

[1] "15"   "cat"  "TRUE"
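As a rough sketch of what “most general” means here: logical data can be turned into numbers, and both can be turned into strings, so the result depends on what you mix together. The object names below are made up for illustration; class() is a base R function that reports the type of data in an object.

```r
# Logical + numeric: the logical values are coerced to numbers
mixed_num <- c(15, TRUE, FALSE)
mixed_num
# [1] 15  1  0
class(mixed_num)
# [1] "numeric"

# Add a string and everything is coerced to a string
mixed_chr <- c(15, "cat", TRUE)
class(mixed_chr)
# [1] "character"

# Only logical values: nothing needs to be coerced
class(c(TRUE, FALSE))
# [1] "logical"
```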

Remember, we wanted to evaluate lots of logical assertions at once. As an example, let’s evaluate whether each of the numbers 130 through 140 is greater than or equal to 135. To do this, we can write our assertion like this:

130:140 >= 135

 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Instead of having to write an assertion for each number - 130 >= 135, 131 >= 135, 132 >= 135… - we can write an assertion using a vector, and R will evaluate the assertion for each element of the vector. We first create the vector 130:140, then make an assertion, >= 135, that applies to each element.

The output is also a vector, but this time a vector of FALSEs and TRUEs that correspond to each element of the input vector. So the first FALSE is the result of 130 >= 135, the second FALSE corresponds to 131 >= 135, and so on.
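If it helps to see that correspondence laid out, one possible way is to put each input value next to its result. This is only an illustrative sketch; the object and column names are made up, and data.frame() is just used here as a convenient way to line things up.

```r
# Store the input vector and the result of the assertion
values <- 130:140
results <- values >= 135

# Put each input next to its result to see how they line up
data.frame(value = values, at_least_135 = results)

# One result for every input value
length(values) == length(results)
# [1] TRUE
```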

Why might this be useful? As a quick example, imagine that you’ve collected data for a study and want to identify participants to remove based on age. To make sure you meet your ethical requirements, you must remove anyone who is too young to consent. This is such a common task that it’s worth having a go now!

Task 2

First, store the vector of participant ages below in a new object called ages.

c(18, 34, 57, 19, 21, 22, 16, 48, 26, 22, 18, 17, 18, 18, 20)

Next, write an assertion that returns TRUE for participants in ages who are at or above the ethical age of consent to participate as adults, and FALSE for those who are too young and must be removed.

Remember that you can copy code from a code chunk by hovering over the chunk and clicking the clipboard icon that appears on the right side.

Hint

For the ethical age of consent, use the age of majority in England.

You’ll need two lines of code for this task. For the first, use the assignment operator <- to store the vector of numbers as the object ages. For the second, write a logical assertion about that object.

Solution

The age of majority in England is 18, which we will use as our cutoff.

ages <- c(18, 34, 57, 19, 21, 22, 16, 48, 26, 22, 18, 17, 18, 18, 20)

ages >= 18

 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
[13]  TRUE  TRUE  TRUE

As before, we get a vector of TRUEs and FALSEs, corresponding to each element in the original ages vector. So, the two FALSEs correspond to the values of 16 and 17 in the original vector.
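As an optional extra, here is a sketch of two things you could do with that logical vector. The name old_enough is made up for this example, and the square-bracket subsetting at the end is a base R technique that this tutorial doesn’t otherwise cover, so treat it as a preview of the idea behind filtering rather than something you need to know.

```r
ages <- c(18, 34, 57, 19, 21, 22, 16, 48, 26, 22, 18, 17, 18, 18, 20)

# Store the result of the assertion (old_enough is a hypothetical name)
old_enough <- ages >= 18

# TRUE counts as 1, so sum() tells us how many participants we can keep
sum(old_enough)
# [1] 13

# Square brackets keep only the ages where old_enough is TRUE
ages[old_enough]
```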

You might notice that the intermediate step of creating the ages object isn’t actually necessary. However, this style of assertion using object (or variable) names as we did in ages >= 18 is about to come in very handy in the next section.

Filter

With some assertion-writing under our belts, let’s translate this process to a dataset.

We’ll now (re-)encounter the filter() function from the {dplyr} package, which you should already have some experience with from Psychology as a Science last term. Don’t worry if you’re feeling a bit rusty - that’s what this tutorial is for!

If you’ve been wondering why I’m going on about assertions and logical data, filter() is one of the main reasons. This function is absolutely essential for working with and cleaning data. The main thing that filter() does is keep or drop cases (i.e. rows) based on the criteria you give it - and those criteria, as you might see, are the exact same logical assertions we’ve been working with so far.

Before we go any further, we’ll need some data to work with.

Task 3

Copy the code below into a new code chunk in your own Quarto document and run the code. You should see the new object, my_data, appear in your Environment.

Then, call the name of the dataset to see what’s in it.

my_data <- readr::read_csv("data/tutorial_01_data.csv")

my_data

You should see we have a few different variables to work with. This data is randomly generated, so it doesn’t correspond to a real study. Hmm… I wonder what that message variable is about? It doesn’t look like anything at the moment…

Basic Form

The “basic form” of a function is often written in something called pseudocode. You will have seen a lot of pseudocode in the discovr tutorials - it’s a template for how a function is laid out, with placeholder elements describing what each piece is or does. Here’s the pseudocode basic form of filter():

dplyr::filter(
  dataset_name,
  logical_assertion
)

This code won’t work as is; it’s intended as a template. You should swap out dataset_name with the name of the object that contains the dataset, and logical_assertion with a logical assertion telling the function which cases to keep.

What about the pipe?

The pseudocode “basic form” above might look a little different from what you expected. If you got used to using the pipe operator |> in Psychology as a Science, you might have instead expected something like this:

dataset_name |> 
  dplyr::filter(
    logical_assertion
  )

We will be using the pipe on this module, but we’re setting it aside for the moment. I really want you to focus only on the core skills of logical assertions and filtering, without worrying about the pipe. In a couple of weeks, we’ll re-introduce the pipe and get into a bit more depth about how and why it works.

When we write logical assertions within filter(), a key thing to remember is that we use the names of variables in the dataset to write the assertions.
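Here is a minimal sketch of the basic form filled in, using a tiny made-up dataset rather than my_data. The names toy_data, id, rating and kept are all hypothetical, and the example assumes the {dplyr} package is installed.

```r
# A small made-up dataset; toy_data and its variables are hypothetical
toy_data <- data.frame(
  id = 1:5,
  rating = c(2, 7, 4, 9, 6)
)

# Keep only the rows where the value in rating is at least 6.
# Note that rating is written on its own, without quotes or toy_data$
kept <- dplyr::filter(
  toy_data,
  rating >= 6
)

kept
```

The result should contain only the rows for ids 2, 4 and 5, since those are the cases where the assertion rating >= 6 is TRUE.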

Let’s try applying that to the dataset we have.

Task 4

Filter the dataset to keep only participants who had more than 5 siblings.

Hint

Start with the basic form of the function and replace each element one at a time.

Write your assertion using a variable name from the dataset. Which variable contains the information about the number of siblings? What value(s) of this variable do you want to keep?

Solution

dplyr::filter(
  my_data,
  n_siblings > 5
)

# A tibble: 5 × 6
  message   age eye_colour pet   n_siblings score
  <chr>   <dbl> <chr>      <chr>      <dbl> <dbl>
1 H          26 brown      cat           10 104. 
2 E          42 brown      cat            9  85.4
3 L          46 brown      dog            9  85.8
4 L          22 grey       dog            8 109. 
5 O          35 green      cat           10 115.

For the logical assertion here, notice that I’ve used the name of the variable in the dataset, n_siblings. Essentially, we can read this code as:

Filter the dataset my_data to keep only the cases where the value in the variable n_siblings is greater than 5.

Once you’ve finished that task, have a look at the message column. Notice anything? There should be a message for you to find there! 👋

Now that we’ve got the hang of the filter() function, let’s look at some other ways we can filter that might come in handy.

Matching Values

We’ve already seen how we can match exact values, using ==. We can do this both with numbers, and with strings. However, what if we want to match more than one exact value? For instance, say I wanted to filter on eye colour for people who had brown OR green eyes - that is, I want R to return TRUE if a person has either eye colour.

Here the best option is the matching operator %in%. Here’s a short example with a vector:

eye_colour <- c("blue", "grey", "brown", "brown", "green", "blue", "brown", "blue", "grey", "brown")

eye_colour %in% c("brown", "green")

 [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE

In this case, the object eye_colour contains the eye colours in my dataset. These are the values I want to match to a list of possibilities.

The code eye_colour %in% c("brown", "green") does this matching. It checks whether each of the values on the left (in eye_colour) appears in the vector of possibilities on the right. We can read that second bit of code as, “Which values in eye_colour are IN the set of possible values ‘brown’ and ‘green’?”

What we get back is a vector of logical data the same length as the original left-hand-side vector, so one TRUE or FALSE for each element of eye_colour. The first TRUE is in the third position, because the third value in eye_colour, “brown”, matches one of the two options, brown OR green. The next TRUE values match another “brown”, then “green”, etc. So, %in% is a very useful way to look for one of several possible matches.
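It can be tempting to use == with a vector of options instead of %in%. The sketch below shows why that tends to go wrong: == pairs up the values on each side one by one, reusing the shorter vector as many times as needed, rather than checking each value against the whole set. The names in_result and eq_result are made up for this example.

```r
eye_colour <- c("blue", "grey", "brown", "brown", "green", "blue", "brown", "blue", "grey", "brown")

# %in% checks each value against the whole set of options
in_result <- eye_colour %in% c("brown", "green")
sum(in_result)
# [1] 5

# == recycles the right-hand vector instead: the 1st value is compared
# to "brown", the 2nd to "green", the 3rd to "brown" again, and so on
eq_result <- eye_colour == c("brown", "green")
sum(eq_result)
# [1] 2
```

R doesn’t give an error or warning here, so the == version quietly misses three of the five matches.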

Ranges of Values

For numerical data, we often want to keep only values between an upper and lower bound. For example, we might want to exclude participants on a reaction-time task who weren’t paying attention and responded far too slowly, and participants who pushed the button pre-emptively and responded far too quickly. Let’s say that “too slow” is anything over 3000 milliseconds, and “too fast” is anything under 200 ms.

We could have two separate logical assertions for this, but the easier way is to use the convenient little function dplyr::between(). To use this function, we provide the vector or variable name to work with, the left (lowest) value, and the right (highest) value. between() returns TRUE only for the values that fall on or within the bounds. Here’s an example with some reaction-time data in milliseconds:

rt_ms <- c(139, 398, 934, 200, 458, 2496, 1949, 3069, 29, 930, 1049, 2001, 5079)

dplyr::between(rt_ms, left = 200, right = 3000)

 [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
[13] FALSE

Notice that the value 200 returns TRUE, as it’s on the bound. So, dplyr::between() behaves like >= plus <=.
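To check that claim, here is a sketch comparing dplyr::between() with the two separate assertions written out by hand. It joins them with the & (AND) operator, which is covered later in this tutorial. The names with_between and by_hand are made up, and the example assumes {dplyr} is installed.

```r
rt_ms <- c(139, 398, 934, 200, 458, 2496, 1949, 3069, 29, 930, 1049, 2001, 5079)

# The convenient version
with_between <- dplyr::between(rt_ms, left = 200, right = 3000)

# The same idea written as two assertions joined with & (AND)
by_hand <- rt_ms >= 200 & rt_ms <= 3000

# Do the two versions agree for every value?
all(with_between == by_hand)
# [1] TRUE
```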

Missing Values

An absolutely crucial task that we often need to undertake is to find and deal with missing values. In R, missing data is represented by the letters NA, which stand for “not available”.

R has a whole family of functions that start with is. that check whether data matches some particular type. So for example, there’s is.character(), is.infinite(), etc., but the main one we’re interested in is is.na(). The is.na() function returns TRUE if a piece of data IS missing, and FALSE otherwise.

Let’s say we re-ran our reaction-time study from above. But this time, we had some issues where the stimulus didn’t appear on some trials, so there was no time recorded. We could check for these cases using is.na():

rt_ms_2 <- c(139, 398, NA, 200, 458, 2496, 1949, NA, 29, NA, 1049, 2001, 5079)

is.na(rt_ms_2)

 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
[13] FALSE

This time, the TRUEs represent values that ARE missing.
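Two quick things worth trying with these results, sketched below. First, because TRUE counts as 1, sum() will count the missing values. Second, you might wonder why we need is.na() at all instead of writing == NA: comparing anything to NA just returns NA, because R can’t say whether an unknown value is equal to something. The name n_missing is made up for this example.

```r
rt_ms_2 <- c(139, 398, NA, 200, 458, 2496, 1949, NA, 29, NA, 1049, 2001, 5079)

# Count how many values are missing
n_missing <- sum(is.na(rt_ms_2))
n_missing
# [1] 3

# Comparing with == doesn't find the missing values:
# every result is NA, not TRUE or FALSE
rt_ms_2 == NA
```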

No tasks yet - but we’ll use these in just a moment!

Combining Assertions

Our last major topic today is combining multiple logical assertions. For this we’ll use two new operators: & AND and | OR. When assertions are combined, the overall statement returns only one TRUE or FALSE value for each case being evaluated.

& AND returns TRUE only when all of its elements are also TRUE.
| OR returns TRUE when any of its elements are TRUE.

So, let’s look at two logical assertions, one evaluating to TRUE and the other to FALSE.

29 > 6

[1] TRUE

"black" == "white"

[1] FALSE

If we combine these assertions by putting & AND between them, it is NOT true that they are both true, so we get FALSE.

29 > 6 & "black" == "white"

[1] FALSE

However, if we combine these assertions by putting | OR between them, it IS true that at least one of them is true, so we get TRUE.

29 > 6 | "black" == "white"

[1] TRUE
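Combined assertions are vectorised too, so they work on a whole vector at once and return one TRUE or FALSE per element. Here is a sketch using the participant ages from earlier; the cutoffs of 30 and 50 and the object names are made up for illustration.

```r
ages <- c(18, 34, 57, 19, 21, 22, 16, 48, 26, 22, 18, 17, 18, 18, 20)

# AND: TRUE only for ages that are at least 18 and also under 30
young_adult <- ages >= 18 & ages < 30
young_adult
sum(young_adult)
# [1] 10

# OR: TRUE for ages that are under 18 or over 50 (or both)
under_18_or_over_50 <- ages < 18 | ages > 50
sum(under_18_or_over_50)
# [1] 3
```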

Task 5

Filter my_data to keep only the cases who had a dragon for a pet, AND who were younger than 16 years old.

Don’t forget to read the message when you’re done!

Solution

dplyr::filter(
  my_data,
  age < 16 & pet == "dragon"
)

# A tibble: 8 × 6
  message   age eye_colour pet    n_siblings score
  <chr>   <dbl> <chr>      <chr>       <dbl> <dbl>
1 W           7 green      dragon          4  98.3
2 E           9 brown      dragon          0 119. 
3 L           9 grey       dragon          2  90.5
4 L           7 green      dragon          0 112. 
5 D          13 blue       dragon          1  83.4
6 O          11 green      dragon          3 105. 
7 N           8 brown      dragon          4  95.9
8 E          15 green      dragon          4  91.5

That’s it for the core skills you need for this week. I strongly encourage you to try out the ChallengR tasks below, even if you’re not 100% confident yet. The best way to build your skills is to give it a go, and all of the tasks in the next section, while optional, can be solved with what we’ve covered in this tutorial.

ChallengR

Ready for a little puzzle? We’ll sometimes include these “ChallengR” tasks to help you push your skills in R. ChallengRs are always optional, and will never be assessed - they’re only there to inspire you to try new things! These filter() tasks are just a little tougher, but if you can crack ’em, you’ll be well on your way to mastering this skill.

For each task, producing the correctly filtered dataset will reveal a message in the message column. Put them all together to find a ✨secret message✨, which - if you do what it says - will earn you a bonus 2500 Kahoot Points. You can use those points to earn bragging rights and, more importantly, shiny stickers! See the Games and Awards page on Canvas for a full explanation.

No Solutions

There are no solutions to these ChallengR tasks. If you get stuck, ask us for help in your practicals or at the Help Desk, and we’ll be happy to point you in the right direction.

Task 6

Filter the data to keep only cases who have either hazel or black eyes.

Hint

You can either use two separate statements with | OR, or you can try using %in% from earlier.

Task 7

Filter the data to keep only cases where score is outside the bounds of 63 - 134.

Hint

This means that the scores are either less than 63 or more than 134. Again, you can use two statements, or dplyr::between() from earlier - but remember we want scores that are outside those bounds, not inside.

If you’re trying to use dplyr::between() and you’ve written a filter() statement for cases that DO fall inside the bounds, have a look through the tutorial to find a way to reverse trues and falses!

Task 8

Filter the data to keep only cases that have blue eyes; are exactly 42 years old; AND have a dog for a pet.

Once you’ve found the secret message, do what it says to get your ChallengR Kahoot! points.