TRUE
[1] TRUE
T
[1] TRUE
FALSE
[1] FALSE
F
[1] FALSE
In the first part of this module, we’re going to practice in depth a few skills that you’ve already seen in Psychology as a Science, and encourage you to expand those skills a little bit too. In the first week, we’re going to do a deep dive on filtering data. To do that, we’re going to do some step-by-step revision of logical assertions, which are the crucial foundation of filtering successfully and accurately.
We’re starting on logical assertions and filtering for a few reasons:
Before we jump into filter()ing, we’re going to take a step back and start by exploring logical assertions.
I’ve been throwing around this term, “logical assertions”, but I haven’t clearly explained what I mean by that phrase. So, here’s the first Vocab box; we’ll use these to highlight words or terms that you might not be familiar with, or that have a different meaning in the context of R or data.
Logical assertions make a claim that can be either true or false. This claim can be very simple, or very complicated.
When we “evaluate”, or run, the assertion, R will return either TRUE or FALSE, in the same way that it will return the result of a mathematical operation. These two words are special symbols in R - they aren’t words, they’re values. (Note that the letters T and F stand for the same two values by default, although unlike TRUE and FALSE they aren’t strictly reserved and can be overwritten, so the full words are the safer choice!) These two values constitute logical data in R.
In these tutorials you’ll sometimes see “MoRe About” boxes like the one below. The content in these boxes is optional - it won’t be necessary in any of your assessments. We include this extra info for the curious and keen, as they’ll help you understand the why and how of R better; but they aren’t essential.
It might strike you as a bit odd that the words TRUE and FALSE (and letters T and F) are values in R. It’s worth getting used to this idea, because we’ll be seeing a lot about these values in a moment!
First, let’s check that T and F do in fact mean TRUE and FALSE, by simply running them as code in a code chunk.
You might notice that when we type them in a code chunk, TRUE/T and FALSE/F turn a different colour to indicate that they’re values in R.
As a contrast, notice that other words and single letters are not special values - this is specific to TRUE/T and FALSE/F only, so running another letter, like R, just produces an “object not found” error. Similarly, since R is case-sensitive, other spellings such as true or True don’t have any special meaning. Finally, using “double quotes” turns any of these symbols into a string, and they no longer have their special true/false value.
Error in eval(expr, envir, enclos): object 'R' not found
Error in eval(expr, envir, enclos): object 'true' not found
Error in eval(expr, envir, enclos): object 'True' not found
[1] "FALSE"
In short, to use logical data, you MUST use TRUE/T and FALSE/F without quotes, in all caps, only.
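To make that rule concrete, here is a small sketch (not one of the original chunks) that uses the base R functions is.logical() and class() to check what kind of data each spelling produces. The object names unquoted and quoted are made up for illustration.

```r
# Unquoted and in all caps: a genuine logical value
unquoted <- FALSE
is.logical(unquoted)   # TRUE
class(unquoted)        # "logical"

# In double quotes: just a string, not logical data
quoted <- "FALSE"
is.logical(quoted)     # FALSE
class(quoted)          # "character"
```

The two objects print almost identically, which is why checking the class can be a useful habit when a filter isn’t behaving as expected.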
I said a moment ago that TRUE/T and FALSE/F are colour-coded in R to indicate that they are special values. We can go a step beyond that - not only are they special values, they can also be converted into numbers. Specifically, TRUE/T is 1, and FALSE/F is 0. This even lets us do maths with them. Remember that TRUE is 1, so we can add two TRUEs together:
If you think there’s some sort of trickery there, we can ask R whether it’s the case that TRUE is exactly equal to 1:
And if we really push the boundaries, we can do just about any maths we like:
Now, this last example is a little silly - this isn’t the sort of thing we’ll ever need to do on this course. But it does illustrate that you can force, or coerce, logical data into numeric data. This can be very useful, for example, in counting how many TRUEs you have.
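The original chunks for this section aren’t shown here, so the following is a minimal sketch of the same ideas. The vector answers contains made-up values, and sum() is used as one straightforward way to count TRUEs.

```r
# TRUE counts as 1 and FALSE as 0 when R coerces logical data to numbers
two_trues <- TRUE + TRUE
two_trues              # 2

TRUE == 1              # TRUE
FALSE * 10             # 0

# Counting TRUEs: sum() adds up the 1s and ignores the 0s
answers <- c(TRUE, FALSE, TRUE, TRUE, FALSE)   # made-up values
n_true <- sum(answers)
n_true                 # 3
```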
Feel free to experiment further with the characteristics of logical data. When you encounter a new feature of R, it’s always a good idea to play with it a bit and see what happens in different situations. Don’t be afraid of getting errors - that’s just part of the process. Have fun!
If that’s clear as mud, let’s try producing some logical data to get the hang of assertions.
Evaluate the following assertions in R.
These may cause you some trouble if the notation is unfamiliar.
For “less than or equal to”, R won’t recognise the ≤ symbol. Instead, we combine two operators, “less than” < and “equal to” =, in the same order we’d normally read them aloud, giving <=. The same goes for “greater than or equal to”, >=. (It does have to be this way round; try =< and => to see what happens.)
For “does not equal”, ! is common notation in R for “not”, or the reverse of something. So != can be read as “not-equals”. (See what happens if you run !TRUE in a code chunk.)
For “equals”, if you tried this with a single equals sign, you would have had a strange error:
The problem is that the single equals sign =, like the comma, has some very specialised syntactic uses, including one equivalent to the assignment operator <-. Single equals = also has an important and specific role to play in function arguments. In short, = is a special operator that doesn’t assert that two things are equal. Instead, “exactly equals” in R is “double-equals” (or “exactly and only”), ==.
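As a quick reference for the operators just described, here is a small sketch. The numbers are made up for illustration and are not the assertions from the task; the results are collected in a named vector called checks so they can be read side by side.

```r
checks <- c(
  less_or_equal    = 3 <= 7,    # TRUE: 3 is less than or equal to 7
  greater_or_equal = 7 >= 7,    # TRUE: "or equal to" includes the boundary
  not_equal        = 4 != 4,    # FALSE: 4 does equal 4
  exactly_equal    = 10 == 10,  # TRUE: double-equals asks "are these equal?"
  reversed         = !TRUE      # FALSE: ! flips a logical value
)
checks
```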
What about assertions for more than one number at once? Evaluating individual numbers, as we’ve just done, is fine - and is sometimes very helpful! - but we often have a bunch of numbers, like reaction times on a button-pressing task or ratings on a personality scale, that we might want to evaluate. Doing that one number at a time for 100 or 1000 responses would be really tedious, so instead we’ll take advantage of a feature of R, called vectorisation. To do that, we first need to have a look at one of the basic units of storing data in R: the vector.
Vectors are collections of elements in R. For example, we can produce a vector of the numbers 1 through 20 using the “through” operator : like this:
We can also collect (or combine, or concatenate) elements of our choice into a vector with c():
All of the elements of a vector must be the same type of data. Different types will be coerced, or forced, into the most general data type. Here, all the elements have been coerced to strings (note the double quotes!).
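The chunks that went with these three points aren’t shown here, so this is a minimal sketch of each one. The object names and the values inside c() are made up for illustration.

```r
# The numbers 1 through 20, using the : operator
one_to_twenty <- 1:20
one_to_twenty

# Hand-picked elements combined with c()
my_numbers <- c(3, 8, 15)
my_numbers

# Mixed types are coerced to the most general type, here strings
mixed <- c(1, "two", TRUE)
mixed          # "1" "two" "TRUE"
class(mixed)   # "character"
```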
Remember, we wanted to evaluate lots of logical assertions at once. As an example, let’s evaluate whether each of the numbers 130 through 140 is greater than or equal to 135. To do this, we can write our assertion like this:
Instead of having to write an assertion for each number - 130 >= 135, 131 >= 135, 132 >= 135… - we can write an assertion using a vector, and R will evaluate the assertion for each element of the vector. We first create the vector 130:140, then make an assertion, >= 135, that applies to each element.
The output is also a vector, but this time a vector of FALSEs and TRUEs that correspond to each element of the input vector. So the first FALSE is the result of 130 >= 135, the second FALSE corresponds to 131 >= 135, and so on.
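Here is a sketch of the vectorised assertion described above, written out so you can run it. The object name over_135 is made up; the assertion itself is the one from the text.

```r
# One assertion, evaluated for each of the 11 numbers from 130 to 140
over_135 <- 130:140 >= 135
over_135

# 130 to 134 give FALSE (five values); 135 to 140 give TRUE (six values)
length(over_135)   # 11
sum(over_135)      # 6
```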
Why might this be useful? As a quick example, imagine that you’ve collected data for a study and want to identify participants to remove based on age. To make sure you meet your ethical requirements, you must remove anyone who is too young to consent. This is such a common task that it’s worth having a go now!
First, store the vector of participant ages below in a new object called ages.
Next, write an assertion that returns TRUE for participants in ages who are at or above the ethical age of consent to participate as adults, and FALSE for those who are too young and must be removed.
Remember that you can copy code from a code chunk by hovering over the chunk and clicking the clipboard icon that appears on the right side.
For the ethical age of consent, use the age of majority in England.
You’ll need two lines of code for this task. For the first, use the assignment operator <- to store the vector of numbers as the object ages. For the second, write a logical assertion about that object.
The age of majority in England is 18, which we will use as our cutoff.
[1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
[13] TRUE TRUE TRUE
As before, we get a vector of TRUEs and FALSEs, corresponding to each element in the original ages vector. So, the two FALSEs correspond to the values of 16 and 17 in the original vector.
You might notice that the intermediate step of creating the ages object isn’t actually necessary. However, this style of assertion using object (or variable) names as we did in ages >= 18 is about to come in very handy in the next section.
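To see the same pattern on a smaller scale, here is a sketch using a made-up vector. The name example_ages and its values are hypothetical and are not the ages from the task; the cutoff of 18 is the one given above.

```r
# Hypothetical ages, not the vector from the task
example_ages <- c(20, 19, 16, 25, 17, 18)

# One TRUE or FALSE per participant
is_adult <- example_ages >= 18
is_adult          # TRUE TRUE FALSE TRUE FALSE TRUE

# Because TRUE counts as 1, sum() tells us how many participants we can keep
sum(is_adult)     # 4
```

Notice that 18 itself returns TRUE, because >= includes the boundary value.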
With some assertion-writing under our belts, let’s translate this process to a dataset.
We’ll now (re-)encounter the filter() function from the {dplyr} package, which you should already have some experience with from Psychology as a Science last term. Don’t worry if you’re feeling a bit rusty - that’s what this tutorial is for!
If you’ve been wondering why I’m going on about assertions and logical data, filter() is one of the main reasons. This function is absolutely essential for working with and cleaning data. The main thing that filter() does is keep or drop cases (i.e. rows) based on the criteria you give it - and those criteria, as you might see, are the exact same logical assertions we’ve been working with so far.
Before we go any further, we’ll need some data to work with.
Copy the code below into a new code chunk in your own Quarto document and run the code. You should see the new object, my_data, appear in your Environment.
Then, call the name of the dataset to see what’s in it.
You should see we have a few different variables to work with. This data is randomly generated, so it doesn’t correspond to a real study. Hmm… I wonder what that message variable is about? It doesn’t look like anything at the moment…
The “basic form” of a function is often written in something called pseudocode. You will have seen a lot of pseudocode in the discovr tutorials - it’s a template for how a function is laid out, with placeholder elements describing what each piece is or does. Here’s the pseudocode basic form of filter():
This code won’t work as is; it’s intended as a template. You should swap out dataset_name with the name of the object that contains the dataset, and logical_assertion with a logical assertion telling the function which cases to keep.
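As a filled-in illustration of that template, here is a minimal sketch. It assumes the basic form is dplyr::filter(dataset_name, logical_assertion), and it uses a small made-up dataset called toy_data (hypothetical, not my_data) so that it can be run on its own.

```r
library(dplyr)

# A tiny made-up dataset (hypothetical; not the tutorial's my_data)
toy_data <- data.frame(
  id  = 1:5,
  age = c(17, 22, 35, 16, 41)
)

# dataset_name      -> toy_data
# logical_assertion -> age >= 18
adults <- dplyr::filter(toy_data, age >= 18)
adults   # keeps the three rows with ages 22, 35 and 41
```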
The pseudocode “basic form” above might look a little different from what you expected. If you got used to using the pipe operator |> in Psychology as a Science, you might instead have expected something like this:
We will be using the pipe on this module, but we’re setting it aside for the moment. I really want you to focus only on the core skills of logical assertions and filtering, without worrying about the pipe. In a couple weeks, we’ll re-introduce the pipe and get into a bit more depth about how and why it works.
When we write logical assertions within filter(), a key thing to remember is that we use the names of variables in the dataset to write the assertions.
Let’s try applying that to the dataset we have.
Filter the dataset to keep only participants who had more than 5 siblings.
Start with the basic form of the function and replace each element one at a time.
Write your assertion using a variable name from the dataset. Which variable contains the information about the number of siblings? What value(s) of this variable do you want to keep?
# A tibble: 5 × 6
  message   age eye_colour pet   n_siblings score
  <chr>   <dbl> <chr>      <chr>      <dbl> <dbl>
1 H          26 brown      cat           10 104.
2 E          42 brown      cat            9  85.4
3 L          46 brown      dog            9  85.8
4 L          22 grey       dog            8 109.
5 O          35 green      cat           10 115.
For the logical assertion here, notice that I’ve used the name of the variable in the dataset, n_siblings. Essentially, we can read this code as:
Filter the dataset my_data to keep only the cases where the value in the variable n_siblings is greater than 5.
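It can help to see that the assertion inside filter() is the same kind of vectorised assertion we wrote earlier. The sketch below uses a made-up stand-in dataset, toy_data (hypothetical, not my_data), and assumes the call takes the form dplyr::filter(dataset, assertion).

```r
library(dplyr)

# Hypothetical stand-in for a dataset with an n_siblings variable
toy_data <- data.frame(
  pet        = c("cat", "dog", "cat", "dog"),
  n_siblings = c(2, 9, 6, 5)
)

# The assertion on its own: one TRUE or FALSE per row
sibling_check <- toy_data$n_siblings > 5
sibling_check                  # FALSE TRUE TRUE FALSE

# Inside filter(), we use the bare variable name; rows that give TRUE are kept
many_siblings <- dplyr::filter(toy_data, n_siblings > 5)
many_siblings                  # the rows with 9 and 6 siblings
```

The case with exactly 5 siblings is dropped, because > does not include the boundary value.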
Once you’ve finished that task, have a look at the message column. Notice anything? There should be a message for you to find there! 👋
Now that we’ve got the hang of the filter() function, let’s look at some other ways we can filter that might come in handy.
We’ve already seen how we can match exact values, using ==. We can do this both with numbers, and with strings. However, what if we want to match more than one exact value? For instance, say I wanted to filter on eye colour for people who had brown OR green eyes - that is, I want R to return TRUE if a person has either eye colour.
Here the best option is the matching operator %in%. Here’s a short example with a vector:
eye_colour <- c("blue", "grey", "brown", "brown", "green", "blue", "brown", "blue", "grey", "brown")
eye_colour %in% c("brown", "green")
[1] FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
In this case, the object eye_colour contains the eye colours in my dataset. These are the values I want to match to a list of possibilities.
The code eye_colour %in% c("brown", "green") does this matching. It checks whether each of the values on the left (in eye_colour) appears in the vector of possibilities on the right. We can read that second bit of code as, “Which values in eye_colour are IN the set of possible values ‘brown’ and ‘green’?”
What we get back is a vector of logical data the same length as the original left-hand-side vector, so one TRUE or FALSE for each element of eye_colour. The first TRUE appears in the third position, because the third value in eye_colour, “brown”, matches either brown OR green. The next TRUE values match another “brown”, then “green”, etc. So, %in% is a very useful way to look for one of several possible matches.
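One way to convince yourself of what %in% is doing is to compare it with the longer version written using == and the | OR operator, which is covered later in this tutorial. This is a sketch in base R; the object names with_in and with_or are made up for illustration.

```r
eye_colour <- c("blue", "grey", "brown", "brown", "green",
                "blue", "brown", "blue", "grey", "brown")

# The short version: is each value IN the set of possibilities?
with_in <- eye_colour %in% c("brown", "green")

# The long version: is each value equal to "brown" OR equal to "green"?
with_or <- eye_colour == "brown" | eye_colour == "green"

identical(with_in, with_or)   # TRUE: both give the same answer
sum(with_in)                  # 5 matches (four brown, one green)
```

The longer version gets unwieldy quickly as the list of possibilities grows, which is where %in% earns its keep.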
For numerical data, we often want to keep only values between a lower and an upper bound. For example, we might want to exclude participants on a reaction-time task who weren’t paying attention and responded far too slowly, as well as participants who pushed the button pre-emptively and responded far too quickly. Let’s say that “too slow” is anything over 3000 milliseconds, and “too fast” is anything under 200 ms.
We could have two separate logical assertions for this, but the easier way is to use the convenient little function dplyr::between(). To use this function, we provide the vector or variable name to work with, the left (lowest) value, and the right (highest) value. between() returns TRUE only for the values that fall on or within the bounds. Here’s an example with some reaction-time data in milliseconds:
rt_ms <- c(139, 398, 934, 200, 458, 2496, 1949, 3069, 29, 930, 1049, 2001, 5079)
dplyr::between(rt_ms, left = 200, right = 3000)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
[13] FALSE
Notice that the value 200 returns TRUE, as it’s on the bound. So, dplyr::between() behaves like >= plus <=.
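To check that claim, here is a sketch that writes out the two separate assertions by hand, joined with the & AND operator covered later in this tutorial, and compares the result with dplyr::between(). The object names with_between and by_hand are made up for illustration.

```r
library(dplyr)

rt_ms <- c(139, 398, 934, 200, 458, 2496, 1949, 3069, 29, 930, 1049, 2001, 5079)

# The convenient version
with_between <- dplyr::between(rt_ms, left = 200, right = 3000)

# The by-hand version: at or above 200 AND at or below 3000
by_hand <- rt_ms >= 200 & rt_ms <= 3000

all(with_between == by_hand)   # TRUE: the two approaches agree
sum(with_between)              # 9 reaction times fall on or within the bounds
```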
An absolutely crucial task that we often need to undertake is to find and deal with missing values. In R, missing data is represented by the letters NA, which stand for “not available”.
R has a whole family of functions that start with is. that check whether data matches some particular type. So for example, there’s is.character(), is.infinite(), etc., but the main one we’re interested in is is.na(). The is.na() function returns TRUE if a piece of data IS missing, and FALSE otherwise.
Let’s say we re-ran our reaction-time study from above. But this time, we had some issues where the stimulus didn’t appear on some trials, so there was no time recorded. We could check for these cases using is.na():
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
[13] FALSE
This time, the TRUEs represent values that ARE missing.
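The data for this example isn’t shown here, so the sketch below uses a made-up vector, rt_ms_2, with missing values placed in the third, eighth and tenth positions to match the output above. The actual reaction times are hypothetical.

```r
# Hypothetical reaction times with three missing trials
rt_ms_2 <- c(139, 398, NA, 200, 458, 2496, 1949, NA, 29, NA, 1049, 2001, 5079)

# TRUE marks a value that IS missing
missing <- is.na(rt_ms_2)
missing

# Because TRUE counts as 1, sum() counts the missing values
n_missing <- sum(missing)
n_missing          # 3

# Reversing with ! gives TRUE for values that are present
!is.na(rt_ms_2)
```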
No tasks yet - but we’ll use these in just a moment!
Our last major topic today is combining multiple logical assertions. For this we’ll use two new operators: & AND and | OR. When assertions are combined, the overall statement returns only one TRUE or FALSE value.
& AND returns TRUE only when all of its elements are also TRUE.
| OR returns TRUE when any of its elements are TRUE.
So, let’s look at two logical assertions, one evaluating to TRUE and the other to FALSE.
If we combine these assertions by putting & AND between them, it is NOT true that they are both true, so we get FALSE.
However, if we combine these assertions by putting | OR between them, it IS true that at least one of them is true, so we get TRUE.
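The two assertions used in the original chunks aren’t shown here, so this is a minimal sketch with made-up ones: 5 > 3 (which is true) and 2 > 4 (which is false).

```r
true_claim  <- 5 > 3    # TRUE
false_claim <- 2 > 4    # FALSE

# AND: are both claims true?
both <- true_claim & false_claim
both      # FALSE

# OR: is at least one claim true?
either <- true_claim | false_claim
either    # TRUE
```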
Filter my_data to keep only the cases who had a dragon for a pet, AND who were younger than 16 years old.
Don’t forget to read the message when you’re done!
# A tibble: 8 × 6
  message   age eye_colour pet    n_siblings score
  <chr>   <dbl> <chr>      <chr>       <dbl> <dbl>
1 W           7 green      dragon          4  98.3
2 E           9 brown      dragon          0 119.
3 L           9 grey       dragon          2  90.5
4 L           7 green      dragon          0 112.
5 D          13 blue       dragon          1  83.4
6 O          11 green      dragon          3 105.
7 N           8 brown      dragon          4  95.9
8 E          15 green      dragon          4  91.5
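For reference, here is a sketch of how two assertions can be combined with & inside filter(). It uses a made-up dataset, toy_pets (hypothetical, not my_data), and assumes the call takes the form dplyr::filter(dataset, assertion).

```r
library(dplyr)

# A tiny made-up dataset (hypothetical; not the tutorial's my_data)
toy_pets <- data.frame(
  pet = c("dragon", "cat", "dragon", "dragon"),
  age = c(12, 10, 30, 15)
)

# Keep a row only if BOTH assertions are TRUE for that row
young_dragons <- dplyr::filter(toy_pets, pet == "dragon" & age < 16)
young_dragons   # the dragon owners aged 12 and 15
```

The 10-year-old is dropped because their pet is a cat, and the 30-year-old is dropped because of their age: each row has to pass both checks.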
That’s it for the core skills you need for this week. I strongly encourage you to try out the ChallengR task below, even if you’re not 100% confident yet. The best way to build your skills is to give it a go, and all of the tasks in the next section, while optional, can be solved with what we’ve covered in this tutorial.
Ready for a little puzzle? We’ll sometimes include these “ChallengR” tasks to help you push your skills in R. ChallengRs are always optional, and will never be assessed - they’re only there to inspire you to try new things! These filter() tasks are just a little tougher, but if you can crack ’em, you’ll be well on your way to mastering this skill.
For each task, producing the correctly filtered dataset will reveal a message in the message column. Put them all together to find a ✨secret message✨, which - if you do what it says - will earn you a bonus 2500 Kahoot Points. You can use those points to earn bragging rights and, more importantly, shiny stickers! See the Games and Awards page on Canvas for a full explanation.
There are no solutions to these ChallengR tasks. If you get stuck, ask us for help in your practicals or at the Help Desk, and we’ll be happy to point you in the right direction.
Filter the data to keep only cases who have either hazel or black eyes.
You can either use two separate statements with | OR, or you can try using %in% from earlier.
Filter the data to keep only cases where score is outside the bounds of 63 - 134.
This means that the scores are either less than 63 or more than 134. Again, you can use two statements, or dplyr::between() from earlier - but remember we want scores that are outside those bounds, not inside.
If you’re trying to use dplyr::between() and you’ve written a filter() statement for cases that DO fall inside the bounds, have a look through the tutorial to find a way to reverse trues and falses!
Filter the data to keep only cases that have blue eyes; are exactly 42 years old; AND have a dog for a pet.
Once you’ve found the secret message, do what it says to get your ChallengR Kahoot! points.