Tutorial 04: Summarising data

Introduction

In previous tutorials, we’ve learned about several useful dplyr functions for selecting variables, searching for cases that meet specific conditions, and creating new variables. At this point, we know how to search through data and work with datasets, but how do we start making sense of it all? This is where we need to start thinking about summarising data, which is what this tutorial covers. To make things more fun, we’ll pair some new dplyr functions with the pipe |>, which we learned about in the previous tutorial. We’ll end with some easy functions for making summary tables pretty, and some optional extras on plotting and interpreting CIs.

Data

In late 2023, some reputable newspapers™ started reporting on a new strategy that sailors had implemented for deterring the orcas¹ that frequently attacked boats - heavy metal. Some sailors noticed that orca attacks tended to be shorter than usual if they were blasting heavy metal music. Others, however, had a vastly different experience.

The Analysing Data team was hired by the South Pole Sailors’ Union to sail through the waters of Gibraltar, collect some high-quality data, and settle the debate once and for all. We collected information on over 150 attacks. During each attack, we either played heavy metal music from various bands or the soundtrack from the movie Shrek. Now we can look at the general trends in the data and produce some summaries.

Task 1

Load packages and explore the dataset.

  1. Load the tidyverse.
  2. The dataset file can be found at the path data/orca_data.csv. Use readr::read_csv() to read the data in.
  3. Store the dataset in an object called orca_tib so we can use it later.
  4. Explore the dataset to become familiar with the variables.

You’ll need to use the library() function to load packages. This function needs to be called somewhere at the beginning of the document.




You can read datasets with readr::read_csv(). Have a look at previous weeks’ code for examples!


Load packages:

library(tidyverse)

Load the dataset:

orca_tib <- readr::read_csv("data/orca_data.csv")

Exploring the data - there are several things you could try here. Some options include:

  • orca_tib - will print the dataset, 10 rows at a time

  • View(orca_tib) - will open the data in the viewer. It’s better to only use this option in the console and not in the document itself. Weird things happen if you try to render a document that contains the View() function.

attack_id date_of_attack month attack_duration music_genre band
12 2020-10-03 10 58 Shrek Shrek Soundtracks
18 2020-10-14 10 51 Heavy Metal LoTL
19 2020-10-16 10 40 Heavy Metal Opeth
21 2020-10-21 10 44 Heavy Metal LoTL
23 2020-10-23 10 53 Shrek Shrek Soundtracks
24 2020-10-24 10 52 Heavy Metal Saor
26 2020-10-27 10 55 Shrek Shrek Soundtracks
29 2020-11-02 11 52 Shrek Shrek Soundtracks
33 2020-11-11 11 45 Shrek Shrek Soundtracks
34 2020-11-17 11 43 Shrek Shrek Soundtracks
36 2020-11-24 11 44 Shrek Shrek Soundtracks
41 2020-12-17 12 44 Heavy Metal Powerwolf
42 2020-12-18 12 54 Shrek Shrek Soundtracks
46 2020-12-29 12 50 Shrek Shrek Soundtracks
48 2021-01-02 1 23 Heavy Metal Opeth
55 2021-01-18 1 43 Heavy Metal Powerwolf
57 2021-01-25 1 44 Heavy Metal Saor
59 2021-01-29 1 43 Shrek Shrek Soundtracks
60 2021-02-08 2 47 Shrek Shrek Soundtracks
62 2021-02-10 2 19 Heavy Metal Opeth
65 2021-02-27 2 37 Shrek Shrek Soundtracks
76 2021-03-30 3 50 Shrek Shrek Soundtracks
80 2021-04-18 4 32 Heavy Metal Powerwolf
86 2021-05-02 5 44 Shrek Shrek Soundtracks
90 2021-05-15 5 30 Heavy Metal Opeth
92 2021-05-21 5 47 Shrek Shrek Soundtracks
100 2021-06-02 6 64 Shrek Shrek Soundtracks
101 2021-06-04 6 31 Heavy Metal Iron Maiden
105 2021-06-11 6 61 Shrek Shrek Soundtracks
106 2021-06-12 6 37 Shrek Shrek Soundtracks
107 2021-06-13 6 51 Shrek Shrek Soundtracks
108 2021-06-14 6 30 Heavy Metal Iron Maiden
110 2021-06-18 6 45 Heavy Metal Powerwolf
111 2021-06-19 6 25 Shrek Shrek Soundtracks
118 2021-07-01 7 27 Heavy Metal Iron Maiden
122 2021-07-09 7 42 Heavy Metal Opeth
125 2021-07-15 7 43 Shrek Shrek Soundtracks
131 2021-07-30 7 41 Heavy Metal Saor
133 2021-08-05 8 42 Shrek Shrek Soundtracks
142 2021-08-28 8 57 Heavy Metal Saor
143 2021-08-30 8 59 Shrek Shrek Soundtracks
146 2021-09-13 9 29 Heavy Metal Powerwolf
147 2021-09-14 9 67 Shrek Shrek Soundtracks
158 2021-10-10 10 37 Heavy Metal Saor
161 2021-10-17 10 53 Heavy Metal Opeth
166 2021-11-02 11 64 Shrek Shrek Soundtracks
167 2021-11-03 11 38 Heavy Metal Opeth
169 2021-11-09 11 71 Heavy Metal Saor
171 2021-11-15 11 50 Shrek Shrek Soundtracks
174 2021-11-26 11 41 Heavy Metal LoTL
177 2021-12-01 12 60 Shrek Shrek Soundtracks
181 2021-12-13 12 59 Shrek Shrek Soundtracks
183 2021-12-19 12 42 Heavy Metal Saor
186 2021-12-31 12 44 Shrek Shrek Soundtracks
189 2022-01-09 1 26 Heavy Metal Iron Maiden
191 2022-01-11 1 66 Shrek Shrek Soundtracks
194 2022-01-15 1 25 Heavy Metal Saor
196 2022-01-19 1 51 Shrek Shrek Soundtracks
198 2022-01-26 1 23 Heavy Metal Iron Maiden
202 2022-02-05 2 44 Heavy Metal LoTL
207 2022-02-20 2 36 Shrek Shrek Soundtracks
210 2022-03-08 3 41 Shrek Shrek Soundtracks
214 2022-03-17 3 50 Shrek Shrek Soundtracks
217 2022-03-20 3 47 Shrek Shrek Soundtracks
218 2022-03-25 3 36 Heavy Metal Saor
220 2022-03-27 3 35 Shrek Shrek Soundtracks
221 2022-03-28 3 51 Heavy Metal LoTL
225 2022-04-02 4 27 Heavy Metal Iron Maiden
235 2022-04-25 4 41 Shrek Shrek Soundtracks
236 2022-04-26 4 37 Heavy Metal LoTL
237 2022-05-03 5 50 Heavy Metal Saor
243 2022-05-22 5 58 Heavy Metal LoTL
245 2022-05-26 5 34 Heavy Metal Iron Maiden
247 2022-05-31 5 42 Shrek Shrek Soundtracks
248 2022-06-05 6 39 Heavy Metal Saor
255 2022-07-07 7 36 Heavy Metal Opeth
256 2022-07-09 7 45 Shrek Shrek Soundtracks
257 2022-07-11 7 40 Heavy Metal Saor
262 2022-07-19 7 49 Heavy Metal Iron Maiden
268 2022-08-02 8 52 Heavy Metal Opeth
270 2022-08-05 8 52 Heavy Metal LoTL
271 2022-08-06 8 27 Heavy Metal Opeth
273 2022-08-09 8 30 Heavy Metal Iron Maiden
275 2022-08-11 8 48 Shrek Shrek Soundtracks
280 2022-08-21 8 28 Heavy Metal Powerwolf
282 2022-08-24 8 55 Heavy Metal LoTL
285 2022-09-02 9 61 Heavy Metal Iron Maiden
286 2022-09-10 9 38 Shrek Shrek Soundtracks
288 2022-09-13 9 55 Heavy Metal LoTL
290 2022-09-18 9 40 Heavy Metal Powerwolf
291 2022-09-19 9 41 Shrek Shrek Soundtracks
296 2022-10-04 10 37 Heavy Metal Iron Maiden
300 2022-10-18 10 42 Heavy Metal LoTL
306 2022-10-31 10 68 Shrek Shrek Soundtracks
308 2022-11-03 11 60 Heavy Metal LoTL
309 2022-11-08 11 46 Heavy Metal LoTL
311 2022-11-11 11 57 Heavy Metal LoTL
313 2022-11-14 11 72 Shrek Shrek Soundtracks
314 2022-11-18 11 68 Shrek Shrek Soundtracks
315 2022-11-19 11 44 Heavy Metal Iron Maiden
316 2022-11-22 11 47 Shrek Shrek Soundtracks
319 2022-11-26 11 39 Heavy Metal Iron Maiden
320 2022-11-27 11 33 Heavy Metal Iron Maiden
321 2022-11-28 11 32 Heavy Metal Iron Maiden
323 2022-12-03 12 54 Shrek Shrek Soundtracks
327 2022-12-07 12 41 Heavy Metal Opeth
328 2022-12-11 12 57 Heavy Metal LoTL
330 2022-12-15 12 36 Heavy Metal Iron Maiden
337 2022-12-26 12 53 Heavy Metal Powerwolf
341 2023-01-12 1 45 Heavy Metal Saor
348 2023-01-24 1 31 Shrek Shrek Soundtracks
349 2023-01-26 1 31 Heavy Metal Iron Maiden
350 2023-01-27 1 42 Shrek Shrek Soundtracks
351 2023-01-29 1 27 Heavy Metal Iron Maiden
352 2023-01-30 1 39 Heavy Metal Saor
357 2023-02-12 2 46 Heavy Metal LoTL
358 2023-02-13 2 38 Heavy Metal Saor
359 2023-02-15 2 76 Shrek Shrek Soundtracks
360 2023-02-16 2 38 Heavy Metal Opeth
363 2023-02-19 2 22 Heavy Metal LoTL
370 2023-03-05 3 63 Shrek Shrek Soundtracks
371 2023-03-08 3 27 Heavy Metal Powerwolf
373 2023-03-10 3 50 Shrek Shrek Soundtracks
380 2023-03-19 3 38 Shrek Shrek Soundtracks
384 2023-03-31 3 41 Shrek Shrek Soundtracks
385 2023-04-02 4 44 Heavy Metal Iron Maiden
398 2023-05-01 5 62 Shrek Shrek Soundtracks
404 2023-05-21 5 63 Shrek Shrek Soundtracks
407 2023-05-25 5 53 Shrek Shrek Soundtracks
409 2023-05-30 5 39 Shrek Shrek Soundtracks
410 2023-06-11 6 44 Heavy Metal LoTL
413 2023-06-19 6 51 Heavy Metal Saor
414 2023-06-20 6 54 Shrek Shrek Soundtracks
418 2023-06-24 6 51 Shrek Shrek Soundtracks
424 2023-07-10 7 36 Heavy Metal Powerwolf
425 2023-07-12 7 16 Heavy Metal Iron Maiden
426 2023-07-18 7 49 Heavy Metal LoTL
427 2023-07-25 7 43 Shrek Shrek Soundtracks
428 2023-07-26 7 66 Shrek Shrek Soundtracks
429 2023-07-27 7 12 Heavy Metal Iron Maiden
433 2023-08-01 8 55 Shrek Shrek Soundtracks
434 2023-08-02 8 47 Shrek Shrek Soundtracks
435 2023-08-11 8 52 Shrek Shrek Soundtracks
440 2023-08-28 8 48 Heavy Metal Saor
443 2023-09-13 9 56 Shrek Shrek Soundtracks
444 2023-09-14 9 52 Heavy Metal Saor
446 2023-09-23 9 40 Heavy Metal Saor
448 2023-10-09 10 43 Heavy Metal Opeth
449 2023-10-11 10 41 Heavy Metal Saor
452 2023-10-22 10 40 Shrek Shrek Soundtracks
454 2023-10-24 10 44 Heavy Metal Opeth
458 2023-10-31 10 28 Heavy Metal Opeth
460 2023-11-02 11 45 Shrek Shrek Soundtracks
462 2023-11-10 11 46 Shrek Shrek Soundtracks
463 2023-11-11 11 60 Shrek Shrek Soundtracks
464 2023-11-12 11 NA Heavy Metal Powerwolf
465 2023-11-14 11 50 Shrek Shrek Soundtracks
466 2023-11-17 11 34 Heavy Metal Iron Maiden
467 2023-11-18 11 58 Shrek Shrek Soundtracks
470 2023-11-21 11 59 Shrek Shrek Soundtracks
477 2023-12-09 12 61 Shrek Shrek Soundtracks
478 2023-12-10 12 40 Heavy Metal Opeth
484 2023-12-24 12 45 Heavy Metal Powerwolf
485 2023-12-26 12 39 Shrek Shrek Soundtracks

Codebook

  • attack_id - ID number of the orca attack

  • date_of_attack - date when the attack happened

  • month - month when the attack was recorded. Numeric (1 - January, 12 - December)

  • attack_duration - duration of the attack in minutes

  • music_genre - did the crew play heavy metal or Shrek soundtracks during the attacks?

  • band - name of the band

Research Question

Before we jump into the code, let’s first think for a moment about what we’re going to do and why.

We have our dataset, which was described by the Codebook above. From next week, we’ll start doing statistical testing on datasets like this, but before we do, summarising your data using “descriptives” is absolutely essential. These descriptives, which are the main focus of today’s tutorial, describe your data, often using measures of central tendency and other ways to quantify the data you have. The summary values you include in your descriptives will vary depending on your dataset and the things you want to know. So, before we do anything else, we need to figure out what we want to know. Have a go at the quiz below, and refer to the Introduction if you don’t remember some details.

Question 1

Which of the following statements best captures the key research question of interest for this study?

Question 2

What is the outcome, or dependent variable? Said another way: which variable contains the quantity we are primarily trying to investigate?

Question 3

What is the predictor, or independent variable? Said another way: which variable are we trying to find out the impact or efficacy of?

 

This setup - comparing some score, behaviour, latency etc. between two independent conditions - is an extremely common and useful study format in Psychology and in science generally. So much so, we’ll continue to work on this same design next week! For now, we still have some thinking to do.

Task 2

Referring to the Introduction and what you have learned about the study design thus far, what descriptive information would be useful in order to investigate the research question? Make a list in your notebook document and explain why you included each element.

Hi again! It’s tempting to skip tasks like this, but we strongly recommend you don’t. The practice will come in handy in the future.

If you need a refresher on descriptive statistics, look back on Lectures 6 and 7 of Psychology as a Science for some good options. There was also a new descriptive measure introduced in this week’s lecture that you may consider including.

To start, we already identified the two key variables we’re interested in from the quiz above. As a reminder, we want to know about the duration of orca attacks when we are playing metal music, vs when we are playing the Shrek soundtrack. Here, the duration of attacks is the key thing we really want to know about. So, how can we summarise, or describe, this variable?

If you’re feeling a bit stumped, the first thing we can do is think about the measures of central tendency from Week 6 of PAAS. In that lecture, there were three key techniques covered: mean, median, and mode. This lecture mentions that the mean is often preferred, because it takes all the observations into consideration and essentially captures the typical value in a sample. (You may have noticed that the mean comes up a lot, and we will see more of it in the future!) That sounds great, so one of the main things we’ll want to include is the mean of attack duration.

In the PAAS lecture, we also saw however that the mean is sensitive to outliers. This means that if there are extreme or unusual values in the dataset that are quite different from all the others, the mean that we calculate for that sample might be very different than it would be if just those few extreme values weren’t there. In other words, if we just looked at the mean, it could be misleading without other information. So, what other information might be useful?
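
To make that concrete, here’s a tiny made-up example (all of the numbers below are invented purely for illustration):

# five fairly typical attack durations plus one extreme value
durations <- c(40, 42, 44, 46, 48, 300)

mean(durations)       # about 86.67 - dragged far upwards by the single outlier
mean(durations[1:5])  # 44 - the mean without the outlier
median(durations)     # 45 - the median is barely affected

Notice how one unusual value shifts the mean dramatically while the median stays put - which is exactly why we look at more than one measure.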

In addition to the mean, it’s a good idea to get some measures of how much the scores are spread around the mean. Some very simple examples of this are the minimum and maximum values - the lowest and highest values in the sample - and the range between them. Since these are the most extreme values, if they are surprisingly small (or surprisingly big), they might tip us off that something’s going on in the data that we need to look into.

A more sophisticated measure is the standard deviation, which is almost always reported alongside the mean. The standard deviation tells us how much the values in the sample differ from the mean of those values, on average. It’s a useful way to capture whether the values tend to be quite similar to the mean, so the mean represents the sample well (in which case the SD will be small); or if the values tend to be quite different from the mean, so the mean doesn’t represent the sample well (in which case the SD will be large).
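
Again, a small invented example might help. The two samples below have exactly the same mean (44) but very different standard deviations:

# two made-up samples with the same mean but very different spread
similar_durations   <- c(42, 43, 44, 45, 46)
scattered_durations <- c(14, 29, 44, 59, 74)

sd(similar_durations)    # about 1.58 - values sit close to the mean
sd(scattered_durations)  # about 23.72 - values are spread far from the mean

In the first sample, the mean of 44 describes every value quite well; in the second, it describes almost none of them.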

It’s often a good idea to also obtain the median, another measure of central tendency. The mean will be our key descriptive measure, but ideally, if our values are normally distributed, the mean and the median should be quite similar to each other. If they aren’t, that might suggest that the data are skewed - that is, there are more or fewer of certain values than we would expect.

In this week’s lecture we also introduced confidence intervals (CIs), which we will explore in more depth in just a bit.

Finally, we’re interested in all of these measures both overall, and for the two music genre groups separately, so we can compare them.

To practice, try making a short list of each of the descriptive measures we’ve talked about here to refer to later on. It would be a good idea to summarise why each one should be included and what information it contributes…you know, in case you need it someday…

Now that we have an idea of what we are doing and why, let’s get going in R!

 

Overall Summaries with summarise()

Before we dive into comparing the two groups, it’s good practice to create some general summaries for our whole sample. Different descriptive statistics are useful in different situations. For our purposes, we want to create a table that contains the values we listed above:

  • Number of cases

  • Minimum and maximum values for attack_duration

  • Mean, median, and standard deviation for attack_duration

  • Confidence intervals for attack_duration

Basic Form

To create summary tables, we can use the summarise() function from the dplyr package. The summarise() function works very similarly to mutate(). We start by telling the function which dataset we want to use with the .data = argument, and then we give instructions for what kind of summary value we want to create and what we want to call the column that contains this value.

The code for creating a basic summary table looks like this:

dplyr::summarise(
  .data = our_dataset, 
  summary_column_name = some_function
)

In the last tutorial, we also introduced the pipe operator, which looks like this: |>. The pipe operator is very handy once our code starts getting bulky - it makes our code more efficient and easier to read. So let’s re-write the summary code above to use the pipe:

some_dataset |> 
  dplyr::summarise(
    summary_column_name = some_function
  )

Notice that the .data argument has now disappeared. That’s because .data is the first argument of summarise() - the pipe automatically passes the object on the left-hand side (in this case some_dataset) into the first argument of the function on the right, so we don’t need to specify it ourselves anymore.
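
If it helps, here’s a tiny illustration of that equivalence using a simple base R function - the two lines below do exactly the same thing:

sum(c(1, 2, 3))      # returns 6
c(1, 2, 3) |> sum()  # also returns 6 - the vector is piped into sum()'s first argument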

Now let’s adapt this code to create some actual summaries.

Counting Cases

We’ll start by counting the number of cases in the dataset, which we’ll store in a new column we can call n_cases. The function we can use for this is the n() function from the {dplyr} package. dplyr::n() is a bit of a funny function, in that it doesn’t take any arguments and can only be used inside of other {dplyr} functions. We therefore don’t need to modify the function itself. In this scenario, we only need to change the dataset name and the summary column name for our code to work.

Task 3

Copy the code below and then replace the dataset name and the summary column name with appropriate values. Run the code in your .qmd file to see the results.

# create a table containing the number of participants
some_dataset |> 
  dplyr::summarise(
    summary_column_name = dplyr::n()
  )

some_dataset needs to be replaced with the name of the dataset we’re using.




summary_column_name can be changed to pretty much anything, but n_cases will do nicely here - it’s brief and descriptive.


The final code should look like this:

orca_tib |> 
  dplyr::summarise(
    n_cases = dplyr::n()
  )
# A tibble: 1 × 1
  n_cases
    <int>
1     164

Complete the blank: In total, there were ___ cases in the sample.

Note that we’re not saving this summary into a new object just yet. We’re going to keep adding more summary values, and then save the finished summary object at the very end, once our code has grown a little.

We’ve previously learned how to get the number of participants using nrow() (see Practical 02). Here we’re using dplyr::n() combined with the dplyr::summarise() function, because it is a much more flexible approach that allows us to create summary tables containing a range of different descriptive statistics, not just the number of participants.
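
Just to see the two approaches side by side (both should return the same count of 164):

# base R approach from Practical 02 - returns a single number
nrow(orca_tib)

# dplyr approach - returns a tibble we can keep adding summary columns to
orca_tib |> 
  dplyr::summarise(
    n_cases = dplyr::n()
  )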

Descriptives

Next up, we’ll calculate some descriptive values that help us understand the data we’ve collected: minimum, maximum, mean, median, and standard deviation of attack_duration, our key variable of interest. We can tackle all of these in one section, because the functions for creating these summaries work in exactly the same way.

The minimum can be obtained using the min() function. The min() function (as well as the other functions in this section) comes from base R, so we don’t need to include a package call in front of it.

The function takes a variable name as an argument, and returns the minimum value. On its own, it can be used as:

some_variable <- c(6, 4, 1, 9, 3, 7)
min(some_variable, na.rm = TRUE)
[1] 1

The minimum value in some_variable is 1, so the code above returns 1. We’ve also added the argument na.rm = TRUE, which ensures that if there are any missing values in the variable, the min() function will ignore them.

This section is skippable if you’re happy to just keep writing na.rm = TRUE in your functions without an explanation.

We can run the code below to understand why not including na.rm = TRUE might be a problem:

orca_tib |> 
  dplyr::summarise(
    min_attack = min(attack_duration)
  )
# A tibble: 1 × 1
  min_attack
       <dbl>
1         NA

The min value in our new summary table is returned as NA - a missing value.

An NA in a summary table often means that there’s something iffy about the variable that we’re trying to summarise - in our case, attack_duration. A value of NA indicates that this variable very likely has some missing values.

We can use the method explained in Tutorial 2 to explore whether attack_duration has missing values. To check for missing values in a particular variable, we can use filter() to return any cases that DO have an NA. If the resulting tibble is empty, with no rows, there were no missing values in that variable. Otherwise, there were.

orca_tib |> 
  dplyr::filter(is.na(attack_duration))
# A tibble: 1 × 6
  attack_id date_of_attack month attack_duration music_genre band     
      <dbl> <date>         <dbl>           <dbl> <chr>       <fct>    
1       464 2023-11-12        11              NA Heavy Metal Powerwolf

Turns out, attack_duration does indeed have one missing value, because this table has one row. Now, we could remove this data point from the dataset using dplyr::filter(), but let’s say that for the time being, we don’t want to discard any data.

The argument na.rm = … allows us to specify how we want to deal with missing values. By default, it is set to FALSE. So if there are any missing values in the variable, the function will NOT ignore them and will instead return NA. For example:

some_variable <- c(6, 4, 1, NA, 3, 7)
min(some_variable, na.rm = FALSE)
[1] NA

The code above returns NA. Because FALSE is the default value for na.rm =, we don’t need to specify it for this behaviour to occur. The function will automatically return NA by default:

min(some_variable)
[1] NA

But this doesn’t help us. We want the function to ignore the missing values, and tell us the smallest value that occurs in that variable disregarding NAs. We can set the value of the na.rm argument to TRUE to achieve this behaviour:

min(some_variable, na.rm = TRUE)
[1] 1

If we want the summary of attack_duration, we need to add the na.rm argument to the min() function in our code.

Let’s add this function to our summary table to find out the minimum value in our dataset for attack_duration.

Task 4

Adapt the code below to create a new summary column called min_attack. The column should contain the minimum value of the attack_duration variable.

# add the minimum value to the summary table
orca_tib |> 
  dplyr::summarise(
    n_cases = dplyr::n(), 
    new_summary_column_name = some_code_that_returns_the_minimum
  )

new_summary_column_name can be, again, any name. You can call it jessica if you’d like and the code will work fine, but that isn’t particularly informative - that is, it doesn’t quickly and succinctly tell you what the column contains. What would be an informative name for a column that contains the minimum value of a variable?




some_code_that_returns_the_minimum is the function that we’re using, which in this case is min(). Scroll a little bit further up to see how it can be used.


The complete code looks like this:

orca_tib |> 
  dplyr::summarise(
    n_cases = dplyr::n(), ## Remember the comma!
    min_attack = min(attack_duration, na.rm = TRUE)
  )
# A tibble: 1 × 2
  n_cases min_attack
    <int>      <dbl>
1     164         12
Remember the comma!

Note how the line with n_cases = dplyr::n() ends with a comma. If we’re adding new summary values after a line of code inside of summarise(), we need to end that line with a comma. The last line, in this case the line that creates the minimum, doesn’t need to end with a comma. dplyr functions like summarise() will (most of the time) happily pretend that the last line doesn’t have a comma even if you accidentally add one. Functions from base R tend to be less forgiving and will often return an error.

Task 5

Add the remaining summary functions. Now that you know how to use the min() function, you can use the same formula to add the other summary values of attack_duration to the table as well. Here are the functions that you’ll need to use:

  • max() returns the maximum value of a variable

  • mean() returns the mean (average) value of a variable

  • median() returns the median

  • sd() returns the standard deviation

All of these functions work in the same way when it comes to missing values, so remember to add the argument na.rm = TRUE for each of them. You can call the new columns in the summary table “max_attack”, “mean_attack”, and so on.

The first part of the code stays the same - you’ll be adding new lines after min_attack = min(attack_duration, na.rm = TRUE),.




For example, to add the line that calculates the maximum value, you can write:

orca_tib |> 
  dplyr::summarise(
    n_cases = dplyr::n(), 
    min_attack = min(attack_duration, na.rm = TRUE), 
    max_attack = max(attack_duration, na.rm = TRUE)
  )

Now try adding the remaining functions listed above in a similar way (always remember to change the column name and the function!)


The complete code looks like this:

orca_tib |> 
  dplyr::summarise(
    n_cases = dplyr::n(), 
    min_attack = min(attack_duration, na.rm = TRUE), 
    max_attack = max(attack_duration, na.rm = TRUE), 
    mean_attack = mean(attack_duration, na.rm = TRUE), 
    median_attack = median(attack_duration, na.rm = TRUE), 
    sd_attack = sd(attack_duration, na.rm = TRUE)
  )
n_cases min_attack max_attack mean_attack median_attack sd_attack
164 12 76 44.88344 44 11.78951

Using the beautiful output you’ve just created, answer the following questions about the data.

Question 4

What was the average duration of an orca attack in the sample?

Question 5

On average, how much did each attack’s duration deviate from the mean?

Confidence Intervals

The final element that we’d like to include in our descriptives are confidence intervals. Make sure you review this week’s lecture if you don’t quite remember what confidence intervals are for, or check out the box below for a quick reminder!

Confidence intervals are useful because they allow us to infer something about the population. The lower and upper limits of a 95% confidence interval give us a plausible range of values for the true population value (assuming our sample is one of the 95% that produce confidence intervals actually containing the true population value).
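
If you like seeing the maths behind it, one common way of constructing a 95% confidence interval around a sample mean (and the one that matches the numbers we’ll see below) is:

\[ \bar{x} \pm t_{0.975,\, n-1} \times \frac{s}{\sqrt{n}} \]

where \(\bar{x}\) is the sample mean, \(s\) is the sample standard deviation, \(n\) is the sample size, and \(t_{0.975,\, n-1}\) is the 97.5th percentile of the t-distribution with \(n - 1\) degrees of freedom. You don’t need to calculate this by hand - the function below does it for us.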

The function that we’re going to use to get confidence intervals into our summary is mean_cl_normal() from the ggplot2 package.

Like the other functions, mean_cl_normal() takes a variable name, though there’s a slight twist to it. The general use of the function is:

some_dataset |> 
  dplyr::summarise(
    ci_lower = ggplot2::mean_cl_normal(variable_name)$ymin,
    ci_upper = ggplot2::mean_cl_normal(variable_name)$ymax
  )

We’re creating two summary columns, called ci_lower (“confidence interval - lower”) and ci_upper (“confidence interval - upper”). The way this function differs from the other functions is that we also need to specify whether we want the lower or the upper limit of the confidence interval. We do this by adding $ymin or $ymax, respectively, at the end of the line.

We can have a look at the example below to get a sense of how the function works:

some_variable <- c(6, 4, 1, NA, 3, 7)
ggplot2::mean_cl_normal(some_variable)
y ymin ymax
4.2 1.235568 7.164432

First thing to note - this function is not fussed about missing values and ignores them by default. More importantly though, the function returns a small summary table with 3 values:

  • y - the mean of the variable (we don’t need this value)

  • ymin - the lower limit of the confidence interval

  • ymax - the upper limit of the confidence interval

This is not going to be particularly helpful if we try to include it in the summarise() function like we’ve been using above. The summarise() function needs only one value for each column that it creates, but instead we have three columns built into a table.

We need to somehow pry out the individual values. This is where the dollar sign $ operator comes in handy. Remember that $ can be used to print values from columns in a dataset. For example if we wanted to print all of the attack_id values in our orca_tib data, we could run:

orca_tib$attack_id
  [1]  12  18  19  21  23  24  26  29  33  34  36  41  42  46  48  55  57  59
 [19]  60  62  65  76  80  86  90  92 100 101 105 106 107 108 110 111 118 122
 [37] 125 131 133 142 143 146 147 158 161 166 167 169 171 174 177 181 183 186
 [55] 189 191 194 196 198 202 207 210 214 217 218 220 221 225 235 236 237 243
 [73] 245 247 248 255 256 257 262 268 270 271 273 275 280 282 285 286 288 290
 [91] 291 296 300 306 308 309 311 313 314 315 316 319 320 321 323 327 328 330
[109] 337 341 348 349 350 351 352 357 358 359 360 363 370 371 373 380 384 385
[127] 398 404 407 409 410 413 414 418 424 425 426 427 428 429 433 434 435 440
[145] 443 444 446 448 449 452 454 458 460 462 463 464 465 466 467 470 477 478
[163] 484 485

We want to print the ymin and ymax values from the table created by mean_cl_normal(). We do so by running:

ggplot2::mean_cl_normal(some_variable)$ymin
[1] 1.235568

and

ggplot2::mean_cl_normal(some_variable)$ymax
[1] 7.164432
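
If you’re curious where those two numbers come from, here’s a rough sketch that rebuilds them by hand from the mean, standard deviation, and sample size (purely to demystify the output - you won’t need to do this yourself):

some_variable <- c(6, 4, 1, NA, 3, 7)

x <- some_variable[!is.na(some_variable)]  # drop the missing value, like mean_cl_normal() does
n <- length(x)                             # 5 observations remain
se <- sd(x) / sqrt(n)                      # standard error of the mean
t_crit <- qt(0.975, df = n - 1)            # t quantile for a 95% interval

mean(x) - t_crit * se  # about 1.24 - matches $ymin above
mean(x) + t_crit * se  # about 7.16 - matches $ymax above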

Task 6

Create a summary table containing the number of cases, as well as the minimum, maximum, mean, median, and standard deviation, and the lower and upper limits of the confidence interval for attack duration.

  • Save the result into an object called orca_sum.

  • Inspect orca_sum to see the results.

You can add new lines in the same way as previously - i.e. the general form of each new line is:

column_name = some_function.




For the lower confidence interval, the form of some_function is ggplot2::mean_cl_normal(some_variable)$ymin, where some_variable is replaced with the name of the variable you’re interested in.




For upper CI, the code is almost identical, but you need to change ymin to ymax (and, of course, the column name).


The final code is below. Note two things here:

  1. We’re saving the result into a new object using the assignment operator, with orca_sum <- on the left-hand side. (Be careful NOT to assign the output to the same name as the dataset, or you will overwrite your dataset!)
  2. At the end of the code, we’re calling the object orca_sum because we want to see the results. If your code runs without error but doesn’t display anything, it might be because you haven’t called the object.
# create a summary table with n, min, max, mean, median, sd, ci_lower, ci_upper
orca_sum <- orca_tib |> 
  dplyr::summarise(
    n_cases = dplyr::n(), 
    min_attack = min(attack_duration, na.rm = TRUE), 
    max_attack = max(attack_duration, na.rm = TRUE), 
    mean_attack = mean(attack_duration, na.rm = TRUE), 
    median_attack = median(attack_duration, na.rm = TRUE), 
    sd_attack = sd(attack_duration, na.rm = TRUE), 
    ci_lower = ggplot2::mean_cl_normal(attack_duration)$ymin,
    ci_upper = ggplot2::mean_cl_normal(attack_duration)$ymax
  )

# inspect the results
orca_sum
n_cases min_attack max_attack mean_attack median_attack sd_attack ci_lower ci_upper
164 12 76 44.88344 44 11.78951 43.05993 46.70694

Question 6

Complete the interpretation of the confidence intervals based on the results of the code above. Round to 2 decimal places:

“Assuming our sample is one of the 95 percent producing confidence intervals that contain the true population value, then the average value for orca attack duration in the population lies between ___ and ___.”

Grouped Summaries

We’re interested in comparing attack_duration between attacks where the crew played heavy metal music and attacks where they played the Shrek soundtrack, so it would be useful to have all of the summary values above calculated separately for each group.

One approach would be to use the dplyr::filter() function - we could create two separate datasets, one for heavy metal, one for Shrek, and then compute the summaries from above for each of the “subdatasets”.

But this would be quite a wordy approach with a lot of code repetition. Lucky for us, there’s a much more efficient way to create grouped summaries.

To tell R that we want summaries calculated separately for different groups, we can use the group_by() function from {dplyr}. When a dataset is piped into group_by(), any calculations that follow will be carried out separately within the groups defined by the variable specified as an argument to group_by(). For the summarise() function, this means that any summaries will be computed separately for each group. The general use of the function is:

dataset_name |> 
  dplyr::group_by(some_grouping_variable) |> 
  dplyr::summarise(
    summary_column_name = instruction_to_compute_summary_value
  )

We’re starting as before - by specifying the dataset we want to use. But before we move on to summarise, we add another line into the pipeline that specifies the grouping. Note that the second line also ends with a pipe. We’re basically taking the grouped dataset we’ve created with the second line, and piping it into the dplyr::summarise() function.

In our case, the grouping variable is music_genre, which is a categorical variable denoting which type of music was played during the attack.
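
For instance, a cut-down version that only counts the number of attacks in each genre would look something like this (the full task below asks for all of the descriptives):

orca_tib |> 
  dplyr::group_by(music_genre) |> 
  dplyr::summarise(
    n_cases = dplyr::n()
  )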

Task 7

Create a grouped summary table that contains descriptives of attack_duration computed separately by music genre.

  • As above, include number of attacks, minimum, maximum, mean, median, standard deviation and confidence intervals.

  • Save the result into an object called orca_sum_grouped.

  • Print the result and inspect it.

The code for this task is almost entirely identical to the code from the previous task - but remember to change the name of the object you’re assigning into. This time, the left-hand side of <- should say orca_sum_grouped, not orca_sum. This is because you don’t want to overwrite the orca_sum object you previously created.

Remember to call this object at the end of the code chunk to display the results.


Next


The line you’re adding just after the first line (before the line starting with dplyr::summarise) needs to group the output by the variable music_genre, and end with a pipe.


# create a grouped summary table with n, min, max, mean, median, sd, ci_lower, ci_upper

orca_sum_grouped <- orca_tib |> 
  dplyr::group_by(music_genre) |> 
  dplyr::summarise(
    n_cases = dplyr::n(), 
    min_attack = min(attack_duration, na.rm = TRUE), 
    max_attack = max(attack_duration, na.rm = TRUE), 
    mean_attack = mean(attack_duration, na.rm = TRUE), 
    median_attack = median(attack_duration, na.rm = TRUE), 
    sd_attack = sd(attack_duration, na.rm = TRUE), 
    ci_lower = ggplot2::mean_cl_normal(attack_duration)$ymin,
    ci_upper = ggplot2::mean_cl_normal(attack_duration)$ymax
  )

# inspect the results
orca_sum_grouped
music_genre n_cases min_attack max_attack mean_attack median_attack sd_attack ci_lower ci_upper
Heavy Metal 89 12 71 40.20455 40.5 11.01009 37.87173 42.53736
Shrek 75 25 76 50.37333 50.0 10.25588 48.01367 52.73300

In the output, we get (almost) exactly the same columns as before, which are the ones that we created in the summarise() function. In addition to the columns we created, we also have a new column at the start: music_genre, which contains all the unique values of the music_genre variable in orca_tib. Then, thanks to group_by(), we also have more rows than before: one for each unique value in the music_genre variable. So, all of the descriptives in this table are calculated in the same way as in the overall table, just separately within each music genre group. This allows us to compare the attack duration between music genres - finally getting at our original research question!

Use the grouped summary in orca_sum_grouped to answer the following questions.


Question 7

Which music genre had more orca attacks recorded?

Question 8

Which genre had higher average duration of the attacks?

Question 9

Which genre showed greater variability in attack duration?


Formatting Tables

Now we have a summary table (or, rather, tibble) that contains all the information we said we wanted to answer our research question. The last step is to format it nicely, so that we can present it in a document or presentation. Raw tibble output is difficult to read and not appropriate for formal reporting, so we’re going to apply some formatting to turn our orca_sum_grouped summary into a beautifully formatted HTML table.

To do this, we’re going to use two functions: knitr::kable() and kableExtra::kable_styling(). These functions together transform our functional but ugly tibble of summary scores into a nicely formatted summary table. All we have to do is pipe our summary tibble into the knitr::kable() function, and then on again into kableExtra::kable_styling(). Have a look:

orca_sum_grouped |> 
  knitr::kable() |> 
  kableExtra::kable_styling()
music_genre n_cases min_attack max_attack mean_attack median_attack sd_attack ci_lower ci_upper
Heavy Metal 89 12 71 40.20455 40.5 11.01009 37.87173 42.53736
Shrek 75 25 76 50.37333 50.0 10.25588 48.01367 52.73300

If you’re wondering why the output above looks the same as all the other output in these tutorials, that’s because we’ve been secretly using kable() behind the scenes to make the tables look nice in these tutorials for you to read. To see the difference this makes, you’ll need to compare the output of the unformatted tibble orca_sum_grouped vs the output of the code above in your own Quarto notebook.

If you are using a dark theme for RStudio, you may find that running the code above in your Quarto document just produces a blank white table. This is a side effect of the dark theme, unfortunately. You can see the values by highlighting them with your mouse, or by switching to a light theme instead (in Tools > Global Options > Appearance).

This is an improvement - but we can definitely do better!

  • First, the column names are pretty ugly. Variable names like music_genre are great for working with data in R, but they’re not acceptable for formal presentation. Instead, we should replace them with human-readable names, like “Music Genre”.

  • Second, some of the values in our table have quite a few digits after the decimal point. APA style is to round to two decimal places, so we should do that for any long decimals in our table.

  • Finally, to help anyone who might want to read this table, we should include a short caption explaining what is in the table at a quick glance.

Lucky for us, the knitr::kable() function contains arguments to do all three of these things!

Task 8

Open the help documentation for the knitr::kable() function to find out the names of the arguments to make the changes described below, then make those changes to output a nicely formatted HTML table.

  • Replace the existing column names with nicely formatted, human-readable ones.
  • Round all digits to two decimal places.
  • Add a caption.

To open the help documentation, run ?knitr::kable() or help(knitr::kable) in the Console.




The Arguments section of the help documentation will be the most useful here. Which three arguments correspond to the three elements we want to change? If you’re running into errors, try adding them one at a time and running the code to see what it does, instead of adding them all at once.




The col.names argument needs a vector of new names, which will be applied, in order, to the columns in the table. You can create this vector of names using c(), like this:

c("Column 1 Name", "Column 2 Name", "And So On")

This vector of names must have the same number of names as there are columns in the table, or you will get a weird error: length of 'dimnames' not equal to array extent. If you need to, you can call the orca_sum_grouped object or run names(orca_sum_grouped) to remind yourself how many columns there are and what they contain.




The digits argument works the same way as the digits argument in the round() function - it just needs a number for the number of decimal places to round to.
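
As a quick reminder of what that kind of rounding does:

round(44.88344, digits = 2)  # returns 44.88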




The caption argument needs a single string in “quotes” providing the caption for the table. This can be anything you like, but it’s nice if it’s succinct and informative.


orca_sum_grouped |> 
  knitr::kable(
    col.names = c("Music Genre", "N", "Min", "Max", "Mean", "Median", "SD", "CI (lower)", "CI (upper)"),
    digits = 2,
    caption = "Descriptive statistics for attack duration by music genre played"
  ) |> 
  kableExtra::kable_styling()
Descriptive statistics for attack duration by music genre played
Music Genre N Min Max Mean Median SD CI (lower) CI (upper)
Heavy Metal 89 12 71 40.20 40.5 11.01 37.87 42.54
Shrek 75 25 76 50.37 50.0 10.26 48.01 52.73

 

Very well done today! This is the end of the required material for this week.

There are two more sections below: an optional section on interpreting confidence intervals in plots, and the weekly ChallengR. As always, we encourage you to give them a try!

 

Visually Interpreting CIs

Optional!

This section is entirely optional content. Working through it will help you when reading papers and evaluating research in general, but you won’t be assessed on it on this module. However - you will need it if you want to attempt this week’s ChallengR!

So far we haven’t tested any hypotheses - we’re going to leave that exciting adventure for the upcoming weeks - but to start thinking about it, let’s explore how you can use confidence intervals to guess whether two groups might differ at a statistically significant level.

Confidence intervals are typically represented on a plot by “whiskers”, or lines sticking out above and below the mean (which is often represented by a dot or similar small point). The length of the whiskers corresponds to the length of the confidence interval, with the tips of the whiskers ending at the upper and lower bounds of the confidence interval. The “interval” stretches the whole width from the lower to the upper bound, with the mean right in the middle.

The general rule of thumb is this: if the intervals of two groups overlap by less than half of the whisker (half the length of the line going from the point to the end of the interval), then the difference between the two groups will typically be statistically significant at \(\alpha\) = 0.05. If the overlap is just under half of the whisker, the p-value for that comparison will be just under 0.05. Less overlap will generally be associated with a smaller p-value.

Some examples might give you a better idea of how this works.

Here we have a comparison of two groups. The shaded portion of the plot indicates the space between the ends of the confidence intervals. In this case, the confidence intervals don’t overlap at all, so this difference is very likely statistically significant:

The figure below shows an example with some level of overlap. This time, the shaded portion shows how much the two intervals do overlap. This is still not half of the whisker length, so we can assume that the p-value for this comparison would also be less than 0.05, and therefore the difference could be considered statistically significant.

For contrast, the example below is very ambiguous. Eye-balling it, the overlap looks like about a half of each whisker. The p-value would likely hover somewhere around 0.05, either a little bit more or a little bit less.

Finally, the figure below shows a situation where the overlap is quite substantial. The difference in orca attack duration between these two groups is unlikely to be statistically significant at 0.05 (in other words, the p-value for this comparison would likely be more than 0.05).

Well done for making it this far! You’ve already learned so much on the module and you’re making great progress.

Not had enough? We love to hear it! Have a go at the ChallengR below.

ChallengR

ChallengR Time!

This task is a ChallengR. ChallengRs are always optional and will never be assessed - they’re only there to inspire you to try new things! If you solve this task successfully, you can earn a bonus 2500 Kahoot Points. You can use those points to earn bragging rights and, more importantly, shiny stickers. (See the Games and Awards page on Canvas.)

In order to attempt this week’s ChallengR, you will need to have read through the previous optional section on visually interpreting confidence intervals.

There are no solutions in this document for this ChallengR task. If you get stuck, ask us for help in your practicals or at the Help Desk, and we’ll be happy to point you in the right direction.

To further understand the differences between the effects of heavy metal and Shrek music on orca attack duration, we might be interested in knowing whether there are differences in durations for different heavy metal bands.

Our dataset contains a variable called band which specifies whether the music that orcas heard was by Iron Maiden, Powerwolf, Opeth, Saor, or Lord of the Lost. That’s quite a few categories, so having all of the summaries for all these groups in a table might get a little overwhelming. Instead, we can create a (gg)plot - which will also allow us to visually compare the confidence intervals for these groups.

{ggplot2} Revision

This task requires you to create a plot using the {ggplot2} package. This package was introduced at the end of PAAS, but it’s the first time we’ve used it on this module. If you want a refresher, you might want to look back on weeks 10 and 11 of PAAS and the discovr_05 tutorial on visualising data with {ggplot2}. We will also start creating plots in this module regularly, so this will be excellent practice for the coming weeks!

For this plot, we’re going to look at the mean differences among the categories of the band variable. We’re interested in the differences in mean attack_duration and the respective confidence intervals.

Task 9

Use the orca_tib data to create a plot (using the ggplot2 package) with the attack_duration on the y axis, band on the x axis, with means and confidence intervals for each group.

The plotting function you can use here is stat_summary(fun.data = ????), which will plot the mean and CIs of the groups. Here, the ???? is the same function that we used previously in this tutorial to produce confidence intervals for our table.

We’re now in {ggplot} territory, so as a reminder, this is what ggplot code typically looks like:

some_dataset |> 
  ggplot2::ggplot(aes(x = variable_for_the_x_axis, y = variable_for_the_y_axis)) + 
  plotting_function()



Often the grouping variable goes on the x axis and the continuous/outcome variable goes on the y axis.




The variable band goes on the x axis, and attack_duration goes on the y axis.


Once you have your plot, take the Week 4 ChallengR quiz on Canvas and use the plot to answer the questions. Good luck, and well done again!

Footnotes

  1. Also known by a more cuddly name - killer whales.