Skills Lab 05: t-test

Author

Important

The sample take-away paper is now live! You can find it in any of the following folders on Posit Cloud: Week 5, Week 6, Take-away paper. This project will keep jumping out at you at every opportunity, because it’s absolutely crucial that you attempt to complete the tasks in the sample TAP on your own. This way you will get a sense of what’s going to be expected from you in the real take-away paper. We’ll go over some of the tasks together in next week’s skills lab, but there won’t be time to cover every task. If you get stuck while working on this on your own, you can post on Discord, ask us in a practical session or book into a drop-in meeting.

Setup

Packages and data

Load the necessary packages:

library(tidyverse)
library(ggrain)

Data

Load the data:

smarvus_tib <- readr::read_csv("data/smarvus_data.csv")

Codebook

ricomisc::rstudio_viewer("smarvus_codebook.html", "data")

Hypothesis

We’re going to be testing the following hypothesis:

Participants who attended in-person practicals will score lower on statistics anxiety related to asking for help compared to participants who attended practical classes online.

Task 1: Quick data cleaning

We want to compare statistics anxiety for online and in-person attendees.

Which variables from the dataset can we use here?
Are the variables in the right format? Can we use them as they are, or do we need to do something first?
Perform any necessary data cleaning and save the result into a new dataset called smarvus_prac

smarvus_prac <- smarvus_tib |> 
  dplyr::filter(in_person_practicals %in% c("Online", "In-person"))
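
Before filtering, it helps to look at which categories the grouping variable actually contains, and to check afterwards that only the two groups of interest remain. Below is a minimal sketch of that check. It uses a small made-up tibble (toy_tib) so that it runs on its own; the "Both" label and the missing value are assumptions for illustration and may not match the labels in the real data, so check the codebook.

```r
library(dplyr)

# Made-up rows for illustration only; the real data live in smarvus_tib.
# "Both" and NA are assumed examples of rows we would not want to keep.
toy_tib <- tibble::tibble(
  in_person_practicals = c("Online", "In-person", "Both", NA, "Online", "In-person"),
  stars_ask = c(3.2, 2.5, 2.8, 3.0, 3.6, 2.1)
)

# Step 1: see which categories exist and how many rows each one has
toy_tib |>
  dplyr::count(in_person_practicals)

# Step 2: keep only the two groups named in the hypothesis
toy_prac <- toy_tib |>
  dplyr::filter(in_person_practicals %in% c("Online", "In-person"))

# Step 3: confirm that only those two groups remain
toy_prac |>
  dplyr::count(in_person_practicals)
```

Note that %in% returns FALSE for missing values, so rows with NA in the grouping variable are dropped by the filter as well.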

Task 2: Data viz

Data visualisation is an important tool for exploring the data and spotting patterns that we’re interested in. Before plunging into ggplot2, have a think about the following:

What kind of things might we want the plot to show (and why)? (the answer is often similar to what we would want in a good summary table!)
What type of plot could we use? Bar plot? Histogram? Dot plot? Box plot? Something else?
Which variable can we plot on the x axis and what can we put on the y axis?
Which colour(s) should we use? [write “html colour picker” into google]

smarvus_prac |> 
  ggplot2::ggplot(data = _, aes(x = in_person_practicals, y = stars_ask)) + 
  stat_summary(fun.data = "mean_cl_normal") + 
  scale_y_continuous(limits = c(0,5), breaks = seq(0, 5 ,1)) + 
  labs(x = "Practical mode", y = "Statistics anxiety - asking for help (1-5)") + 
  theme_light()
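
The numbers behind a means-and-confidence-intervals plot can also be laid out as a summary table. The sketch below is one way to do this with dplyr, assuming the usual t-based 95% confidence interval for a mean (mean plus or minus the critical t value times the standard error). The scores in toy_prac are made up so the code runs on its own, and the column names mean_ask, sd_ask, ci_lower and ci_upper are our own choices.

```r
library(dplyr)

# Made-up scores for illustration only; swap in smarvus_prac for the real thing
toy_prac <- tibble::tibble(
  in_person_practicals = rep(c("In-person", "Online"), each = 5),
  stars_ask = c(2.0, 2.5, 3.0, 2.5, 2.0, 3.0, 3.5, 2.5, 3.0, 4.0)
)

toy_summary <- toy_prac |>
  dplyr::group_by(in_person_practicals) |>
  dplyr::summarise(
    n = sum(!is.na(stars_ask)),                # number of non-missing scores
    mean_ask = mean(stars_ask, na.rm = TRUE),  # group mean
    sd_ask = sd(stars_ask, na.rm = TRUE),      # group standard deviation
    # t-based 95% confidence interval around the group mean
    ci_lower = mean_ask - qt(0.975, df = n - 1) * sd_ask / sqrt(n),
    ci_upper = mean_ask + qt(0.975, df = n - 1) * sd_ask / sqrt(n)
  )

toy_summary
```

A table like this shows the centre and the uncertainty for each group, but not the shape of the distribution.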

Task 3: Raincloud plots

The simple plot above is fine but it is limited. It hides a lot of information: for example, we don’t know what the distribution of scores in each group looks like. Confidence intervals are also not great for showing the dispersion of the data, so we might want to think of some alternatives to plot.

Use the ggrain package to create a “raincloud” plot.
Change the fill aesthetic to be split by the modality of practical sessions (different colour for in-person vs online)
Add the means and confidence intervals to display on top of the raincloud plot
Add informative labels
Select appropriate scale and breaks for the y axis
Change the default theme

At the very least, we could create something like this. Note that technically speaking, fill is not necessary here because the x axis splits the two groups clearly enough. We’re adding it for practice, and so that you have example code showing how fill works:

smarvus_prac |> 
  ggplot2::ggplot(data = _, aes(x = in_person_practicals, 
                                y = stars_ask, fill = in_person_practicals)) + 
  ggrain::geom_rain(alpha = 0.5) + 
  stat_summary(fun.data = "mean_cl_normal", colour = "green") + 
  scale_y_continuous(limits = c(1,5), breaks = seq(1, 5 ,1)) +
  labs(
    x = "Practical mode", 
    y = "Statistics anxiety - asking for help (1-5)", 
    fill = "Practical mode", 
    ) + 
  theme_light()

Now, it’s understandable that you might want to add more pizzazz to your plot. I’ve added the code below with more examples of how you can customise a ggplot, but please note that every additional line of code on the plot below is extra and for your own learning only. Take a look at the code and give us a shout on Discord or in a practical session if you’d like to talk through some of this.

smarvus_prac |> 
  ggplot2::ggplot(data = _, aes(x = in_person_practicals, 
                                y = stars_ask, 
                                fill = in_person_practicals, 
                                colour = in_person_practicals)) + 
  ggrain::geom_rain(alpha = 0.5, point.args = list(alpha = 0.05)) + 
  stat_summary(fun.data = "mean_cl_normal") + 
  scale_y_continuous(limits = c(1,5), breaks = seq(1, 5 ,1)) +
  scale_fill_manual(values = c("#00948f", "#7100bd")) + 
  scale_colour_manual(values = c("#00948f", "#7100bd")) + 
  labs(
    x = "Practical mode", 
    y = "Statistics anxiety - asking for help (1-5)", 
    ) + 
  guides(fill = "none", colour = "none") + 
  theme_light()

Task 4: t-test

Perform a t-test to test the hypothesis specified at the beginning
What’s the mean difference between the two practical attendance modes?
Is this difference statistically significant?
Report the results using the following form:

estimate_name(degrees_of_freedom) = estimate_value, p = p_value, M_diff = difference_in_means, 95% CI = [CI_lower, CI_upper]

smarvus_prac |> 
  t.test(stars_ask ~ in_person_practicals, data = _)


    Welch Two Sample t-test

data:  stars_ask by in_person_practicals
t = -4.1088, df = 530.36, p-value = 4.607e-05
alternative hypothesis: true difference in means between group In-person and group Online is not equal to 0
95 percent confidence interval:
 -0.4699718 -0.1659371
sample estimates:
mean in group In-person    mean in group Online 
               2.620879                2.938834

We need to work out the mean difference from this bit of the output:

mean in group In-person    mean in group Online 
               2.620879                2.938834

We can subtract the means from each other (in-person minus online): 2.620879 - 2.938834 = -0.317955. Rounded to two decimal places, the mean difference is -0.32. The confidence interval reported in the output is an interval for this difference:

95 percent confidence interval:
 -0.4699718 -0.1659371
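
Instead of copying numbers from the printed output by hand, the same values can be pulled out of the object that t.test() returns. Here is a minimal sketch, using simulated scores so that it runs on its own. The data in toy_prac are made up and will not reproduce the results above; the object names toy_test, m_diff and ci are our own.

```r
set.seed(5)  # makes the simulated scores reproducible

# Simulated scores for illustration only; these are NOT the real data
toy_prac <- data.frame(
  in_person_practicals = rep(c("In-person", "Online"), each = 30),
  stars_ask = c(rnorm(30, mean = 2.6, sd = 0.9), rnorm(30, mean = 2.9, sd = 0.9))
)

# Save the test result instead of only printing it
toy_test <- t.test(stars_ask ~ in_person_practicals, data = toy_prac)

# Group means, in the same order as in the printed output
toy_test$estimate

# Mean difference: first group minus second group (In-person minus Online)
m_diff <- unname(toy_test$estimate[1] - toy_test$estimate[2])

# Confidence interval for that same difference
ci <- as.numeric(toy_test$conf.int)

round(c(m_diff = m_diff, ci_lower = ci[1], ci_upper = ci[2]), 2)
```

The test statistic, degrees of freedom and p-value are stored in the same object as toy_test$statistic, toy_test$parameter and toy_test$p.value.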

The p-value is reported as 4.607e-05. We can convert this into ordinary decimal notation (which is much easier to read) by moving the decimal point 5 places to the left (5 because of e-05). 4.607e-05 therefore becomes 0.00004607. Typically, a p-value is reported to 3 decimal places. Because this value is smaller than .001, we report it as p < .001 (we’re omitting the leading zero here).
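
R can do this conversion for us. The sketch below first prints the p-value without scientific notation, then wraps the reporting rule in a small helper. The function report_p() is a hypothetical helper written for this example, not part of any package.

```r
p_value <- 4.607e-05

# Print the value without scientific notation
format(p_value, scientific = FALSE)

# Hypothetical helper: formats a p-value for a write-up
report_p <- function(p) {
  if (p < 0.001) {
    # Very small values are reported as an inequality
    "p < .001"
  } else {
    # Three decimal places, with the leading zero removed
    paste0("p = ", sub("^0", "", formatC(p, format = "f", digits = 3)))
  }
}

report_p(p_value)  # the value from the output above
report_p(0.032)    # a made-up larger value, for comparison
```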

Now we’ve got everything we need to report the result of the test. We can write something along the lines of:

Participants who attended in-person practicals scored lower on statistics anxiety related to asking for help than participants who attended online, and this difference was statistically significant, t(530.36) = -4.11, p < .001, M_diff = -0.32, 95% CI = [-0.47, -0.17].

Render!

Render your document to see it in all its glory!

References

A really fun meta-research paper discussing why bar charts are often not a great idea: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128