# load tidyverse:
library(tidyverse)
# load data:
<- readr::read_csv("data/tea_position.csv") chi_tea
Tutorial 07: Chi-square
In Hot Water
You may remember this horrific abomination interesting interpretation of the tea-making process that blew up a couple years ago. Many people, including my (Jennifer’s) English partner1, were thoroughly scandalised, but it raises an interesting question: does changing the order of the steps actually change the taste?
Here’s the video in question, if you would like a nice steaming cup of absolute outrage.
Americans making hot tea 🍵 #americanintheuk @mleemaster10
You may know someone - or be that someone yourself! - who is convinced that you can tell the difference when someone makes tea “wrong”. Well, whether there is a right or wrong way to make a cuppa is a matter of opinion2, but whether someone can accurately identify a difference in taste is a matter of science. So - let’s design an experiment and then analyse the data to find out!
This particular research question is very appropriate for our statistical test today. In 1935, botanist Dr Blanche Muriel Bristol claimed that she could tell whether the milk had been added before or after the tea. Her claim was apparently substantiated by her subsequent performance on eight randomly presented cups of tea, four with the milk first and four with the tea first. The person testing her on this claim was dapper young scientist, future founding father of statistical analysis, and resolute eugenicist Ronald Fisher, who subsequently described a version of this experiment in the 1935 paper “A Lady Tasting Tea”. This paper set out Fisher’s idea of the null hypothesis for the first time.
Data and Design
Our experiment today will be based on this excellent investigation by students at the University of Sheffield, which includes full descriptions of the results and instructions for doing your own tea taste test with your family, friends, or flatmates. In case you’re not up for making a few hundred cups of tea in the name of Science, we’ll use that study’s data instead, graciously provided by Dr Tom Stafford.
Task 1
Load packages and read data
- Load the tidyverse package
- Read in the data stored in
data/tea_position.csv
. I’ve called minechi_tea
, but you do you.
- Have a poke around this dataset to get familiar with it. How many participants were there in this experiment? How many variables? What are the variables called?
As we discussed in the lecture, a test of association (also called a test of independence) investigates whether two variables are associated with each other. In this case, we want to know whether there is an association between the location of the milk-first cup of tea (left or right), and the participant’s belief about the location of the milk-first cup of tea (left or right).
Task 2
Stop and think about this design before we go on
- What are the null and alternative hypotheses?
- What pattern would we expect to see in the data in each of these cases?
- What do you predict the result will be - do you think that people will be able to correctly identify the milk-first tea? Do you think you could?
- Write down your thoughts in the notebook document.
Visualising Counts with Bar Charts
Now that we have our predictions nailed down, let’s have a look at the data! As usual, it’s always important to visualise your data first thing.
Let’s build a beautiful plot to represent the count data we have. We’ll start with a basic plot and build it up from there to create a publication-worthy graph. Along the way, we’ll learn a few new options in ggplot2
. In the end, it will look like this:
Task 3
Create a bar chart.
Start by creating a basic bar chart of the actual_position
variable like the one below, and adding a theme to spruce it up.
- Create a base layer of a
ggplot
, withactual_position
on the x axis. - Add
+ geom_bar()
to the base layer. - Add a theme
- In the
aes()
function, add afill = believe_position
to split the x axis by this variable:
Hm…that doesn’t look quite right. Here, the fill
argument has filled the bars with different colours. This doesn’t make comparison very easy, so instead I want to split the two bars into four, based on their value of the second variable, believe_position
. In other words, I want geom_bar
to draw the bars in different position
s.
- Add
position = "dodge"
to thegeom_bar
function.
Hey, that’s looking better already! That’s basically the information that we want, and this would be enough if we were just using this graph to visualise the data for our own purposes. However, we’re practicing making our plots ✨snazzy✨, so let’s keep working on making it look good.
- Change the labels using the functions
scale_x_discrete()
,scale_y_continuous()
,scale_fill_discrete()
.
scale_x_discrete()
to change the name and labels on the x-axisscale_y_continuous()
to change the name, limits, and breaks on the y-axisscale_fill_discrete()
to change the name and labels on the legend
- Change the colours: Add a
type = c(...
argument toscale_fill_discrete()
to change the colours. Remember you will need to give two colours, one for “left” and one for “right”.
Now that’s looking good!
scale_...()
functions
We’ve just seen a couple of examples of functions with different names, that all follow a specific pattern. These scale_...()
functions adjust the settings of the axes, as we just saw, but you have to choose the right one based on two elements:
Which axis you want to change. This is the middle element, between the underscores. Here we had
_x_
for the x-axis,_y_
for the y-axis, and_fill_
for the grouping variable. Notice that these are the same names as the arguments in theaes()
function! For example, since we putbelieve_position
in thefill =
arguments inaes()
, we have to use thescale_fill_...()
function to change its settings.What kind of data you have. This is the last element, before the round brackets. Here we saw
_discrete
for categorical data, and_continuous
for continuous data.
In order to choose the correct scale_...()
variable, you have to know both of these things! R will complain if you try to use a function that doesn’t match up with the data you have.
Task 4
- Interpret the results based on the plot
- Write down your thoughts in your Quarto Doc
Chai-Square3 Analysis
So, we’ve made our predictions, and we’ve had a look at the data. Now we’re ready to conduct our \(\chi^2\) test.
Task 5
Perform the \(\chi^2\) analysis
- Use the
chisq.test()
function to perform a \(\chi^2\) test of association on thechi_tea
data. The way to specify the variables is a little different than we’re used to - check the help documentation if you get stuck! - Save your test output to a new object,
chi_tea_test
, to refer to it later. - Print the results
Pretty painless, eh?
Reporting \(\chi^2\) Analysis
Now that we’ve got our test result, let’s report it in APA style. This takes the general form:
(name_of_estimate(degrees_of_freedom) = value_of_estimate, p = exact_p, 95% CI [lower_bound, upper_bound])
Task 6
Report the results of the analysis
- Use the general format above and the values to type out the result of the \(\chi^2\) test.
- We should also describe in words what this result means. We essentially include the statistical result as a citation to give evidence for our claim. Have a go writing out a report of this statistical analysis.
That’s good progress! However, as mentioned in the lecture, just this test isn’t enough to fully understand the results of our study. This only tells us whether the two variables were associated, not the direction of the association. For a complete report, we should also include the observed frequencies.
You should never have any raw (unformatted) code or output in a report! Always report results in the text, in a figure, or in a formatted table.
Observed and Expected Frequencies
To finish off our reporting, it’s good practice to report the actual numbers or frequencies that went into the chi-squared test. Luckily, we can get this very easily out of the chi_tea_test
object we created earlier.
Task 7
Obtain and report observed and expected frequencies
- Use
str()
to look at the structure of thechi_tea_test
object.
- Print out the table of observed frequencies by pulling out
$ observed
from thechi_tea_test
object.
- Check the
expected
frequencies for your test. Is there a problem? - Put all of this together into one report of your findings that includes information about frequencies.
You might notice that the observed values are the same counts that appeared in our graph up above. Even though this information is presented visually there, it’s still a good idea to include these numbers in your report.
htest
objects
The object we’ve created to store the results of our test, which we’ve called chi_tea_test
, is in essence just a list of values that were calculated by the test. This particular list has a special class, htest
, for hypothesis tests. If you look back on previous work, you’ll see that our objects containing the results from t.test()
and cor.test()
were also htest
objects.
The benefit of being familiar with htest
objects is that they all work in a similar way. They are all lists of information - more information than it appears when you just call the object and print out the test results!
In this case, our output from chisq.test()
has some elements that are incredibly useful but not printed by default: $observed
and $expected
, which contain the observed and expected frequencies respectively. So, it’s a good idea to have a look inside these objects when you are reporting - it will often make your life a bit easier!
Recap
Well done conducting your \(\chi^2\) analysis. I highly recommend reading all of the T3 Team’s findings - they asked a lot of interesting questions and have done a great job presenting their results.
In sum, we’ve covered:
How to create bar graphs of count data
How to run the \(\chi^2\) analysis, read the output, and store it in an object
How to report and interpret the statistical results of these tests in real-world terms
Remember, if you get stuck or you have questions, post them on Discord, or bring them to practicals or to drop-ins.
Good job!
That’s all for today. See you soon!
ChallengR
This task is a ChallengR, which are always optional, and will never be assessed - they’re only there to inspire you to try new things! If you solve this task successfully, you can earn a bonus 2500 Kahoot Points. You can use those points to earn bragging rights and, more importantly, shiny stickers. (See the Games and Awards page on Canvas.)
There are no solutions in this document for this ChallengR task. If you get stuck, ask us for help in your practicals or at the Help Desk, and we’ll be happy to point you in the right direction.
Task 8
Is there a statistically significant association between the types of game played (shooter vs RPG) and different age categories (under 35, 35-45, over 45)?
Your task is to answer the research question above. In doing so, complete the following steps and then answer the questions in the Week 7 ChallengR Canvas quiz.
- Read in the data using the code below.
- Check whether you need to make any changes to the dataset before running the test. If so, carry out these changes.
- Create a summary table that shows you the counts of individuals within the two game types and the three age categories.
- Run a statistical test to answer the research question.
- Optional: Create an appropriate data visualisation to help you interpret the results.
# load data:
<- readr::read_csv("data/video_games_data.csv") games_tib
Once you’re happy with your code, complete the quiz on Canvas to those well-deserved Kahoot points: Week 7 ChallengR quiz on Canvas. Good luck, and well done again!
Footnotes
In L’s words, just now watching it again: “How did she get every single bit of it wrong?” And L doesn’t even drink tea!↩︎