Skills Lab 07: Correlation and Chi-square

Google doc: linktr.ee/analysingdata

R meet-up: Friday 11am at Pev 1 2D11

Setup

Packages and data

Load the necessary packages:

library(tidyverse)
library(GGally) # for nice correlation plots

Data

Load the data:

games_tib <- readr::read_csv("data/video_games_data.csv")

Variables in the dataset:

id: Participant’s ID
age: Participant’s age
game: Name of the video game
game_type: Game classification as “Shooter”, “Sports game”, “RPG” or “Animal crossing”
affect: Level of emotional affect measured from -6 (most negative) to +6 (most positive)
affect_cat: Categorical version of the affect variable, with values “Negative” or “Positive”
life_sat: Life satisfaction
experience: Experience of playing video games (0-100)
hours: Hours spent playing video games per week

Correlation

Task 1: Create a correlation matrix

A good first step in an analysis is to explore the associations between variables
Select the continuous (numeric) variables in the dataset. Save this selection of columns in a new object called games_tib_cor
Create a correlation matrix (either numeric or visual)

We’re selecting numeric variables because the functions cor() and GGally::ggscatmat() will calculate correlations for us, and this can only be done for numeric variables (doubles, integers, etc.).

If you encounter an error like this:

Error in cor() : 'x' must be numeric

it means we’re trying to perform a calculation on a variable that isn’t numeric. This can happen with the cor() function, but similar errors or warnings can pop up with essentially any function that performs mathematical operations (like mean() or median()).

games_tib_cor <- games_tib |> 
  dplyr::select(age, affect, life_sat, experience, hours)

Note how we’ve saved the selection into a new object called games_tib_cor using the assignment operator, instead of overwriting the original object. By selecting only 5 columns, we’re making a substantial change to the dataset, so it’s good to keep the original version of games_tib available in case we need it later (which, in this case, we will).
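A quick way to see which columns would trigger the 'x' must be numeric error is to check each column’s type before calling cor(). The sketch below uses base R only and a small made-up data frame (toy is not the lab dataset, and its values are invented for illustration).

```r
# Hypothetical toy data: two numeric columns and one character column
toy <- data.frame(
  age = c(21, 34, 28, 45),
  hours = c(5, 12, 3, 20),
  game_type = c("RPG", "Shooter", "RPG", "Sports game")
)

# TRUE for columns that cor() can work with, FALSE otherwise
is_num <- sapply(toy, is.numeric)
is_num

# Calling cor() on the full data frame fails; capture the message
# instead of stopping the script
result <- tryCatch(cor(toy), error = function(e) conditionMessage(e))
result

# Keeping only the numeric columns makes the call work
cor_ok <- cor(toy[, is_num])
cor_ok
```

Checking types first is useful when a column that looks numeric was read in as text, for example because of a stray character in the raw file.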

games_tib_cor |> cor()

                 age     affect    life_sat  experience        hours
age        1.0000000 0.14646774 0.111745366  0.49100762  0.171203480
affect     0.1464677 1.00000000 0.631929165  0.04129028  0.018558241
life_sat   0.1117454 0.63192917 1.000000000  0.03644992  0.005614853
experience 0.4910076 0.04129028 0.036449923  1.00000000 -0.012494233
hours      0.1712035 0.01855824 0.005614853 -0.01249423  1.000000000

games_tib_cor |> GGally::ggscatmat()

We also talked about how the objects produced by GGally are ggplot extensions, which means that we can modify the plots as we usually would, for example by adding a theme:

games_tib_cor |> 
  GGally::ggscatmat() + 
  theme_minimal()

Task 2: Run correlation tests

Run correlation tests on the numeric variables
Which relationships are statistically significant?

The previous functions tell us the sizes of the correlation coefficients, but we don’t know whether the relationships are statistically significant. That is, if in reality the relationship between the variables doesn’t exist, how likely are we to find a correlation coefficient at least as large as the one we observed in our sample?
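One way to build intuition for this question is a permutation sketch: shuffling one variable destroys any real association, so correlations computed on shuffled data show what chance alone produces. The example uses simulated data (x, y and the slope of 0.3 are assumptions for illustration). It is not how the correlation test itself gets its p-values, which come from the t-distribution.

```r
set.seed(1)

# Simulated data with a genuine, moderate association
n <- 200
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)

# Correlation observed in this "sample"
r_obs <- cor(x, y)

# Correlations we get when the association is broken by shuffling y
r_null <- replicate(2000, cor(x, sample(y)))

# Proportion of shuffled correlations at least as extreme as the observed one
p_perm <- mean(abs(r_null) >= abs(r_obs))
p_perm
```

A small proportion means that a correlation this large would rarely appear if the two variables were unrelated.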

correlation::correlation(games_tib_cor)

# Correlation Matrix (pearson-method)

Parameter1 | Parameter2 |        r |        95% CI | t(16977) |         p
-------------------------------------------------------------------------
age        |     affect |     0.15 | [ 0.13, 0.16] |    19.29 | < .001***
age        |   life_sat |     0.11 | [ 0.10, 0.13] |    14.65 | < .001***
age        | experience |     0.49 | [ 0.48, 0.50] |    73.44 | < .001***
age        |      hours |     0.17 | [ 0.16, 0.19] |    22.64 | < .001***
affect     |   life_sat |     0.63 | [ 0.62, 0.64] |   106.24 | < .001***
affect     | experience |     0.04 | [ 0.03, 0.06] |     5.38 | < .001***
affect     |      hours |     0.02 | [ 0.00, 0.03] |     2.42 | 0.047*   
life_sat   | experience |     0.04 | [ 0.02, 0.05] |     4.75 | < .001***
life_sat   |      hours | 5.61e-03 | [-0.01, 0.02] |     0.73 | 0.464    
experience |      hours |    -0.01 | [-0.03, 0.00] |    -1.63 | 0.207    

p-value adjustment method: Holm (1979)
Observations: 16979

The column titled r contains the correlation coefficients, while the p-values are in the column p. There are some large correlations - like the one between age and gaming experience, or affect and life satisfaction - that are statistically significant. This makes sense.

What makes less sense is that we also have some minuscule correlations that are close to 0, like the one between affect and hours spent playing video games. The correlation is just 0.02, but the p-value is 0.047, which is just below the conventional .05 threshold for statistical significance. Note that the number of observations is listed at the end of the table, and in this case it is nearly 17000. With a sample this large, even tiny correlations can be statistically significant. It’s important to always consider the effect size (correlation coefficient, mean difference, etc.), not just the p-value.
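To see how much the sample size matters, the t-statistic for a Pearson correlation can be computed by hand as t = r * sqrt(n - 2) / sqrt(1 - r^2). The helper below (r_to_p is a hypothetical name, not a function from any package) returns the unadjusted two-sided p-value, so it will not match the Holm-adjusted values in the table exactly.

```r
# Hypothetical helper: unadjusted two-sided p-value for a Pearson r
r_to_p <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)  # t with n - 2 degrees of freedom
  2 * stats::pt(-abs(t_stat), df = n - 2)
}

# Same tiny correlation, two very different sample sizes
p_large <- r_to_p(r = 0.02, n = 16979)  # roughly the sample size in this lab
p_small <- r_to_p(r = 0.02, n = 100)    # a more typical small study

p_large
p_small
```

The correlation is identical in both calls. Only n changes, which is why the effect size needs to be read alongside the p-value.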

Chi-square

Hypothesis

There will be an association between type of game (Animal Crossing vs Sports Game) and experiences of positive or negative affect.

Make a prediction! Who do you think is going to be more likely to experience positive affect? Players of Animal Crossing or players of sports games (car racing)?

The main difference here is that we’re now using the categorical version of affect (positive vs negative), so both variables are categorical.

Task 3: Quick data cleaning

We’re interested in comparing the game “Animal crossing” against games classified as “Sports game” - keep only the rows that contain these two game types and save the new dataset into an object called games_tib_chi

Here’s a quick tip on how to check which values are present in a given column if we need to get their exact spelling etc:

games_tib$game_type |> unique()

[1] "Animal crossing" "Shooter"         "RPG"             "Sports game"

Filter the right rows and check values again:

games_tib_chi <- games_tib |> 
  dplyr::filter(game_type == "Animal crossing" | game_type == "Sports game")

games_tib_chi$game_type |> unique()

[1] "Animal crossing" "Sports game"

Note: remember how we didn’t overwrite our dataset when working with correlations? Now we needed to use it again. If we had overwritten it, we’d need to re-read the dataset, which would mess up the order in which the code naturally progresses, and that’s where 95% of rendering problems typically occur.
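When more than a couple of values need to be kept, chaining == with | gets long. A common alternative is %in%, sketched here on a made-up vector (the values mirror the game types above but are not the lab data):

```r
# Hypothetical vector of game types
game_type <- c("Animal crossing", "Shooter", "RPG", "Sports game", "Sports game")

# Condition written with == and |, as in the filter above
keep_or <- game_type == "Animal crossing" | game_type == "Sports game"

# Equivalent condition written with %in%
keep_in <- game_type %in% c("Animal crossing", "Sports game")

# Both select the same elements
identical(keep_or, keep_in)

# Inside a pipeline, the same condition would go into dplyr::filter(), e.g.
# dplyr::filter(game_type %in% c("Animal crossing", "Sports game"))
game_type[keep_in]
```

The two conditions differ only for missing values: == returns NA where %in% returns FALSE, and dplyr::filter() drops the row in both cases.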

Task 4: Plotting!

Create a bar plot (yes, a bar plot!) showing the counts of participants across the two game types split by affect valence
Interpret the plot - does it support your prediction?

If there’s time left:

Change the default colours
Adjust axis labels

This is what we did in the session:

games_tib_chi |> 
  ggplot2::ggplot(aes(x = game_type, fill = affect_cat)) +
  geom_bar(position = "dodge")

The tutorial has a detailed breakdown of making plots like this so make sure to check it out. I just want to highlight that yes, we’re making a bar graph, but it’s a bar graph with counts - for all intents and purposes, this is the same use as a histogram, which is why I don’t consider it to be as big of a crime.
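The bar heights are simply the number of rows in each combination of categories, the same counts a contingency table gives. A minimal sketch with a made-up data frame (toy, with values invented for illustration) shows the counts and, because groups can differ in size, the proportions within each game type:

```r
# Hypothetical toy data: one row per participant
toy <- data.frame(
  game_type = c("Animal crossing", "Animal crossing",
                "Sports game", "Sports game", "Sports game"),
  affect_cat = c("Positive", "Negative", "Positive", "Positive", "Negative")
)

# Counts per combination: these are the heights geom_bar() would draw
counts <- table(toy$game_type, toy$affect_cat)
counts

# Proportions within each game type (each row sums to 1)
props <- prop.table(counts, margin = 1)
props
```

Comparing proportions rather than raw counts helps when one group has many more participants than the other.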

Some additional code to make the plot prettier:

games_tib_chi |> 
  ggplot2::ggplot(aes(x = game_type, fill = affect_cat)) +
  geom_bar(position = "dodge", width = 0.75) + 
  scale_fill_manual(values = c("darkmagenta", "darkcyan")) + 
  labs(x = "Type of game", y = "Count", fill = "Affect category") + 
  theme_light()

Task 5: Run Chi-square test

Run the test of association between type of game and affect category
Interpret the results - does the statistical test support your prediction?
Can we reject the null hypothesis?
Can we say that a specific type of video game causes a type of affect?

chi_test <- chisq.test(games_tib_chi$affect_cat, games_tib_chi$game_type)
chi_test


    Pearson's Chi-squared test with Yates' continuity correction

data:  games_tib_chi$affect_cat and games_tib_chi$game_type
X-squared = 147.94, df = 1, p-value < 2.2e-16

The chi-square test compares two things. The first is the counts/frequencies we would expect in every combination of categories if the null hypothesis is true, that is, if the two variables are not associated whatsoever. These expected frequencies are then compared against the frequencies we actually observed in the dataset. We can ask R to print out both:

chi_test$expected

                        games_tib_chi$game_type
games_tib_chi$affect_cat Animal crossing Sports game
                Negative        950.5018    1361.498
                Positive       5580.4982    7993.502

chi_test$observed

                        games_tib_chi$game_type
games_tib_chi$affect_cat Animal crossing Sports game
                Negative            1217        1095
                Positive            5314        8260
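The expected counts can be reproduced by hand: for each cell, multiply its row total by its column total and divide by the grand total. The sketch below re-enters the observed counts printed above as a matrix and recomputes the test statistic, including the Yates continuity correction (subtracting 0.5 from each absolute difference) that chisq.test() applies to 2 x 2 tables by default. The object names are illustrative.

```r
# Observed counts copied from the output above (filled column by column)
observed <- matrix(
  c(1217, 5314, 1095, 8260),
  nrow = 2,
  dimnames = list(
    affect_cat = c("Negative", "Positive"),
    game_type = c("Animal crossing", "Sports game")
  )
)

# Expected count = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# Chi-square statistic with Yates' continuity correction
chi_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
chi_yates

# p-value from the chi-square distribution with 1 degree of freedom
p_value <- stats::pchisq(chi_yates, df = 1, lower.tail = FALSE)
p_value

# Proportion of each affect category within each game type
prop.table(observed, margin = 2)
```

The column proportions in the last line help with interpreting the direction of the association, which the test statistic alone does not show.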

References:

Video games and well-being paper (source of the dataset):

https://royalsocietypublishing.org/doi/10.1098/rsos.220411