Lab 07: Midterm / review

Avengers, World Happiness

data wrangling

data visualization

data importing

Goals

This lab will assess the following skills:

Data wrangling
Data visualization
Importing data
Clarity of code and written communication
Ability to use data to investigate questions

Getting started

Download lab-07.qmd, avengers.csv and World Happiness 2023.xlsx from the course website
Place the .qmd file in your STAT_4380 folder on your computer
Place the .csv and .xlsx data files inside the data subfolder within STAT_4380

Packages

We will use the following packages for this lab:

library(tidyverse)
library(kableExtra)
library(readxl)
library(here)

Data

You will analyze two different datasets in this lab:

avengers (Exercises 1 - 5): data on 173 Marvel characters
happiness (Exercises 6 - 10): data on World Happiness metrics for 165 countries from 2008 to 2023

Avengers Data

This data was originally collected for a FiveThirtyEight article. The version of the avengers data we will work with here can be found in the avengers.csv file on the course website. The code below will load the data (assuming it is placed appropriately inside a data subfolder of your R project).

avengers <- read_csv("data/avengers.csv")
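The here package loaded above offers another way to build the same file path. The following is a minimal sketch, assuming your R project root is the STAT_4380 folder; the avengers_path name is used only for illustration.

```r
library(here)
library(readr)

# Build a path relative to the project root (assumed here to be STAT_4380)
avengers_path <- here("data", "avengers.csv")

# Read the file only if it is where we expect it to be
if (file.exists(avengers_path)) {
  avengers <- read_csv(avengers_path)
} else {
  message("File not found: ", avengers_path)
}
```

If the message appears, check that the .csv file was saved inside the data subfolder.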

This dataset includes information about characters across the entire Marvel Cinematic Universe (MCU), so some of the names will be familiar if you are a fan of the films or comics. Don’t worry if you aren’t a Marvel fan; no background knowledge is needed to successfully complete this lab!

We will focus on the following variables in this lab:

Header	Definition
`name`	The full name or alias of the character
`appearances`	The number of comic books that character appeared in as of April 30
`current`	Is the member currently active on an Avengers-affiliated team?
`gender`	The recorded gender of the character
`probationary_introl`	The date on which the character was given probationary status as an Avenger. The value will be NA if the character was never given probationary status.
`full_reserve`	The month and year the character was introduced as a full or reserve member of the Avengers
`year`	The year the character was introduced as a full or reserve member of the Avengers
`years_since_joining`	2015 minus the year
`death1`	Yes if the Avenger died, No if not.
`return1`	Yes if the Avenger returned from their first death, No if they did not, blank if not applicable

See FiveThirtyEight’s GitHub repo for the full codebook.

Exercises

Exercise 1

You are interested in creating a data frame with the most “classic” Avengers.

Create a new data frame that includes only Avengers that 1) were created in 1970 or earlier and 2) were NOT given probationary status.
Call the new data frame classic_avengers.

Hint: is.na() will be useful. Check: your new data frame should have 27 observations.
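As a reminder of how is.na() behaves inside filter(), here is a small sketch on the built-in airquality data (not the lab data). The object names early_missing and early_recorded are made up for this illustration.

```r
library(dplyr)

# airquality ships with R; its Ozone column contains missing values
early_missing <- airquality |>
  filter(Month <= 6, is.na(Ozone))    # both conditions must hold

# Negate with ! to keep only the rows where Ozone was recorded
early_recorded <- airquality |>
  filter(Month <= 6, !is.na(Ozone))

nrow(early_missing)
nrow(early_recorded)
```

Separating conditions with a comma inside filter() works like a logical AND.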

Exercise 2

Who are the newest classic Avengers?

Using a single pipeline:

Create a new variable called years_served that represents the number of years served as of 2025. (Hint: you can use either the year variable or the years_since_joining variable to do this.)
Arrange the data appropriately to display the newest classic Avengers first.
Display only necessary columns and the first 10 rows in a nicely formatted table.
In addition to the table, report the names of the three newest classic Avengers and how long they have served in your narrative text.
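The general shape of such a pipeline can be sketched on made-up data. Everything in the members data frame below is hypothetical (it is not taken from avengers.csv), and years_elapsed is just an illustrative column name.

```r
library(dplyr)
library(kableExtra)

# Hypothetical toy data, not taken from avengers.csv
members <- data.frame(
  name = c("A", "B", "C", "D"),
  year = c(1963, 1968, 1965, 1970)
)

newest <- members |>
  mutate(years_elapsed = 2025 - year) |>   # time elapsed as of 2025
  arrange(desc(year)) |>                   # most recent year first
  select(name, year, years_elapsed) |>     # keep only the columns shown
  slice_head(n = 3)                        # first three rows

# Format the result as a table for the rendered document
newest |>
  kbl(col.names = c("Name", "Year", "Years elapsed")) |>
  kable_styling(full_width = FALSE)
```

Clear column names in the table make it easier to refer to specific values in your narrative text.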

Exercise 3

Has the percentage of female Avengers changed over time?

To explore this question, compute the percentage of female Avengers in the classic_avengers dataset and compare it to the percentage of females among all Avengers.

What do you conclude based on these results?
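One common way to compute a percentage is to take the mean of a logical condition and multiply by 100. The sketch below uses the built-in mtcars data rather than the lab data, and the names overall and four_cyl are illustrative only.

```r
library(dplyr)

# Share of manual-transmission cars (am == 1) in the full data
overall <- mtcars |>
  summarize(pct_manual = mean(am == 1) * 100)

# Same calculation on a subset, for comparison
four_cyl <- mtcars |>
  filter(cyl == 4) |>
  summarize(pct_manual = mean(am == 1) * 100)

overall
four_cyl
```

This works because TRUE counts as 1 and FALSE as 0 when a logical vector is averaged.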

Exercise 4

Sort the full avengers dataset in descending order of appearances
Display only the top 5 observations and only the columns name, appearances, death1, and return1
What do you observe about these Avengers in terms of deaths and returns?

Exercise 5

Do characters who die at least once tend to appear more or less often than characters who never die?

Create a visualization using the full avengers dataset to examine the distribution of appearances by whether or not a character has died at least once.
There are multiple correct ways to visualize this; just make sure your visualization is clean and easy to read.
What do you learn about Marvel movies from your results?
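Side-by-side boxplots are one possible way (among several) to compare a numeric distribution across two groups. The sketch below uses the built-in ToothGrowth data, not the lab data; p_groups is an illustrative name.

```r
library(ggplot2)

# ToothGrowth ships with R: len is numeric, supp is a two-level factor
p_groups <- ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_boxplot() +
  labs(
    title = "Tooth length by supplement type",
    x = "Supplement",
    y = "Tooth length"
  )

p_groups
```

Overlaid densities or faceted histograms would be reasonable alternatives, depending on what you want to emphasize.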

World Happiness Data

The World Happiness Report is produced annually by the Gallup World Poll. According to their website,

“Since creating the World Poll in 2005, Gallup has conducted studies in more than 160 countries and territories that are home to more than 98% of the world’s adult population. The World Poll survey includes more than 100 global questions as well as region-specific items. Gallup asks residents from Australia to Zimbabwe the same questions, every time, in the same way. This makes it possible to trend data from year to year and make direct country comparisons.” - World Happiness Report

The World Happiness 2023.xlsx data file includes data from 2008 to 2023. The following code will read it in.

happiness <- read_excel("data/World Happiness 2023.xlsx")

Variable	Description or Question(s) Asked
country	Country name (used only as an identifier).
year	2008 – 2023.
happiness	Country average response to “Imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?”
logGDPpc	log₁₀ of Gross Domestic Product Per Capita (in 2011 US$).
social_support	Having someone to count on in times of trouble. Proportion of people in that country who responded “yes” to “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
life_exp	Life expectancy, in years, of a healthy child at birth.
freedom_choices	Proportion of people in that country who responded “satisfied” to “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
freedom_choices_cat	A categorization of `freedom_choices` into “low”, “med”, or “high”.
generosity	Residual from regressing the national average response to “Have you donated money to a charity in the past month?” on GDP per capita.
corruption	Average proportion of people responding “yes” to “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?”.
affect_pos	Average of three questions: “Did you experience happiness, laughter, and enjoyment during a lot of the day yesterday?”
affect_neg	Average of three questions: “Did you experience worry, sadness, and anger during a lot of the day yesterday?”

Exercise 6

Filter the data to keep only the year 2023.
For each of the following variables (by themselves), produce an appropriate visualization:
- happiness
- logGDPpc
- life_exp
- social_support
- freedom_choices_cat
- corruption
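Which plot is appropriate usually depends on the variable type: a histogram is a typical choice for a numeric variable and a bar chart for a categorical one. The following sketch illustrates both on the built-in mtcars data (not the lab data); p_numeric and p_categorical are illustrative names.

```r
library(ggplot2)

# Numeric variable: histogram of miles per gallon
p_numeric <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(x = "Miles per gallon", y = "Count")

# Categorical variable: bar chart of cylinder counts
p_categorical <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(x = "Number of cylinders", y = "Count")

p_numeric
p_categorical
```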

Exercise 7

Using the 2023 data, produce graphs of happiness as the dependent (y) variable and each other variable from Exercise 6 as the independent (x) variable (one at a time; 5 total visualizations).
For scatterplots, include a smoothed curve (geom_smooth) as well.
Which variables appear to be most correlated with happiness? Explain your answer.
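For reference, a smoothed curve is added to a scatterplot as an extra layer. This is a minimal sketch on the built-in mtcars data, not the lab data; p_smooth is an illustrative name.

```r
library(ggplot2)

# Scatterplot with a smoothed curve drawn on top of the points
p_smooth <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth() +   # uses the default smoothing method
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

p_smooth
```

geom_smooth() prints a message about the smoothing method it chose, which is one reason to suppress messages in the rendered document.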

Exercise 8

Which countries had the top 5 highest happiness ratings in 2023? The lowest 5?

Use code to identify these 10 countries and produce a nicely formatted table of the country names, along with their happiness ratings, life expectancy, corruption, and logGDPpc.
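One way to pull the highest and lowest rows into a single table is to combine slice_max() and slice_min() with bind_rows(). The sketch below uses the built-in mtcars data, not the lab data; cars, top5, bottom5, and extremes are illustrative names.

```r
library(dplyr)

# Move the row names of mtcars into a regular column
cars <- data.frame(model = rownames(mtcars), mtcars, row.names = NULL)

# with_ties = FALSE guarantees exactly n rows even when values are tied
top5 <- cars |> slice_max(mpg, n = 5, with_ties = FALSE)
bottom5 <- cars |> slice_min(mpg, n = 5, with_ties = FALSE)

# Stack the two groups and keep only the columns of interest
extremes <- bind_rows(top5, bottom5) |>
  select(model, mpg, hp, wt)

extremes
```

The combined data frame can then be passed to a table-formatting function such as kbl().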

Exercise 9

Using the 2023 data, create a new variable called social_support_cat which has the value “high” if the value of that country’s social_support is above the median and “low” if it is not.
Create an appropriate visualization of this new variable by itself.
Then, create a scatterplot of happiness as the dependent (y) variable and logGDPpc as the independent (x) variable, with social_support_cat as the color aesthetic.
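A median split can be sketched with if_else() inside mutate(). Note that dplyr's if_else() requires both outcomes to be the same type, so this sketch uses two character labels. It runs on the built-in mtcars data, not the lab data; cars_cat, hp_cat, and p_color are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Label each car by whether its horsepower is above the median
cars_cat <- mtcars |>
  mutate(hp_cat = if_else(hp > median(hp), "high", "low"))

# Use the new categorical variable as the color aesthetic
p_color <- ggplot(cars_cat, aes(x = wt, y = mpg, color = hp_cat)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Horsepower")

p_color
```

If the variable used for the split contains missing values, median() needs na.rm = TRUE.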

Exercise 10

Choose at least three, but no more than about eight, countries from the data that are of interest to you.

Explain why you chose those countries. (This part will not be graded – I am just curious)
For these countries, use only the 2023 data to produce at least one graph with happiness as the dependent (y) variable and another variable of your choice as the independent (x) variable, using a color aesthetic for each country.
Using all years of data, produce a graph of happiness over time, with a different color for each country you chose.
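Selecting a handful of groups and plotting each one over time can be sketched with %in% and a color aesthetic. The example below uses the built-in Orange data (tree growth over time), not the lab data; chosen and p_time are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Keep only the groups of interest
chosen <- Orange |>
  filter(Tree %in% c("1", "2", "3"))

# One line per group, distinguished by color
p_time <- ggplot(chosen, aes(x = age, y = circumference, color = Tree)) +
  geom_line() +
  geom_point() +
  labs(x = "Age (days)", y = "Circumference (mm)", color = "Tree")

p_time
```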

Exercise 11

Choose one of the two datasets - avengers or happiness - and comment on each of the following:

What you found through this analysis
What limitations might exist to the data/analyses
What additional data you would like to have to explore further questions

BONUS

State an additional research question you could investigate using either the avengers or happiness data. State the question, provide a visualization that investigates it, and comment on what your visualization shows.

Submission

Before submitting your .html (as a .zip file to Blackboard):

Check your code for neatness - add spaces and line breaks where appropriate to improve readability
Check visualizations for clean titles and labels
Suppress extraneous messages/warnings (e.g. set #| warning: false, #| message: false inside code chunks)
Ensure exercises are clearly labeled and your text responses are visually distinguished
Confirm neat organization and readable structure
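As an illustration of the chunk options mentioned above, the lines below show what the body of an R code chunk in a .qmd file might look like. The #| lines go at the very top of the chunk; the plot itself uses the built-in airquality data and the illustrative name p_ozone.

```r
#| warning: false
#| message: false

library(ggplot2)

# airquality has missing Ozone values, so this plot would normally
# produce a warning about removed rows in the rendered output
p_ozone <- ggplot(airquality, aes(x = Temp, y = Ozone)) +
  geom_point() +
  labs(x = "Temperature (F)", y = "Ozone (ppb)")

p_ozone
```

Because the options start with #, R treats them as comments, so the code runs the same way in the console.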

Render one last time, check the .html file for accuracy, then compress it into a .zip file to upload to Blackboard.

Grading (100 pts)

Component	Points
Ex 1	8
Ex 2	12
Ex 3	8
Ex 4	6
Ex 5	12
Ex 6	12
Ex 7	10
Ex 8	8
Ex 9	8
Ex 10	8
Ex 11	8
BONUS	2