Lab 03 – What should I major in?

Data wrangling
Data visualization
College majors
FiveThirtyEight

Photo by Marleena Garris on Unsplash

Goals

In this lab, you will:

  • Explore relationships between college majors, earnings, and employment
  • Practice data wrangling and visualization skills
  • Continue developing a reproducible data analysis workflow

Getting Started

  • You will be working in your Lab 03–04 Groups (see Blackboard).
  • Download the .qmd file for this lab from our class GitHub repo.
  • Refer back to Lab 01 for detailed workflow and submission instructions.

Packages

We will use the following packages:

  • tidyverse: data wrangling and visualization
  • scales: formatting labels
  • ggridges: ridge plots
  • kableExtra: table formatting
  • fivethirtyeight: data source
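
A minimal sketch of a setup chunk that loads these packages. It assumes each one is already installed; if one is missing, install.packages() can fetch it from CRAN.

```r
# Load the packages used in this lab (assumes each is already installed)
library(tidyverse)        # dplyr, ggplot2, and related packages
library(scales)           # label formatting, e.g. percent()
library(ggridges)         # ridge plots
library(kableExtra)       # table formatting
library(fivethirtyeight)  # provides the college_recent_grads data frame
```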

Data: College Majors and Earnings

In this assignment we explore data on college majors and earnings, specifically the data in the FiveThirtyEight story “The Economic Guide To Picking A College Major”.

These data originally come from the American Community Survey (ACS) 2010–2012 Public Use Microdata Series. While this is outside the scope of this assignment, if you are curious about how the raw ACS data were cleaned and prepared, see the code the FiveThirtyEight authors used. These data are over a decade old at this point, but you could pull and analyze more recent ACS data for your project! The ACS includes many more survey topics than those in this analysis.

We should also note that there are many considerations that go into picking a college major. Earnings potential and employment prospects are two of them, and they are important, but they don’t tell the whole story. Keep this in mind as you analyze the data.

The dataset is included in the fivethirtyeight package and is called college_recent_grads.

?college_recent_grads

The college_recent_grads data frame is a trove of information. Let’s think about some questions we might want to answer with these data:

  • Which major has the lowest unemployment rate?
  • Which major has the highest percentage of women?
  • How do the distributions of median income compare across major categories?
  • Do women tend to choose majors with lower or higher earnings?

In the next section we aim to answer these questions.


Exercises

Respond to each exercise with clearly labeled code and written interpretation.


Exercise 1

How many observations and variables are in this dataset? What does each row represent?

Exercise 2

Which major category is the least popular (fewest total graduates)?

Use the following scaffold:

college_recent_grads |>
  group_by(major_category) |>
  summarise(total_per_category = ___(___)) |>
  arrange(___)

Exercise 3

Which majors have the lowest unemployment rates? Create a neatly formatted table to answer this question, keeping in mind the following:

  • Make a decision about how many rows to display in your output
  • Make sure to sort your data in a reasonable way
  • Choose only a small subset of variables to display in your output to improve readability.
  • Add kbl() |> kable_minimal() to the end of your pipeline to clean up how your table appears in your html.


The function slice() and/or its variations might be helpful for displaying only a subset of rows. Check out the help documentation by running ?slice in the console.
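
As a rough illustration of two slice variants, here is a sketch on a made-up data frame. The snacks data and its column names are hypothetical and are not part of the lab dataset.

```r
library(dplyr)

# Hypothetical toy data, not the lab dataset
snacks <- tibble(
  snack = c("apple", "chips", "granola bar", "yogurt", "cookie"),
  price = c(0.50, 1.25, 1.00, 0.90, 1.50)
)

# Option 1: sort first, then keep the first three rows
snacks |>
  arrange(price) |>
  slice_head(n = 3)

# Option 2: slice_min() keeps the rows with the smallest values of a variable
cheapest <- snacks |>
  slice_min(price, n = 3)

cheapest
```

Both options return the three cheapest snacks here. Note that slice_min() and slice_max() keep tied rows by default, so they can return more than n rows when values repeat.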

You can find more table formatting options in the kableExtra package documentation.

Exercise 4

Recreate the table from Exercise 3, but display unemployment as percentages rounded to two decimal places. To do this, create a new variable called unemployment_perc that uses the percent() function to convert the proportions to percentages.


Hint: read the ?percent() documentation to see how the accuracy argument works.
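
To see what the accuracy argument does, here is a small sketch on a made-up vector of proportions. The values are hypothetical and do not come from the lab dataset.

```r
library(scales)

# Hypothetical proportions, not values from the lab dataset
props <- c(0.05123, 0.2, 0.33333)

# accuracy = 0.01 rounds each percentage to the nearest 0.01
formatted <- percent(props, accuracy = 0.01)

formatted
```

The result is a character vector, so sort your data before converting to percentages; sorting the formatted text would order it alphabetically rather than numerically.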

Exercise 5

Which majors have the highest percentage of women? Create a nicely formatted table and comment on what you see.

Exercise 6

Create a similar table for majors with the lowest percentage of women.

How do the distributions of median income compare across major categories?

There are three income variables reported in this data frame: p25th, median, and p75th. These correspond to the 25th, 50th, and 75th percentiles of the income distribution of sampled individuals for a given major.

A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value below which 20% of the observations may be found. (Source: Wikipedia)
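
As a quick illustration of percentiles, the sketch below computes the 25th, 50th, and 75th percentiles of nine made-up incomes with base R's quantile(). The incomes are hypothetical, and quantile() interpolates between observations by default, which may differ from how the ACS figures were computed.

```r
# Hypothetical incomes for nine people, in dollars
incomes <- c(20000, 25000, 30000, 35000, 40000, 45000, 50000, 55000, 60000)

# 25th, 50th (the median), and 75th percentiles
income_percentiles <- quantile(incomes, probs = c(0.25, 0.50, 0.75))

income_percentiles
```

For these values the three percentiles are $30,000, $40,000, and $50,000.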

Exercise 7

Why do we often use the median rather than the mean to describe income?

The question we want to answer is “How do the distributions of median income compare across major categories?” We need to do a few things to answer this question: First, we need to group the data by major_category. Then, we need a way to summarize the distributions of median income within these groups. This decision will depend on the shapes of these distributions. So first, we need to visualize the data.

Exercise 8

Create a histogram of median income for all majors (ignore categories).

  • Choose a reasonable binwidth and justify your choice. You might ask yourself: “What would be a meaningful difference in median incomes?” $1 is obviously too little; $10,000 might be too much.
  • Describe what the distribution reveals.
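
The binwidth argument is given in the units of the variable on the x-axis. Here is a sketch with simulated exam scores; the scores data frame is hypothetical and unrelated to the lab dataset.

```r
library(ggplot2)

# Hypothetical exam scores, simulated for illustration only
set.seed(1)
scores <- data.frame(score = rnorm(200, mean = 70, sd = 10))

# binwidth is in the units of the x variable: here, each bar spans 5 points
score_hist <- ggplot(scores, aes(x = score)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Exam score", y = "Number of students")

score_hist
```

Trying a few different binwidths and comparing the resulting shapes is a reasonable way to settle on one.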

Also include the following code to produce summary statistics for the median income:

summary(college_recent_grads$median)

Exercise 9

Use a ridge plot to compare the distribution of median income across major categories. Comment on what you observe.
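
If ridge plots are new to you, here is a minimal sketch using geom_density_ridges() from ggridges on simulated data. The commutes data frame and its columns are hypothetical.

```r
library(ggplot2)
library(ggridges)

# Hypothetical commute times for three cities, simulated for illustration only
set.seed(1)
commutes <- data.frame(
  city = rep(c("City A", "City B", "City C"), each = 100),
  minutes = c(rnorm(100, 25, 5), rnorm(100, 35, 8), rnorm(100, 45, 10))
)

# Numeric variable on x, grouping variable on y: one density ridge per group
commute_ridges <- ggplot(commutes, aes(x = minutes, y = city)) +
  geom_density_ridges() +
  labs(x = "Commute time (minutes)", y = NULL)

commute_ridges
```

The key design choice is that the numeric variable goes on the x-axis and the categorical variable on the y-axis, so each category gets its own density curve.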

Now that we’ve seen the shapes of the distributions of median incomes for each major category, we should have a better idea of which summary statistic to use to quantify the typical median income: the mean or the median? (Yes, we are talking about the mean/median of medians!)

Exercise 10

Which major category has the highest typical median income? Explain how you defined “typical.”

All STEM fields aren’t the same

One of the sections of the FiveThirtyEight story is “All STEM fields aren’t the same”. Let’s see if this is true.

Exercise 11

Use the following to create a new vector called stem_categories that lists the major categories that are considered STEM fields.

stem_categories <- c("Biology & Life Science",
                     "Computers & Mathematics",
                     "Engineering",
                     "Physical Sciences")

Then, use this to create a new variable in our data frame indicating whether a major is STEM or not.

college_recent_grads <- college_recent_grads |>
  mutate(major_type = ifelse(major_category %in% stem_categories, "stem", "not stem"))

Let’s unpack this code: with mutate we create a new variable called major_type, which is defined as "stem" if the major_category is in the vector called stem_categories we created earlier, and as "not stem" otherwise.

%in% is a logical operator. Other commonly used logical operators are:

Operator     Operation
x < y        less than
x > y        greater than
x <= y       less than or equal to
x >= y       greater than or equal to
x != y       not equal to
x == y       equal to
x %in% y     x is an element of y
x | y        or
x & y        and
!x           not
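
To see a few of these operators in action, here is a sketch on made-up vectors. All names below are hypothetical and unrelated to the lab dataset.

```r
# Hypothetical toy vectors
fruit <- c("apple", "banana", "cherry", "kiwi")
shopping_list <- c("cherry", "lemon", "orange")
weight <- c(180, 120, 8, 75)

# %in% checks each element on the left for membership in the vector on the right
is_listed <- fruit %in% shopping_list

# ifelse() turns the TRUE/FALSE result into labels
label <- ifelse(is_listed, "listed", "not listed")

# & (and) and ! (not) combine conditions element by element
light_and_unlisted <- weight <= 100 & !is_listed

is_listed
label
light_and_unlisted
```

Only "cherry" is on the shopping list, and only "kiwi" is both at most 100 in weight and not on the list. The same logical expressions can be used inside filter() and mutate().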

Exercise 12

Which STEM majors have median earnings less than or equal to the overall median of medians (which we found earlier to be $36k)?

Show only the major name and income percentiles. Sort from highest to lowest median income.

Exercise 13

What types of majors do women tend to major in? Create a scatterplot of:

  • median income (y-axis)
  • proportion of women (x-axis)

Color points by STEM vs non-STEM. Describe the relationships you observe.

Exercise 14

Write a brief reflection summarizing:

  • key findings
  • limitations of the data
  • additional data you would want in order to explore further questions

Bonus

Propose an additional question you could explore using these data. Provide a visualization and comment on what it shows.

Submission

Before submitting your .html:

  • Check code readability
  • Suppress unnecessary warnings and messages
  • Ensure figures and tables have clear labels
  • Confirm exercises are clearly labeled

Compress the final .html file into a .zip file and upload it to Blackboard.


Grading (50 pts)

Component               Points
Exercises 1–14          42 (3 each)
Workflow & formatting   5
Reflection              3
Bonus                   2