What’s the prime age of an MLB player?

Data wrangling

Dplyr basics

Tidyverse

Use data wrangling skills to explore the prime age of MLB players

Authors	Affiliation
Jazmine Gurrola	Azusa Pacific University
Joseph Hsieh	Azusa Pacific University
Dat Tran	Azusa Pacific University
Katie Fitzgerald	Azusa Pacific University

Published

July 26, 2024

Methods/Facilitation notes

This module would be suitable for an in-class lab or take-home assignment in an introductory data science course that uses R.
It assumes basic familiarity with the RStudio environment and that a basic introduction to the tidyverse has already been covered, but tips on tidyverse code are provided throughout.
Students should be provided with the following data files (.csv) and Quarto document (.qmd) to produce visualizations and write up their answers to each exercise. Their final deliverable is to turn in an .html document produced by “Rendering” the .qmd.
Posit Cloud (via an Instructor account) or GitHub Classroom are good options for disseminating files to students, but simply uploading files to your university’s course management system works, too.

Introduction

In this module you will be analyzing data on Major League Baseball (MLB) players from the 2014 - 2023 seasons. The goal of this module is to practice your data wrangling skills while investigating the question “at what age do professional baseball players typically reach their prime?”

Learning Objectives

By the end of this module, you will be able to:

Wrangle your data into a form that allows you to answer a research question of interest
Use dplyr verbs to accomplish data wrangling tasks such as:
- creating a new variable with mutate()
- filtering the data with filter() to keep or exclude rows that meet a criterion
- computing a statistic for each level of a categorical variable using group_by()
- quickly viewing the data with slice() or glimpse()
- keeping only unique observations using distinct()
- sorting the data using arrange()
Develop and implement a code strategy to accomplish a necessary wrangling task
Use visualization to investigate a research question

Data / Variable Descriptions

We’ll get started by loading necessary packages and datasets.

library(tidyverse)

batter_stats <- read_csv("batter_stats.csv")
pitcher_stats <- read_csv("pitcher_stats.csv")

This data was curated via the baseballr package, which has built-in functions for acquiring a plethora of baseball data. In particular, data for this module was pulled in May 2024 using the fg_batter_leaders() and fg_pitcher_leaders() functions, which provide over 300 variables of statistics on batters and pitchers beginning with the 1974 season. This analysis engages with only a small subset of the available variables and seasons, but we encourage baseball enthusiasts and aspiring analysts to check out all the baseballr package has to offer.

Variable Descriptions (all players)

Variable	Description
`x_mlbamid`	unique MLB player id
`season`	season the row of data comes from
`team_name`	three-character MLB team abbreviation
`player_name`	first and last name of MLB player
`age`	age of the MLB player during the corresponding season
`war`	Wins Above Replacement - an estimate of the number of wins the player adds to his team beyond what a replacement-level player would provide. It measures a player’s overall value relative to a readily available substitute.
`g`	number of games the player played during the corresponding season

Variable Descriptions (batters only)

Variable	Description
`bats`	whether player bats right-handed or left-handed
`ab`	number of at bats the player had during the corresponding season
`pa`	number of plate appearances the player had during the corresponding season
`h`	number of hits the player had during the corresponding season
`x1b`	number of singles the player hit during the corresponding season
`x2b`	number of doubles the player hit during the corresponding season
`x3b`	number of triples the player hit during the corresponding season
`hr`	number of home runs the player hit during the corresponding season
`r`	number of runs the player had during the corresponding season
`rbi`	number of runs batted in the player had during the corresponding season

Variable Descriptions (pitchers only)

Variable	Description
`throws`	whether player throws right-handed or left-handed
`ip`	number of innings the player pitched during the corresponding season
`era`	Earned Run Average: average number of earned runs given up by a pitcher per nine innings pitched during the corresponding season
`whip`	Walks plus Hits per Inning Pitched. A measure of the number of base-runners a pitcher has allowed per inning pitched during the corresponding season.
`pitches`	total number of pitches thrown by the pitcher during the corresponding season.
`balls`	total number of pitches thrown by the pitcher that were called as balls during the corresponding season.
`strikes`	total number of pitches thrown by the pitcher that were called as strikes during the corresponding season.

Prime Age Analysis

Research question

At what age do professional baseball players tend to be at their “prime”?

There are many different aspects of a player’s performance that determine how well he is playing. One comprehensive metric is called WAR, which stands for “wins above replacement.” MLB provides the following definition of WAR:

“WAR measures a player’s value in all facets of the game by deciphering how many more wins he’s worth than a replacement-level player at his same position (e.g., a Minor League replacement or a readily available fill-in free agent).” - MLB.com

Therefore, one reasonable way to determine a player’s prime age would be to determine the age at which he had his highest WAR.

What’s a good WAR?

A decent WAR is typically around 2, while All-Star players typically have a WAR anywhere between 3 and 6, and MVP-level players tend to have a WAR above 6.

Batters vs Pitchers

WAR is calculated differently for batters and pitchers; they play different roles in the game and therefore they have different stats that capture their performance. We will therefore analyze batters and pitchers separately.

Batters

Research question

What is the average age at which an MLB batter reaches his prime?

Let’s first note how our data is organized:

head(batter_stats)

# A tibble: 6 × 18
  x_mlbamid season team_name bats  player_name       age   war     g    ab    pa
      <dbl>  <dbl> <chr>     <chr> <chr>           <dbl> <dbl> <dbl> <dbl> <dbl>
1    545361   2014 LAA       R     Mike Trout         22  8.29   157   602   705
2    457763   2014 SFG       R     Buster Posey       27  7.52   147   547   605
3    518960   2014 MIL       R     Jonathan Lucroy    28  7.44   153   585   655
4    457705   2014 PIT       R     Andrew McCutch…    27  7.40   146   548   648
5    519317   2014 MIA       R     Giancarlo Stan…    24  6.85   145   539   638
6    488726   2014 CLE       L     Michael Brantl…    27  6.53   156   611   676
# ℹ 8 more variables: h <dbl>, x1b <dbl>, x2b <dbl>, x3b <dbl>, hr <dbl>,
#   r <dbl>, rbi <dbl>, best_war <dbl>

CODE TIP: The function head() returns the first 6 rows of a dataset, and the function tail() returns the last 6. You can add the argument n = to display a different number of rows. Note these are base R functions and do not require the tidyverse to use.
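As a small illustration of the n = argument, here is a sketch using a made-up data frame (toy_seasons and its values are hypothetical and are not part of the module's data):

```r
# Toy data frame for illustration only (not the module's data)
toy_seasons <- data.frame(
  season = 2014:2023,
  value  = seq(10, 100, by = 10)
)

first_three <- head(toy_seasons, n = 3)  # first 3 rows instead of the default 6
last_two    <- tail(toy_seasons, n = 2)  # last 2 rows instead of the default 6

first_three
last_two
```

The same calls should work on batter_stats, for example head(batter_stats, n = 10).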

If we arrange by x_mlbamid we can see that there can be multiple observations per player, where each row represents a different season.

batter_stats |> 
  arrange(x_mlbamid) |> 
  slice(1:10)

# A tibble: 10 × 18
   x_mlbamid season team_name bats  player_name    age     war     g    ab    pa
       <dbl>  <dbl> <chr>     <chr> <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl>
 1    110029   2014 NYM       L     Bobby Abreu     40 -0.213     78   133   155
 2    112526   2014 NYM       R     Bartolo Col…    41 -0.588     31    62    69
 3    112526   2015 NYM       R     Bartolo Col…    42 -0.0408    33    58    64
 4    112526   2016 NYM       R     Bartolo Col…    43 -0.243     34    60    65
 5    112526   2017 - - -     R     Bartolo Col…    44 -0.266     28    19    20
 6    112526   2018 TEX       R     Bartolo Col…    45 -0.0475    28     4     4
 7    114739   2014 CLE       L     Jason Giambi    43 -0.496     26    60    70
 8    115629   2014 COL       R     LaTroy Hawk…    41 -0.0141    57     1     1
 9    115629   2015 - - -     R     LaTroy Hawk…    42 -0.0146    42     1     1
10    116338   2014 DET       R     Torii Hunter    38  1.11     142   549   586
# ℹ 8 more variables: h <dbl>, x1b <dbl>, x2b <dbl>, x3b <dbl>, hr <dbl>,
#   r <dbl>, rbi <dbl>, best_war <dbl>

The pipe: Recall that |> is called the “pipe” operator and can be read as “and then.” In English, the code above can be read as “take the batter_stats data and then arrange it by x_mlbamid and then slice the first 10 rows.” Mathematically, the pipe accomplishes f(g(x)) with the (pseudo-)code x |> g() |> f(). Read more about the pipe here.
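To make the f(g(x)) idea concrete, here is a minimal sketch with two base R functions, where sum() plays the role of g and sqrt() plays the role of f. The vector x is made up for illustration, and the native pipe |> assumes R 4.1 or later.

```r
x <- c(9, 16)

nested <- sqrt(sum(x))          # f(g(x)): sum first, then take the square root
piped  <- x |> sum() |> sqrt()  # same computation, read left to right

identical(nested, piped)
```

Both versions add the values and then take the square root; the piped version simply lists the steps in the order they happen.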

dplyr: arrange() and slice() are examples of dplyr verbs: tidyverse functions that do something to / act on the data. Other examples include filter(), select(), mutate(), group_by(), summarize(), relocate(), and many more. These verbs are often chained together with the pipe to accomplish multiple data wrangling tasks. Read more about data wrangling with dplyr here.
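As a sketch of how several verbs can be chained, the example below uses a small made-up tibble (toy_players and its columns are hypothetical, not the module's data):

```r
library(dplyr)

# Made-up data for illustration only
toy_players <- tibble(
  name = c("A", "B", "C", "D"),
  team = c("X", "X", "Y", "Y"),
  hits = c(120, 85, 150, 40)
)

toy_result <- toy_players |>
  filter(hits >= 80) |>   # keep rows that meet a condition
  select(name, hits) |>   # keep only these two columns
  arrange(desc(hits))     # sort from most to fewest hits

toy_result
```

Each verb takes a data frame and returns a data frame, which is what lets the pipe pass the result of one step on to the next.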

Exercise 1:

Which seasons are included in this data?

TIP: Try writing your answer as a full sentence in the .qmd using inline code. For example, if you have the first season saved in an object first_season, then including `r first_season` outside a code chunk will allow you to auto-populate this value in a sentence.

Important

In order to determine the prime age of each player, we need to look for the year in which his war reached its player-specific maximum. We can utilize the group_by() function to do this.

Exercise 2:

Copy and paste the following code, then fill in the blanks to create a new variable best_war that contains a player’s maximum war.

batter_stats <- batter_stats |> 
  group_by(________) |> 
  mutate(_______ = _______(_______)) |> 
  ungroup()

CODE TIP: group_by() allows all subsequent actions to be done for each group of the grouping variable. Therefore, if we group by player id, we’re able to determine the maximum war for each player, not simply the maximum war for the whole dataset. It’s often a good idea to ungroup() at the end of a chain of code, otherwise the next time you try to use your data, it will still perform every operation by group.
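The grouped mutate() pattern can be sketched on made-up data. The example below deliberately uses a different statistic (a group mean) and hypothetical names (toy_scores, team, points, team_mean), so it illustrates the pattern rather than the exercise itself:

```r
library(dplyr)

# Made-up data for illustration only
toy_scores <- tibble(
  team   = c("X", "X", "Y", "Y", "Y"),
  points = c(10, 20, 5, 7, 9)
)

toy_scores <- toy_scores |>
  group_by(team) |>
  mutate(team_mean = mean(points)) |>  # mean is computed within each team
  ungroup()                            # drop the grouping before moving on

toy_scores
is_grouped_df(toy_scores)  # FALSE once the data has been ungrouped
```

Without group_by(), mean(points) would return a single overall mean repeated on every row; with it, each row receives the mean of its own team.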

Take a quick glimpse() of your data to confirm the first few values of best_war match those below before proceeding.

Rows: 13,917
Columns: 18
$ x_mlbamid   <dbl> 545361, 457763, 518960, 457705, 519317, 488726, 543685, 43…
$ season      <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014…
$ team_name   <chr> "LAA", "SFG", "MIL", "PIT", "MIA", "CLE", "WSN", "TOR", "P…
$ bats        <chr> "R", "R", "R", "R", "R", "L", "R", "R", "R", "R", "R", "R"…
$ player_name <chr> "Mike Trout", "Buster Posey", "Jonathan Lucroy", "Andrew M…
$ age         <dbl> 22, 27, 28, 27, 24, 27, 24, 33, 31, 35, 28, 28, 31, 23, 30…
$ war         <dbl> 8.2866, 7.5222, 7.4368, 7.4014, 6.8473, 6.5310, 6.4054, 6.…
$ g           <dbl> 157, 147, 153, 146, 145, 156, 153, 155, 111, 148, 148, 158…
$ ab          <dbl> 602, 547, 585, 548, 539, 611, 613, 553, 379, 549, 574, 608…
$ pa          <dbl> 705, 605, 655, 648, 638, 676, 683, 673, 460, 614, 644, 695…
$ h           <dbl> 173, 170, 176, 172, 155, 200, 176, 158, 110, 178, 163, 155…
$ x1b         <dbl> 89, 118, 108, 103, 86, 133, 110, 96, 79, 125, 102, 93, 134…
$ x2b         <dbl> 39, 28, 53, 38, 31, 45, 39, 27, 20, 33, 34, 31, 37, 37, 34…
$ x3b         <dbl> 9, 2, 2, 6, 1, 2, 6, 0, 0, 1, 4, 2, 2, 9, 1, 1, 2, 1, 3, 1…
$ hr          <dbl> 36, 22, 13, 25, 37, 20, 21, 35, 11, 19, 23, 29, 14, 16, 19…
$ r           <dbl> 115, 72, 73, 89, 89, 94, 111, 101, 45, 79, 95, 93, 77, 92,…
$ rbi         <dbl> 111, 89, 69, 83, 105, 97, 83, 103, 67, 77, 73, 98, 82, 69,…
$ best_war    <dbl> 9.4559, 7.5222, 7.4368, 7.4014, 6.8473, 6.5310, 6.7801, 6.…

CODE TIP: In real-life data science work, you won’t usually be provided with the “correct” answer to compare to, so it’s often a good idea to do a quick check after any data transformation to make sure your code did what you expected. In this case, you might choose one player to verify that their best_war value is in fact equal to their maximum war value. You can do a quick filter for that player in your console, or use the search feature when viewing the full data in spreadsheet view.
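One way such a spot check might look, sketched on made-up data (toy_sales, store, sales, and group_total are hypothetical names, and the statistic is a sum rather than a maximum):

```r
library(dplyr)

# Made-up data; group_total plays the role of a freshly created variable
toy_sales <- tibble(
  store = c("a", "a", "b", "b"),
  sales = c(3, 4, 10, 1)
) |>
  group_by(store) |>
  mutate(group_total = sum(sales)) |>
  ungroup()

# Pull out one group and recompute the statistic by hand
check <- toy_sales |> filter(store == "a")
sum(check$sales) == unique(check$group_total)
```

If the comparison returns TRUE for the group you picked, that is reasonable evidence the transformation behaved as intended.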

Exercise 3:

Create a new dataset called prime_age that keeps only the rows where a player’s war is equal to his best_war.

What are the dimensions of this new dataset?

Hint: what dplyr verb do you need to keep rows that meet a criterion?

Ideally, we want there to be one row per player in our new dataset. However, if we check the number of unique players we have in our original data, we find this does not match the number of rows in prime_age.

CODE TIP: Two options for checking the number of unique levels of a variable are length(unique(data$variable)) or data |> distinct(variable) |> nrow()
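Both options can be tried side by side on a small made-up tibble (toy_visits and its columns are hypothetical):

```r
library(dplyr)

# Made-up data for illustration only
toy_visits <- tibble(
  visitor = c("ann", "bob", "ann", "cy", "bob"),
  day     = c(1, 1, 2, 2, 3)
)

n_base  <- length(unique(toy_visits$visitor))         # base R approach
n_dplyr <- toy_visits |> distinct(visitor) |> nrow()  # dplyr approach

n_base
n_dplyr
```

The two approaches should always agree; which one to use is mostly a matter of whether you are already working inside a pipe.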

Exercise 4:

Report the number of unique players in the dataset batter_stats.

Inspect the prime_age data more closely. What is the maximum number of rows that appear for a player in this dataset? Comment on why this is happening. Hint: creating a new variable that counts the number of rows per id can help you investigate this.

CODE TIP: group_by(grouping_variable) followed by mutate(n = n()) will count the number of rows per level of the grouping variable.
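Here is a minimal sketch of that counting pattern on made-up data (toy_orders, customer, and amount are hypothetical names):

```r
library(dplyr)

# Made-up data for illustration only
toy_orders <- tibble(
  customer = c("ann", "bob", "ann", "ann", "cy"),
  amount   = c(5, 8, 2, 9, 4)
)

toy_orders <- toy_orders |>
  group_by(customer) |>
  mutate(n = n()) |>  # number of rows belonging to each customer
  ungroup()

toy_orders
```

Because mutate() keeps every row, the count is repeated on each row of a group, which makes it easy to filter or sort on n afterwards.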

Exercise 5:

Determine a strategy for reducing prime_age down to one row per person (still maintaining all relevant columns). Describe your strategy in words and then write code to accomplish it. Careful - don’t just arbitrarily throw away rows! There are multiple ways you might approach this, but you should justify your decision(s) and think through implications for your ultimate analysis goal: estimating prime age.

Your reduced prime_age should look something like this:

Rows: 3,752
Columns: 18
$ x_mlbamid   <dbl> 457763, 518960, 457705, 519317, 488726, 430832, 431145, 13…
$ season      <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014…
$ team_name   <chr> "SFG", "MIL", "PIT", "MIA", "CLE", "TOR", "PIT", "TEX", "M…
$ bats        <chr> "R", "R", "R", "R", "L", "R", "R", "R", "R", "R", "L", "L"…
$ player_name <chr> "Buster Posey", "Jonathan Lucroy", "Andrew McCutchen", "Gi…
$ age         <dbl> 27, 28, 27, 24, 27, 33, 31, 35, 28, 23, 30, 24, 27, 35, 29…
$ war         <dbl> 7.5222, 7.4368, 7.4014, 6.8473, 6.5310, 6.1703, 6.1427, 5.…
$ g           <dbl> 147, 153, 146, 145, 156, 155, 111, 148, 148, 148, 156, 140…
$ ab          <dbl> 547, 585, 548, 539, 611, 553, 379, 549, 574, 558, 563, 524…
$ pa          <dbl> 605, 655, 648, 638, 676, 673, 460, 614, 644, 640, 643, 616…
$ h           <dbl> 170, 176, 172, 155, 200, 158, 110, 178, 163, 165, 150, 150…
$ x1b         <dbl> 118, 108, 103, 86, 133, 96, 79, 125, 102, 103, 96, 89, 103…
$ x2b         <dbl> 28, 53, 38, 31, 45, 27, 20, 33, 34, 37, 34, 28, 35, 37, 18…
$ x3b         <dbl> 2, 2, 6, 1, 2, 0, 0, 1, 4, 9, 1, 1, 2, 1, 1, 3, 1, 7, 3, 2…
$ hr          <dbl> 22, 13, 25, 37, 20, 35, 11, 19, 23, 16, 19, 32, 36, 16, 21…
$ r           <dbl> 72, 73, 89, 89, 94, 101, 45, 79, 95, 92, 87, 89, 80, 85, 7…
$ rbi         <dbl> 89, 69, 83, 105, 97, 103, 67, 77, 73, 69, 74, 78, 107, 82,…
$ best_war    <dbl> 7.5222, 7.4368, 7.4014, 6.8473, 6.5310, 6.1703, 6.1427, 5.…

Exercise 6:

Produce a visualization that explores the distribution of prime ages, for all players in this data.

Exercise 7:

Based on the graph, “eyeball” an initial answer to the research question: at what age do professional batters tend to be at their “prime”?

Exercise 8:

Calculate the mean and the median prime age for batters in this data.

Exercise 9:

Reproduce your graph from above but add 2 lines to the graph representing the mean and median of the distribution.

Tip: Add a layer called geom_vline to your ggplot code. Make sure the colors of the lines are different.
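As a sketch of the geom_vline() layer, the example below uses R's built-in mtcars data rather than the module's data, so the variable (mpg), bin width, and colors are illustrative choices only:

```r
library(ggplot2)

# Histogram of a built-in variable with reference lines for the mean and median
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2) +
  geom_vline(xintercept = mean(mtcars$mpg), color = "red") +
  geom_vline(xintercept = median(mtcars$mpg), color = "blue", linetype = "dashed")

p
```

Each geom_vline() call adds one vertical line at the value given to xintercept, so two calls with different colors give one line per statistic.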

Pitchers

Research question

What is the average age at which an MLB pitcher reaches his prime?

Exercise 10

Copy, paste, and tweak the appropriate code from previous exercises to determine the prime age of pitchers, using the pitcher_stats data.

Check: there are 2382 unique pitchers in the pitcher_stats data, so your final dataset for analysis should have that many rows.

Wrap-up / reflection

Exercise 11

Write a paragraph summarizing your findings about the prime age of batters and pitchers from this analysis. Things to consider:

Are the prime ages of batters and pitchers similar or different?
Do all players hit their prime at about the same age, or is there a wide range?
Are there limitations to this analysis?
What additional analyses would you want to conduct to investigate prime age more fully?
Is there any additional data you would want to explore further?

FOR FUN

You can investigate an individual player’s WAR trajectory over time using the app below. If you’re curious, you can see the R code that built the Shiny app here and even try making your own!