What’s the prime age of an MLB player?
Introduction
In this module you will be analyzing data on Major League Baseball (MLB) players from the 2014 - 2023 seasons. The goal of this module is to practice your data wrangling skills while investigating the question “at what age do professional baseball players typically reach their prime?”
Data / Variable Descriptions
We’ll get started by loading necessary packages and datasets.
library(tidyverse)
batter_stats <- read_csv("batter_stats.csv")
pitcher_stats <- read_csv("pitcher_stats.csv")
This data was curated via the baseballr package, which has built-in functions for acquiring a plethora of baseball data. In particular, data for this module was pulled in May 2024 using the fg_batter_leaders() and fg_pitcher_leaders() functions, which provide over 300 variables of statistics on batters and pitchers beginning with the 1974 season. This analysis uses only a small subset of the available variables and seasons, but we encourage baseball enthusiasts and aspiring analysts to check out all the baseballr package has to offer.
Variable Descriptions (all players)
| Variable | Description | 
|---|---|
| x_mlbamid | unique MLB player id | 
| season | season the row of data comes from | 
| team_name | three-character MLB team abbreviation | 
| player_name | first and last name of MLB player | 
| age | age of the MLB player during the corresponding season | 
| war | Wins Above Replacement - an estimate of the number of wins the player adds to the team beyond what a replacement-level player would provide. It measures a player’s overall value relative to a readily available substitute. | 
| g | number of games the player played during the corresponding season | 
Variable Descriptions (batters only)
| Variable | Description | 
|---|---|
| bats | whether player bats right-handed or left-handed | 
| ab | number of at bats the player had during the corresponding season | 
| pa | number of plate appearances the player had during the corresponding season | 
| h | number of hits the player had during the corresponding season | 
| x1b | number of singles the player hit during the corresponding season | 
| x2b | number of doubles the player hit during the corresponding season | 
| x3b | number of triples the player hit during the corresponding season | 
| hr | number of home runs the player hit during the corresponding season | 
| r | number of runs the player had during the corresponding season | 
| rbi | number of runs batted in the player had during the corresponding season | 
Variable Descriptions (pitchers only)
| Variable | Description | 
|---|---|
| throws | whether player throws right-handed or left-handed | 
| ip | number of innings the player pitched during the corresponding season | 
| era | Earned Run Average: average number of earned runs given up by a pitcher per nine innings pitched during the corresponding season | 
| whip | Walks plus Hits per Inning Pitched. A measure of the number of base-runners a pitcher has allowed per inning pitched during the corresponding season. | 
| pitches | total number of pitches thrown by the pitcher during the corresponding season. | 
| balls | total number of pitches thrown by the pitcher that were called as balls during the corresponding season. | 
| strikes | total number of pitches thrown by the pitcher that were called as strikes during the corresponding season. | 
Prime Age Analysis
Many different aspects of a player’s performance determine how well he is playing. One comprehensive metric is called WAR, which stands for “wins above replacement.” MLB provides the following definition of WAR:
“WAR measures a player’s value in all facets of the game by deciphering how many more wins he’s worth than a replacement-level player at his same position (e.g., a Minor League replacement or a readily available fill-in free agent).” - MLB.com
Therefore, one reasonable way to determine a player’s prime age would be to determine the age at which he had his highest WAR.
Batters
Let’s first note how our data is organized:
head(batter_stats)
# A tibble: 6 × 18
  x_mlbamid season team_name bats  player_name       age   war     g    ab    pa
      <dbl>  <dbl> <chr>     <chr> <chr>           <dbl> <dbl> <dbl> <dbl> <dbl>
1    545361   2014 LAA       R     Mike Trout         22  8.29   157   602   705
2    457763   2014 SFG       R     Buster Posey       27  7.52   147   547   605
3    518960   2014 MIL       R     Jonathan Lucroy    28  7.44   153   585   655
4    457705   2014 PIT       R     Andrew McCutch…    27  7.40   146   548   648
5    519317   2014 MIA       R     Giancarlo Stan…    24  6.85   145   539   638
6    488726   2014 CLE       L     Michael Brantl…    27  6.53   156   611   676
# ℹ 8 more variables: h <dbl>, x1b <dbl>, x2b <dbl>, x3b <dbl>, hr <dbl>,
#   r <dbl>, rbi <dbl>, best_war <dbl>
CODE TIP: The function head() returns the first 6 rows of a dataset, and the function tail() returns the last 6. You can add the argument n = to display a different number of rows. Note these are base R functions and do not require the tidyverse to use.
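As a quick illustration of the n = argument, here is a minimal sketch using a small made-up data frame. The name toy and its values are purely illustrative and are not part of the module data.

```r
# Toy data frame standing in for a real dataset (values are made up)
toy <- data.frame(id = 1:20, value = (1:20)^2)

# First 3 rows instead of the default 6
first_three <- head(toy, n = 3)

# Last 2 rows instead of the default 6
last_two <- tail(toy, n = 2)

print(first_three)
print(last_two)
```

The same calls work on batter_stats or pitcher_stats once they are loaded.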
If we arrange by x_mlbamid, we can see that there can be multiple observations per player, where each row represents a different season.
batter_stats |> 
  arrange(x_mlbamid) |> 
  slice(1:10)
# A tibble: 10 × 18
   x_mlbamid season team_name bats  player_name    age     war     g    ab    pa
       <dbl>  <dbl> <chr>     <chr> <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl>
 1    110029   2014 NYM       L     Bobby Abreu     40 -0.213     78   133   155
 2    112526   2014 NYM       R     Bartolo Col…    41 -0.588     31    62    69
 3    112526   2015 NYM       R     Bartolo Col…    42 -0.0408    33    58    64
 4    112526   2016 NYM       R     Bartolo Col…    43 -0.243     34    60    65
 5    112526   2017 - - -     R     Bartolo Col…    44 -0.266     28    19    20
 6    112526   2018 TEX       R     Bartolo Col…    45 -0.0475    28     4     4
 7    114739   2014 CLE       L     Jason Giambi    43 -0.496     26    60    70
 8    115629   2014 COL       R     LaTroy Hawk…    41 -0.0141    57     1     1
 9    115629   2015 - - -     R     LaTroy Hawk…    42 -0.0146    42     1     1
10    116338   2014 DET       R     Torii Hunter    38  1.11     142   549   586
# ℹ 8 more variables: h <dbl>, x1b <dbl>, x2b <dbl>, x3b <dbl>, hr <dbl>,
#   r <dbl>, rbi <dbl>, best_war <dbl>
The pipe: Recall that |> is called the “pipe” operator and can be read as “and then.” In English, the code above can be read as “take the batter_stats data and then arrange it by x_mlbamid and then slice the first 10 rows.” Mathematically, the pipe accomplishes f(g(x)) with the (pseudo-)code x |> g() |> f(). Read more about the pipe here.
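To make the f(g(x)) idea concrete, the sketch below compares a nested call with its piped equivalent. Here sqrt() plays the role of g and sum() the role of f; the vector x is made up for illustration. The native pipe requires R 4.1 or later.

```r
# A small made-up numeric vector
x <- c(1, 4, 9, 16)

# Nested form: f(g(x)), read from the inside out
nested <- sum(sqrt(x))

# Piped form: x |> g() |> f(), read left to right
piped <- x |> sqrt() |> sum()

# Both forms give the same result
print(identical(nested, piped))
```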
dplyr: arrange() and slice() are examples of dplyr verbs: tidyverse functions that do something to / act on the data. Other examples include filter(), select(), mutate(), group_by(), summarize(), relocate(), and many more. These verbs are often chained together with the pipe to accomplish multiple data wrangling tasks. Read more about data wrangling with dplyr here.
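Here is a minimal sketch of several dplyr verbs chained together with the pipe. The tibble toy_batters, its player names, and the 400 at-bat cutoff are all made up for illustration; only the column names ab and h echo the module data.

```r
library(dplyr)

# Made-up data using the same column names as the batter data
toy_batters <- tibble(
  player_name = c("Player A", "Player B", "Player C", "Player D"),
  ab = c(500, 120, 610, 430),
  h  = c(150, 25, 200, 100)
)

result <- toy_batters |>
  filter(ab >= 400) |>          # keep rows meeting a condition
  mutate(avg = h / ab) |>       # create a new column
  select(player_name, avg) |>   # keep only some columns
  arrange(desc(avg))            # sort from highest to lowest

print(result)
```

Each verb takes a data frame and returns a data frame, which is what makes the chain possible.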
TIP: Try writing your answer as a full sentence in the .qmd using inline code. For example, if you have the first season saved in an object first_season, then including `r first_season` outside a code chunk will allow you to auto-populate this value in a sentence.
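As a sketch of how an object like first_season might be created, the code below takes the minimum of a season column. The vector toy_seasons is made up, and treating “first season” as the earliest season value is an assumption for illustration.

```r
# Made-up season values standing in for a season column
toy_seasons <- c(2016, 2014, 2015, 2014, 2017)

# Save the earliest season in an object for later use in inline code
first_season <- min(toy_seasons)

print(first_season)
```

A sentence in the .qmd such as "The first season in the data is `r first_season`." would then render with the stored value filled in.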
CODE TIP: group_by() allows all subsequent actions to be done for each group of the grouping variable. Therefore, if we group by player id, we’re able to determine the maximum war for each player, not simply the maximum war for the whole dataset. It’s often a good idea to ungroup() at the end of a chain of code, otherwise the next time you try to use your data, it will still perform every operation by group.
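The grouping behavior described above can be sketched on a tiny made-up dataset. The tibble toy and the column player_id are hypothetical; the point is only to show that max() is computed within each group and that ungroup() removes the grouping afterwards.

```r
library(dplyr)

# Made-up data: two players observed over several seasons
toy <- tibble(
  player_id = c(1, 1, 1, 2, 2),
  season    = c(2014, 2015, 2016, 2014, 2015),
  war       = c(2.1, 5.3, 4.0, 0.5, 1.2)
)

toy_best <- toy |>
  group_by(player_id) |>          # subsequent steps run once per player
  mutate(best_war = max(war)) |>  # maximum war within each player
  ungroup()                       # drop the grouping when finished

print(toy_best)
```

Without group_by(), every row would get 5.3, the maximum over the whole dataset.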
Take a quick glimpse() of your data to confirm the first few values of best_war match those below before proceeding.
Rows: 13,917
Columns: 18
$ x_mlbamid   <dbl> 545361, 457763, 518960, 457705, 519317, 488726, 543685, 43…
$ season      <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014…
$ team_name   <chr> "LAA", "SFG", "MIL", "PIT", "MIA", "CLE", "WSN", "TOR", "P…
$ bats        <chr> "R", "R", "R", "R", "R", "L", "R", "R", "R", "R", "R", "R"…
$ player_name <chr> "Mike Trout", "Buster Posey", "Jonathan Lucroy", "Andrew M…
$ age         <dbl> 22, 27, 28, 27, 24, 27, 24, 33, 31, 35, 28, 28, 31, 23, 30…
$ war         <dbl> 8.2866, 7.5222, 7.4368, 7.4014, 6.8473, 6.5310, 6.4054, 6.…
$ g           <dbl> 157, 147, 153, 146, 145, 156, 153, 155, 111, 148, 148, 158…
$ ab          <dbl> 602, 547, 585, 548, 539, 611, 613, 553, 379, 549, 574, 608…
$ pa          <dbl> 705, 605, 655, 648, 638, 676, 683, 673, 460, 614, 644, 695…
$ h           <dbl> 173, 170, 176, 172, 155, 200, 176, 158, 110, 178, 163, 155…
$ x1b         <dbl> 89, 118, 108, 103, 86, 133, 110, 96, 79, 125, 102, 93, 134…
$ x2b         <dbl> 39, 28, 53, 38, 31, 45, 39, 27, 20, 33, 34, 31, 37, 37, 34…
$ x3b         <dbl> 9, 2, 2, 6, 1, 2, 6, 0, 0, 1, 4, 2, 2, 9, 1, 1, 2, 1, 3, 1…
$ hr          <dbl> 36, 22, 13, 25, 37, 20, 21, 35, 11, 19, 23, 29, 14, 16, 19…
$ r           <dbl> 115, 72, 73, 89, 89, 94, 111, 101, 45, 79, 95, 93, 77, 92,…
$ rbi         <dbl> 111, 89, 69, 83, 105, 97, 83, 103, 67, 77, 73, 98, 82, 69,…
$ best_war    <dbl> 9.4559, 7.5222, 7.4368, 7.4014, 6.8473, 6.5310, 6.7801, 6.…
CODE TIP: In real-life data science work, you won’t usually be provided with the “correct” answer to compare to, so it’s often a good idea to do a quick check after any data transformation to make sure your code did what you expected. In this case, you might choose one player to verify that their best_war value is in fact equal to their maximum war value. You can do a quick filter for that player in your console, or use the search feature when viewing the full data in spreadsheet view (for example, with View()).
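One possible way to run such a spot check is sketched below on made-up data. The tibble toy_best and the player_id values are hypothetical; with the real data you would filter on one player's x_mlbamid instead.

```r
library(dplyr)

# Made-up data that already contains a best_war column
toy_best <- tibble(
  player_id = c(1, 1, 1, 2, 2),
  war       = c(2.1, 5.3, 4.0, 0.5, 1.2),
  best_war  = c(5.3, 5.3, 5.3, 1.2, 1.2)
)

# Pick one player and compare their maximum war to the stored best_war
check <- toy_best |>
  filter(player_id == 1) |>
  summarize(max_war = max(war), stored_best = first(best_war))

print(check)
print(check$max_war == check$stored_best)
```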
Hint: what dplyr verb do you need to keep rows that meet a condition?
Ideally, we want there to be one row per player in our new dataset. However, if we check the number of unique players we have in our original data, we find this does not match the number of rows in prime_age.
CODE TIP: Two options for checking the number of unique levels of a variable are length(unique(data$variable)) or data |> distinct(variable) |> nrow()
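Both options can be tried side by side on a small made-up example. The tibble toy and the column player_id are illustrative only.

```r
library(dplyr)

# Made-up data: three distinct players across six rows
toy <- tibble(player_id = c(1, 1, 1, 2, 2, 3))

# Option 1: base R
n_base <- length(unique(toy$player_id))

# Option 2: dplyr
n_dplyr <- toy |> distinct(player_id) |> nrow()

print(c(n_base, n_dplyr))
```

Comparing either count to the number of rows in the data shows whether there is exactly one row per player.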
CODE TIP: group_by(grouping_variable) followed by mutate(n = n()) will count the number of rows per level of the grouping variable.
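The counting pattern can be sketched as follows on made-up data. In the hypothetical tibble toy_prime, player 2 appears twice because two seasons tie for the same war value; the follow-up filter(n > 1) is one assumed way to isolate such players.

```r
library(dplyr)

# Made-up data: player 2 has two rows with the same war
toy_prime <- tibble(
  player_id = c(1, 2, 2, 3),
  age       = c(27, 24, 29, 31),
  war       = c(5.3, 1.2, 1.2, 0.8)
)

counted <- toy_prime |>
  group_by(player_id) |>
  mutate(n = n()) |>   # number of rows for each player
  ungroup()

# Players that appear more than once
repeats <- counted |> filter(n > 1)

print(counted)
print(repeats)
```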
Your reduced prime_age should look something like this:
Rows: 3,752
Columns: 18
$ x_mlbamid   <dbl> 457763, 518960, 457705, 519317, 488726, 430832, 431145, 13…
$ season      <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014…
$ team_name   <chr> "SFG", "MIL", "PIT", "MIA", "CLE", "TOR", "PIT", "TEX", "M…
$ bats        <chr> "R", "R", "R", "R", "L", "R", "R", "R", "R", "R", "L", "L"…
$ player_name <chr> "Buster Posey", "Jonathan Lucroy", "Andrew McCutchen", "Gi…
$ age         <dbl> 27, 28, 27, 24, 27, 33, 31, 35, 28, 23, 30, 24, 27, 35, 29…
$ war         <dbl> 7.5222, 7.4368, 7.4014, 6.8473, 6.5310, 6.1703, 6.1427, 5.…
$ g           <dbl> 147, 153, 146, 145, 156, 155, 111, 148, 148, 148, 156, 140…
$ ab          <dbl> 547, 585, 548, 539, 611, 553, 379, 549, 574, 558, 563, 524…
$ pa          <dbl> 605, 655, 648, 638, 676, 673, 460, 614, 644, 640, 643, 616…
$ h           <dbl> 170, 176, 172, 155, 200, 158, 110, 178, 163, 165, 150, 150…
$ x1b         <dbl> 118, 108, 103, 86, 133, 96, 79, 125, 102, 103, 96, 89, 103…
$ x2b         <dbl> 28, 53, 38, 31, 45, 27, 20, 33, 34, 37, 34, 28, 35, 37, 18…
$ x3b         <dbl> 2, 2, 6, 1, 2, 0, 0, 1, 4, 9, 1, 1, 2, 1, 1, 3, 1, 7, 3, 2…
$ hr          <dbl> 22, 13, 25, 37, 20, 35, 11, 19, 23, 16, 19, 32, 36, 16, 21…
$ r           <dbl> 72, 73, 89, 89, 94, 101, 45, 79, 95, 92, 87, 89, 80, 85, 7…
$ rbi         <dbl> 89, 69, 83, 105, 97, 103, 67, 77, 73, 69, 74, 78, 107, 82,…
$ best_war    <dbl> 7.5222, 7.4368, 7.4014, 6.8473, 6.5310, 6.1703, 6.1427, 5.…
Tip: Add a layer called geom_vline to your ggplot code. Make sure the colors of the lines are different.
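As a rough sketch of the geom_vline() layer, the code below adds two vertical reference lines in different colors to a histogram. The data frame toy_ages is made up, and using the mean and the median as the two lines is an assumption for illustration; substitute whichever values your plot calls for.

```r
library(ggplot2)

# Made-up ages standing in for a prime-age column
toy_ages <- data.frame(age = c(24, 25, 26, 26, 27, 27, 27, 28, 29, 31, 33))

p <- ggplot(toy_ages, aes(x = age)) +
  geom_histogram(binwidth = 1) +
  # One vertical line per summary value, each in its own color
  geom_vline(xintercept = mean(toy_ages$age), color = "red") +
  geom_vline(xintercept = median(toy_ages$age), color = "blue",
             linetype = "dashed")

print(p)
```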
Pitchers
Check: there are 2382 unique pitchers in the pitcher_stats data, so your final dataset for analysis should have that many rows.
Wrap-up / reflection
FOR FUN
You can investigate an individual player’s WAR trajectory over time using the app below. If you’re curious, you can see the R code that built the Shiny app here and even try making your own!