library(tidyverse)
batter_stats <- read_csv("batter_stats.csv")
pitcher_stats <- read_csv("pitcher_stats.csv")
What’s the prime age of an MLB player?
Introduction
In this module you will be analyzing data on Major League Baseball (MLB) players from the 2014 - 2023 seasons. The goal of this module is to practice your data wrangling skills while investigating the question “at what age do professional baseball players typically reach their prime?”
Data / Variable Descriptions
We’ll get started by loading necessary packages and datasets.
This data was curated via the baseballr package, which has built-in functions for acquiring a plethora of baseball data. In particular, data for this module was pulled in May 2024 using the fg_batter_leaders() and fg_pitcher_leaders() functions, which provide over 300 variables of statistics on batters and pitchers beginning with the 1974 season. This analysis engages with only a small subset of available variables and seasons, but we encourage baseball enthusiasts and aspiring analysts to check out all the baseballr package has to offer.
Variable Descriptions (all players)
Variable | Description |
---|---|
x_mlbamid | unique MLB player id |
season | season the row of data comes from |
team_name | three-character MLB team abbreviation |
player_name | first and last name of MLB player |
age | age of the MLB player during the corresponding season |
war | Wins Above Replacement: an estimate of the number of wins the player would bring to the team, above a regular replacement-level player. It measures the value of a player and how much better they are than an “average” one. |
g | number of games the player played during the corresponding season |
Variable Descriptions (batters only)
Variable | Description |
---|---|
bats | whether player bats right-handed or left-handed |
ab | number of at bats the player had during the corresponding season |
pa | number of plate appearances the player had during the corresponding season |
h | number of hits the player had during the corresponding season |
x1b | number of singles the player hit during the corresponding season |
x2b | number of doubles the player hit during the corresponding season |
x3b | number of triples the player hit during the corresponding season |
hr | number of home runs the player hit during the corresponding season |
r | number of runs the player had during the corresponding season |
rbi | number of runs batted in the player had during the corresponding season |
Variable Descriptions (pitchers only)
Variable | Description |
---|---|
throws | whether player throws right-handed or left-handed |
ip | number of innings the player pitched during the corresponding season |
era | Earned Run Average: average number of earned runs given up by a pitcher per nine innings pitched during the corresponding season |
whip | Walks plus Hits per Inning Pitched: a measure of the number of base-runners a pitcher has allowed per inning pitched during the corresponding season |
pitches | total number of pitches thrown by the pitcher during the corresponding season |
balls | total number of pitches thrown by the pitcher that were called as balls during the corresponding season |
strikes | total number of pitches thrown by the pitcher that were called as strikes during the corresponding season |
Prime Age Analysis
There are many different aspects of a player’s performance that determine how well he is playing. One comprehensive metric is called WAR, which stands for “wins above replacement.” The MLB provides the following definition of WAR:
“WAR measures a player’s value in all facets of the game by deciphering how many more wins he’s worth than a replacement-level player at his same position (e.g., a Minor League replacement or a readily available fill-in free agent).” - MLB.com
Therefore, one reasonable way to determine a player’s prime age would be to determine the age at which he had his highest WAR.
Batters
Let’s first note how our data is organized:
head(batter_stats)
# A tibble: 6 × 18
x_mlbamid season team_name bats player_name age war g ab pa
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 545361 2014 LAA R Mike Trout 22 8.29 157 602 705
2 457763 2014 SFG R Buster Posey 27 7.52 147 547 605
3 518960 2014 MIL R Jonathan Lucroy 28 7.44 153 585 655
4 457705 2014 PIT R Andrew McCutch… 27 7.40 146 548 648
5 519317 2014 MIA R Giancarlo Stan… 24 6.85 145 539 638
6 488726 2014 CLE L Michael Brantl… 27 6.53 156 611 676
# ℹ 8 more variables: h <dbl>, x1b <dbl>, x2b <dbl>, x3b <dbl>, hr <dbl>,
# r <dbl>, rbi <dbl>, best_war <dbl>
CODE TIP: The function head() returns the first 6 rows of a dataset, and the function tail() returns the last 6. You can add the argument n = to display a different number of rows. Note these are base R functions and do not require the tidyverse to use.
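As a minimal sketch of the n = argument, the toy data frame below (hypothetical values, not the module data) is trimmed to its first three and last two rows.

```r
# Toy data frame with 10 rows (hypothetical values, not the module data)
toy <- data.frame(id = 1:10, value = (1:10)^2)

first_three <- head(toy, n = 3)  # first 3 rows instead of the default 6
last_two <- tail(toy, n = 2)     # last 2 rows instead of the default 6

first_three
last_two
```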
If we arrange by x_mlbamid, we can see that there can be multiple observations per player, where each row represents a different season.
batter_stats |>
  arrange(x_mlbamid) |>
  slice(1:10)
# A tibble: 10 × 18
x_mlbamid season team_name bats player_name age war g ab pa
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 110029 2014 NYM L Bobby Abreu 40 -0.213 78 133 155
2 112526 2014 NYM R Bartolo Col… 41 -0.588 31 62 69
3 112526 2015 NYM R Bartolo Col… 42 -0.0408 33 58 64
4 112526 2016 NYM R Bartolo Col… 43 -0.243 34 60 65
5 112526 2017 - - - R Bartolo Col… 44 -0.266 28 19 20
6 112526 2018 TEX R Bartolo Col… 45 -0.0475 28 4 4
7 114739 2014 CLE L Jason Giambi 43 -0.496 26 60 70
8 115629 2014 COL R LaTroy Hawk… 41 -0.0141 57 1 1
9 115629 2015 - - - R LaTroy Hawk… 42 -0.0146 42 1 1
10 116338 2014 DET R Torii Hunter 38 1.11 142 549 586
# ℹ 8 more variables: h <dbl>, x1b <dbl>, x2b <dbl>, x3b <dbl>, hr <dbl>,
# r <dbl>, rbi <dbl>, best_war <dbl>
The pipe: Recall that |> is called the “pipe” operator and can be read as “and then.” In English, the code above can be read as “take the batter_stats data and then arrange it by x_mlbamid and then slice the first 10 rows.” Mathematically, the pipe accomplishes f(g(x)) with the (pseudo-)code x |> g() |> f(). Read more about the pipe here.
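To make the f(g(x)) reading concrete, here is a small sketch with g = sum and f = sqrt applied to a made-up numeric vector. It assumes R 4.1 or later, where the native pipe is available.

```r
# A made-up numeric vector for illustration
x <- c(1, 4, 9, 16)

nested <- sqrt(sum(x))         # f(g(x)) written inside-out
piped <- x |> sum() |> sqrt()  # the same computation, read left to right

identical(nested, piped)
```

Both forms compute the same value; the piped version simply lists the steps in the order they happen.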
dplyr: arrange() and slice() are examples of dplyr verbs: tidyverse functions that do something to / act on the data. Other examples include filter(), select(), mutate(), group_by(), summarize(), relocate(), and many more. These verbs are often chained together with the pipe to accomplish multiple data wrangling tasks. Read more about data wrangling with dplyr here.
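The sketch below chains several of these verbs on a toy data frame. The object toy_stats and its columns are hypothetical and only stand in for real data.

```r
library(dplyr)

# Toy data (hypothetical players and values, not the module data)
toy_stats <- data.frame(
  player = c("A", "B", "C", "D"),
  hits = c(150, 90, 120, 30),
  at_bats = c(500, 400, 450, 200)
)

toy_summary <- toy_stats |>
  filter(at_bats >= 400) |>        # keep rows that meet a condition
  mutate(avg = hits / at_bats) |>  # add a new column
  select(player, avg) |>           # keep only some columns
  arrange(desc(avg))               # sort rows, highest avg first

toy_summary
```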
TIP: Try writing your answer as a full sentence in the .qmd using inline code. For example, if you have the first season saved in an object first_season, then including `r first_season` outside a code chunk will allow you to auto-populate this value in a sentence.
CODE TIP: group_by() allows all subsequent actions to be done for each group of the grouping variable. Therefore, if we group by player id, we’re able to determine the maximum war for each player, not simply the maximum war for the whole dataset. It’s often a good idea to ungroup() at the end of a chain of code, otherwise the next time you try to use your data, it will still perform every operation by group.
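A minimal sketch of this group-then-ungroup pattern on toy data follows. The names toy_scores, id, score, and best_score are hypothetical, chosen only to illustrate the idea.

```r
library(dplyr)

# Toy data: several rows per id (hypothetical values, not the module data)
toy_scores <- data.frame(
  id = c(1, 1, 1, 2, 2),
  year = c(2020, 2021, 2022, 2021, 2022),
  score = c(2.0, 5.5, 3.1, 0.4, 1.2)
)

toy_scores <- toy_scores |>
  group_by(id) |>
  mutate(best_score = max(score)) |>  # maximum within each id, not overall
  ungroup()                           # drop the grouping for later steps

toy_scores
```

Without group_by(), max(score) would return the single largest value in the whole column for every row.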
Take a quick glimpse() of your data to confirm the first few values of best_war match those below before proceeding.
Rows: 13,917
Columns: 18
$ x_mlbamid <dbl> 545361, 457763, 518960, 457705, 519317, 488726, 543685, 43…
$ season <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014…
$ team_name <chr> "LAA", "SFG", "MIL", "PIT", "MIA", "CLE", "WSN", "TOR", "P…
$ bats <chr> "R", "R", "R", "R", "R", "L", "R", "R", "R", "R", "R", "R"…
$ player_name <chr> "Mike Trout", "Buster Posey", "Jonathan Lucroy", "Andrew M…
$ age <dbl> 22, 27, 28, 27, 24, 27, 24, 33, 31, 35, 28, 28, 31, 23, 30…
$ war <dbl> 8.2866, 7.5222, 7.4368, 7.4014, 6.8473, 6.5310, 6.4054, 6.…
$ g <dbl> 157, 147, 153, 146, 145, 156, 153, 155, 111, 148, 148, 158…
$ ab <dbl> 602, 547, 585, 548, 539, 611, 613, 553, 379, 549, 574, 608…
$ pa <dbl> 705, 605, 655, 648, 638, 676, 683, 673, 460, 614, 644, 695…
$ h <dbl> 173, 170, 176, 172, 155, 200, 176, 158, 110, 178, 163, 155…
$ x1b <dbl> 89, 118, 108, 103, 86, 133, 110, 96, 79, 125, 102, 93, 134…
$ x2b <dbl> 39, 28, 53, 38, 31, 45, 39, 27, 20, 33, 34, 31, 37, 37, 34…
$ x3b <dbl> 9, 2, 2, 6, 1, 2, 6, 0, 0, 1, 4, 2, 2, 9, 1, 1, 2, 1, 3, 1…
$ hr <dbl> 36, 22, 13, 25, 37, 20, 21, 35, 11, 19, 23, 29, 14, 16, 19…
$ r <dbl> 115, 72, 73, 89, 89, 94, 111, 101, 45, 79, 95, 93, 77, 92,…
$ rbi <dbl> 111, 89, 69, 83, 105, 97, 83, 103, 67, 77, 73, 98, 82, 69,…
$ best_war <dbl> 9.4559, 7.5222, 7.4368, 7.4014, 6.8473, 6.5310, 6.7801, 6.…
CODE TIP: In real-life data science work, you won’t usually be provided with the “correct” answer to compare to, so it’s often a good idea to do a quick check after any data transformation to make sure your code did what you expected. In this case, you might choose one player to verify that their best_war value is in fact equal to their maximum war value. You can do a quick filter for that player in your console, or use the search feature when viewing the full data with View() in spreadsheet view.
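One possible way to run such a spot check is sketched below on toy data. The data frame and column names are hypothetical stand-ins for a transformed dataset.

```r
library(dplyr)

# Toy data that already has a per-id maximum column (hypothetical values)
toy_checked <- data.frame(
  id = c(1, 1, 1, 2, 2),
  score = c(2.0, 5.5, 3.1, 0.4, 1.2),
  best_score = c(5.5, 5.5, 5.5, 1.2, 1.2)
)

# Pull out one id and confirm the stored maximum matches a direct calculation
one_id <- toy_checked |> filter(id == 1)
check_passes <- all(one_id$best_score == max(one_id$score))

check_passes
```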
Hint: what dplyr verb do you need to keep rows that meet a criterion?
Ideally, we want there to be one row per player in our new dataset. However, if we check the number of unique players we have in our original data, we find this does not match the number of rows in prime_age.
CODE TIP: Two options for checking the number of unique levels of a variable are length(unique(data$variable)) or data |> distinct(variable) |> nrow().
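Both options can be tried side by side on a toy data frame (hypothetical values); they should agree.

```r
library(dplyr)

# Toy data with repeated ids (hypothetical values, not the module data)
toy_ids <- data.frame(
  id = c(1, 1, 2, 3, 3, 3),
  year = c(2020, 2021, 2021, 2020, 2021, 2022)
)

n_base <- length(unique(toy_ids$id))            # base R approach
n_dplyr <- toy_ids |> distinct(id) |> nrow()    # dplyr approach

c(base = n_base, dplyr = n_dplyr)
```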
CODE TIP: group_by(grouping_variable) followed by mutate(n = n()) will count the number of rows per level of the grouping variable.
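Here is a short sketch of that counting pattern on toy data. The object toy_rows and its columns are hypothetical.

```r
library(dplyr)

# Toy data with a different number of rows per id (hypothetical values)
toy_rows <- data.frame(
  id = c(1, 1, 2, 3, 3, 3),
  year = c(2020, 2021, 2021, 2020, 2021, 2022)
)

toy_counts <- toy_rows |>
  group_by(id) |>
  mutate(n = n()) |>  # every row gets the size of its own group
  ungroup()

toy_counts
```

Rows whose n is greater than 1 belong to a group with more than one row, which is one way to spot duplicates.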
Your reduced prime_age should look something like this:
Rows: 3,752
Columns: 18
$ x_mlbamid <dbl> 457763, 518960, 457705, 519317, 488726, 430832, 431145, 13…
$ season <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014…
$ team_name <chr> "SFG", "MIL", "PIT", "MIA", "CLE", "TOR", "PIT", "TEX", "M…
$ bats <chr> "R", "R", "R", "R", "L", "R", "R", "R", "R", "R", "L", "L"…
$ player_name <chr> "Buster Posey", "Jonathan Lucroy", "Andrew McCutchen", "Gi…
$ age <dbl> 27, 28, 27, 24, 27, 33, 31, 35, 28, 23, 30, 24, 27, 35, 29…
$ war <dbl> 7.5222, 7.4368, 7.4014, 6.8473, 6.5310, 6.1703, 6.1427, 5.…
$ g <dbl> 147, 153, 146, 145, 156, 155, 111, 148, 148, 148, 156, 140…
$ ab <dbl> 547, 585, 548, 539, 611, 553, 379, 549, 574, 558, 563, 524…
$ pa <dbl> 605, 655, 648, 638, 676, 673, 460, 614, 644, 640, 643, 616…
$ h <dbl> 170, 176, 172, 155, 200, 158, 110, 178, 163, 165, 150, 150…
$ x1b <dbl> 118, 108, 103, 86, 133, 96, 79, 125, 102, 103, 96, 89, 103…
$ x2b <dbl> 28, 53, 38, 31, 45, 27, 20, 33, 34, 37, 34, 28, 35, 37, 18…
$ x3b <dbl> 2, 2, 6, 1, 2, 0, 0, 1, 4, 9, 1, 1, 2, 1, 1, 3, 1, 7, 3, 2…
$ hr <dbl> 22, 13, 25, 37, 20, 35, 11, 19, 23, 16, 19, 32, 36, 16, 21…
$ r <dbl> 72, 73, 89, 89, 94, 101, 45, 79, 95, 92, 87, 89, 80, 85, 7…
$ rbi <dbl> 89, 69, 83, 105, 97, 103, 67, 77, 73, 69, 74, 78, 107, 82,…
$ best_war <dbl> 7.5222, 7.4368, 7.4014, 6.8473, 6.5310, 6.1703, 6.1427, 5.…
Tip: Add a layer called geom_vline to your ggplot code. Make sure the colors of the lines are different.
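As a rough sketch of layering geom_vline() onto a plot, the example below draws a histogram of made-up ages and marks two reference values in different colors. Using the mean and median as the reference lines is an assumption made for illustration, and toy_ages is hypothetical data.

```r
library(ggplot2)

# Made-up ages for illustration (not the module data)
toy_ages <- data.frame(
  age = c(22, 24, 25, 26, 26, 27, 27, 27, 28, 29, 31, 35)
)

age_plot <- ggplot(toy_ages, aes(x = age)) +
  geom_histogram(binwidth = 1) +
  # one vertical line per reference value, each in its own color
  geom_vline(xintercept = mean(toy_ages$age), color = "red") +
  geom_vline(xintercept = median(toy_ages$age), color = "blue",
             linetype = "dashed")

age_plot
```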
Pitchers
Check: there are 2382 unique pitchers in the pitcher_stats data, so your final dataset for analysis should have that many rows.
Wrap-up / reflection
FOR FUN
You can investigate an individual player’s WAR trajectory over time using the app below. If you’re curious, you can see the R code that built the Shiny app here and even try making your own!