Lecture 7: Offensive Credit Allocation in Baseball
Overview
In Lecture 6, we computed the run value created by the offensive team in each at-bat of the 2024 MLB regular season. Run value is the sum of (i) the number of runs scored in the at-bat and (ii) the change in the number of runs the batting team is expected to score in the remainder of the half-inning. This change in expected runs is driven by the change in the combination of the number of outs and baserunner configuration. We then ranked players based on their run value totals, aggregating over all their at-bats. While the resulting rankings did appear to pass the “eye test” — both Aaron Judge and Shohei Ohtani created some of the largest run values — the metric implicitly gives batters all the credit for creating run value.
Over the next two lectures, we will develop our own version of wins above replacement. Our development largely follows that of Baumer, Jensen, and Matthews (2015) but with some important differences.
Conservation of Runs
The central idea, which Baumer, Jensen, and Matthews (2015) call the “conservation of runs” framework, is that if the batting team gains \(\delta_{i}\) units of run value during an at-bat, the fielding team gains \(-\delta_{i}\) units of run value during that same at-bat. In this lecture, we will apportion \(\delta_{i}\) between the batter (Section 4) and the baserunners (Section 5) involved in at-bat \(i.\) In Lecture 8, we will apportion \(-\delta_{i}\) between the pitcher and fielders involved in at-bat \(i.\)
Data Preparation
To divide up offensive run value, we need to create a data table whose rows correspond to individual at-bats. This data table must, at a minimum, contain the starting and ending outs and baserunner configurations as well as the identities of the baserunners at the start and end of the at-bat. We will also want to include the columns event and des, which record the events and a narrative description of what happened in the at-bat.
The column des includes a much more detailed description of what happened during the plate appearance. A cursory look through the values of des corresponding to rows with missing end_events reveals that several of these at-bats ended with a walk, a strikeout on an automatic strike [1], or an inning-ending pick-off [2]:
# A tibble: 15 × 3
    Outs end_Outs des
   <int>    <dbl> <chr>
 1     0        0 Mookie Betts walks.
 2     2        2 Freddie Freeman walks.
 3     2        3 Xander Bogaerts strikes out on automatic strike.
 4     2        2 Héctor Neris intentionally walks Wyatt Langford.
 5     2        2 Logan Webb intentionally walks Ha-Seong Kim.
 6     2        2 Cole Ragans intentionally walks Carlos Santana.
 7     1        2 Andrew Vaughn strikes out on automatic strike.
 8     2        3 Oneil Cruz strikes out on automatic strike.
 9     2        3 Pitcher Bryce Miller picks off Wilyer Abreu at on throw to sh…
10     1        1 Yohan Ramírez intentionally walks Christian Yelich.
11     1        2 Alex Kirilloff strikes out on automatic strike.
12     1        1 Tony Kemp walks. James McCann to 2nd.
13     2        3 With Anthony Rendon batting, Zach Neto picked off and caught …
14     2        3 Miguel Sanó strikes out on automatic strike.
15     2        3 With Vinnie Pasquantino batting, Bobby Witt Jr. picked off an…
The following code manually corrects the missing values for end_events.
After accounting for the walks and the strikeouts on automatic strikes, all but one of the at-bats that still had a missing end_events value involved a pick-off that ended the inning.
The remaining at-bat involved a fly out that was caught in foul territory.
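The corrections described above can be sketched roughly as follows. This is a minimal illustration in base R, not the code used in the lecture: the toy descriptions, the helper fill_end_event(), and the event labels ("walk", "intent_walk", "strikeout", "truncated_pa") are assumptions chosen for the example.

```r
# Toy stand-in for the at-bat table (hypothetical players and descriptions).
toy_atbats <- data.frame(
  des = c("Player A walks.",
          "Player B strikes out on automatic strike.",
          "Pitcher C intentionally walks Player D.",
          "With Player E batting, Player F picked off and caught stealing 2nd base.",
          "Player G singles on a line drive."),
  end_events = c(NA, NA, NA, NA, "single"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill in missing end_events by matching phrases in des.
fill_end_event <- function(des, end_events) {
  # Phrase -> label rules (labels are assumptions). Order matters:
  # "intentionally walks" must be tested before the more general "walks".
  rules <- c("intentionally walks"             = "intent_walk",
             "walks"                           = "walk",
             "strikes out on automatic strike" = "strikeout",
             "picks off"                       = "truncated_pa",
             "picked off"                      = "truncated_pa")
  for (pat in names(rules)) {
    # Only touch rows that are still missing, so earlier rules take priority.
    hit <- is.na(end_events) & grepl(pat, des, fixed = TRUE)
    end_events[hit] <- rules[[pat]]
  }
  end_events
}

toy_atbats$end_events <- fill_end_event(toy_atbats$des, toy_atbats$end_events)
toy_atbats[, c("end_events", "des")]
```

Rows that already have an event (the single in the last row) are left untouched, and any description that matches none of the phrases stays missing so that it can be inspected by hand.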
Adjusted Run Values
We want to give credit to the batter and base runners for creating value over and above what would have been expected given the game state and the actual outcome of the at-bat. More precisely, recall that \(\delta_{i}\) is the run value created in at-bat \(i.\) We will denote the game state at the beginning of the at-bat with \(\textrm{g}_{i}\) and the ending event with \(\textrm{e}_{i}.\) We form the game state variable \(\textrm{g}\) by concatenating the Outs and BaseRunners and separating them with a period so that \(\textrm{g} = "0.101"\) corresponds to a situation with no outs and runners on first and third base.
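As a small sketch of this encoding, assuming Outs is stored as an integer and BaseRunners as a three-character string whose digits indicate first, second, and third base (the toy vectors below are made up for illustration):

```r
# Toy values; in the lecture these would be columns of the at-bat table.
Outs <- c(0L, 2L, 1L)
BaseRunners <- c("101", "000", "110")

# Concatenate outs and base runner configuration, separated by a period.
g <- paste(Outs, BaseRunners, sep = ".")
g

# Enumerate every possible game state: 3 out counts x 8 configurations.
grid <- expand.grid(Outs = 0:2,
                    BaseRunners = c("000", "100", "010", "110",
                                    "001", "101", "011", "111"),
                    stringsAsFactors = FALSE)
all_states <- paste(grid$Outs, grid$BaseRunners, sep = ".")
length(all_states)
```

The first element of g is "0.101", the state with no outs and runners on first and third.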
We will assume that the run value created in each at-bat beginning in state \(\textrm{g}\) and ending with event \(\textrm{e}\) is equal to the average run value created in all at-bats with the same beginning and end plus some mean-zero error. That is, for each at-bat \(i\), \[
\delta_{i} = \mathbb{E}[\delta \vert \textrm{g} = \textrm{g}_{i}, \textrm{e} = \textrm{e}_{i}] + \varepsilon_{i},
\] where \(\mu:=\mathbb{E}[\delta \vert \textrm{g}, \textrm{e}]\) is the average run value created in at-bats that begin in state \(\textrm{g}\) and end with the event \(\textrm{e}.\)
It is tempting to compute the expectation \(\mathbb{E}[\delta \vert \textrm{g}, \textrm{e}]\) using the “binning-and-averaging” approach we took when developing our initial XG models back in Lecture 2. Unfortunately, such a procedure is liable to yield extreme and erratic answers as the number of bins is quite large. To wit, there are 24 distinct game states (i.e., combinations of outs and base runners) and 21 different events.
The 2024 dataset contains only 373 of the 504 total combinations of game state and ending event. Of the observed combinations, there is a huge disparity in the relative frequencies. Some combinations (e.g., triples with no outs and runners on second and third) occurred just once while others (e.g., striking out with no outs and nobody on) occurred close to 10,000 times.
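To see how sparse the bins can be, one could cross-tabulate game state against ending event and count the cells. The sketch below uses a tiny made-up data frame rather than the 2024 data, so the counts it prints are illustrative only.

```r
# Toy at-bats: three game states and three events appear overall.
toy <- data.frame(
  g = c("0.000", "0.000", "0.000", "2.000", "0.011"),
  e = c("strikeout", "strikeout", "single", "single", "triple"),
  stringsAsFactors = FALSE
)

# One row per (state, event) cell, including cells that were never observed.
cell_counts <- as.data.frame(table(g = toy$g, e = toy$e),
                             stringsAsFactors = FALSE)

n_possible <- nrow(cell_counts)          # number of bins
n_observed <- sum(cell_counts$Freq > 0)  # bins with at least one at-bat
c(possible = n_possible, observed = n_observed)

# Cells sorted by frequency show the disparity between common and rare bins.
cell_counts[order(-cell_counts$Freq), ]
```

With the real data, the same tabulation would have 24 × 21 = 504 possible cells, of which the text reports 373 as observed.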
Instead of “binning and averaging”, we will fit a statistical model, much as we did with our distance-based XG models in Lecture 3. A natural starting model asserts that there are numbers \(\alpha_{0.000}, \ldots, \alpha_{2.111}\), one per game state, and \(\alpha_{\textrm{catcher\_interf}}, \ldots, \alpha_{\textrm{walk}}\), one per ending event, such that for all game states \(\textrm{g}\) and ending events \(\textrm{e},\)\[
\mathbb{E}[\delta \vert \textrm{g}, \textrm{e}] = \alpha_{\textrm{g}} + \alpha_{\textrm{e}}.
\]
Under the assumed model, the average run value created by hitting a single when there are two outs and no runners on is \(\alpha_{\textrm{2.000}} + \alpha_{\textrm{single}}\) while the average run value created by hitting a single when there are no outs and runners on first and second is \(\alpha_{\textrm{0.110}} + \alpha_{\textrm{single}}.\)
Because we do not know the exact values of the \(\alpha_{\textrm{g}}\)’s and \(\alpha_{\textrm{e}}\)’s, we need to estimate them using our data. Perhaps the simplest way is by solving a least squares minimization problem \[
\hat{\boldsymbol{\alpha}} = \underset{\boldsymbol{\alpha}}{\textrm{argmin}} \sum_{i = 1}^{n}{(\delta_{i} - \alpha_{\textrm{g}_{i}} - \alpha_{\textrm{e}_{i}})^2},
\] where \(\textrm{g}_{i}\) and \(\textrm{e}_{i}\) record the game state and event of at-bat \(i.\)
Solving this problem is equivalent to fitting a linear regression model without an intercept. We can do this in R using the lm() function and including -1 in the formula argument. In the following code, we create a temporary data frame that extracts just the run values \(\delta\), game states \(\textrm{g}\), and ending events \(\textrm{e}\) from atbats2024 and converts the game state and event variables into factors.
We estimate \(\hat{\alpha}_{2.000} \approx 0.356\) and \(\hat{\alpha}_{\textrm{single}} \approx 0.115.\) So, according to our fitted model, the average run value created by singling when there are two outs and no runners on is about \(0.356 + 0.115 = 0.471.\)
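The fitting step can be sketched on a small synthetic data set. Everything below (the data frame toy, the "true" values in true_g and true_e) is invented for illustration and is not the lecture's data; the run values are constructed to be exactly additive so that the fitted sums are easy to check by hand.

```r
# Synthetic at-bats: every combination of 3 game states and 3 events.
toy <- expand.grid(g = c("0.000", "0.110", "2.000"),
                   e = c("single", "strikeout", "walk"),
                   stringsAsFactors = TRUE)

# Made-up additive effects used to generate the run values.
true_g <- c("0.000" = 0.10, "0.110" = 0.30, "2.000" = 0.05)
true_e <- c("single" = 0.40, "strikeout" = -0.30, "walk" = 0.25)
toy$delta <- unname(true_g[as.character(toy$g)] + true_e[as.character(toy$e)])

# Least squares fit without an intercept.
fit <- lm(delta ~ -1 + g + e, data = toy)
coef(fit)

# Predicted average run value for a single with two outs and nobody on.
new_ab <- data.frame(g = factor("2.000", levels = levels(toy$g)),
                     e = factor("single", levels = levels(toy$e)))
pred <- unname(predict(fit, newdata = new_ab))
pred  # equals true_g["2.000"] + true_e["single"] = 0.45 here
```

One caveat about reading the coefficients: with two factors and no intercept, R keeps a coefficient for every level of the first factor but drops the first level of the second factor (here "single"), effectively setting it to zero. The individual \(\alpha\)'s are therefore only determined up to a constant shift, while sums of the form \(\alpha_{\textrm{g}} + \alpha_{\textrm{e}}\), which are what predict() returns, do not depend on that choice.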
Statistical Significance & Model Assumptions
You’ll notice that summary() returns a lot of inferential output (e.g., standard errors, p-values). These are computed under the additional assumption that the true errors \(\varepsilon_{i}\) are independent and follow a mean-zero normal distribution with constant variance. Since our main interest is prediction, we’re really not interested in checking whether, say, \(\alpha_{0.010}\) is statistically significantly different from zero. So, we will not check whether the usual multiple linear regression assumptions hold, nor will we attempt any formal inference. If you did want to make inferential statements about our model parameters, you would first need to check that the multiple linear regression assumptions are not grossly violated.
Equipped with our estimated model parameters, for each at-bat \(i,\) let \(\hat{\mu}_{i} = \hat{\alpha}_{\textrm{g}_{i}} + \hat{\alpha}_{\textrm{e}_{i}}\) and let \(\eta_{i} = \delta_{i} - \hat{\mu}_{i}.\) In terms of dividing credit between the batter and the base runners, we will follow Baumer, Jensen, and Matthews (2015) and attribute \(\hat{\mu}_{i}\) to the batter’s hitting in at-bat \(i\) and \(\eta_{i}\) to the base running in that at-bat. We will add columns to atbats2024 holding the values of \(\hat{\mu}\) (mu) and \(\eta\) (eta).
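In R, \(\hat{\mu}_{i}\) is simply the fitted value and \(\eta_{i}\) the residual of the regression. The sketch below shows this on a small made-up data frame (the values in toy are invented, and the column names mu and eta follow the text).

```r
# Made-up at-bats with a game state, an ending event, and a run value.
toy <- data.frame(
  g = factor(c("0.000", "0.000", "0.110", "0.110",
               "2.000", "2.000", "0.000", "2.000")),
  e = factor(c("single", "strikeout", "single", "strikeout",
               "single", "strikeout", "single", "single")),
  delta = c(0.45, -0.25, 1.10, -0.60, 0.12, -0.10, 0.35, 0.14)
)

fit <- lm(delta ~ -1 + g + e, data = toy)

toy$mu  <- unname(fitted(fit))   # share attributed to the batter's hitting
toy$eta <- toy$delta - toy$mu    # share attributed to base running

# Least squares residuals are orthogonal to the state and event indicators,
# so eta sums to (numerically) zero within each game state and each event.
tapply(toy$eta, toy$g, sum)
tapply(toy$eta, toy$e, sum)
```

The last two lines illustrate why \(\eta\) can be read as run value "above average": within any game state or any event, the positive and negative values cancel.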
Ohtani’s second hit against the Padres on March 20, 2024 was a single with runners on first and second and no outs. While Ohtani and the runner originally on first each advanced one base, the runner originally on second scored. Because this latter runner advanced more than might otherwise have been expected, it makes sense to give him a larger share of \(\eta_{i}\) than Ohtani and the runner from first, who each advanced only one base on the single. Following Baumer, Jensen, and Matthews (2015, sec. 3.2), the amount of base running run value \(\eta_{i}\) that we assign to base runner \(j\) in at-bat \(i\) will be proportional to \(\kappa_{ij} = \mathbb{P}(K \leq k_{ij} \vert \textrm{e}_{i}),\) where \(k_{ij}\) is the number of bases actually advanced by the base runner and \(K\) is the number of bases advanced by a typical runner starting from the same base.
Essentially, \(\kappa_{ij}\) is the probability that a typical base runner advanced at most the \(k_{ij}\) bases advanced by base runner \(j\) in at-bat \(i\) following event \(\textrm{e}_{i}.\) If the base runner does worse than expected (e.g., not advancing from second on a single), then \(\kappa_{ij}\) will be very small. But if the base runner does better than expected (e.g., scoring from second on a single), then \(\kappa_{ij}\) will be larger. When computing \(\kappa_{ij}\) it is crucial that we condition on the actual ending event \(\textrm{e}_{i}.\) After all, while we may want to penalize a runner for not advancing from second on a single, we definitely don’t want to penalize a runner for not advancing from second following a strike out!
Baserunner Advancement
Unfortunately, Statcast does not compute the number of bases that each runner advances during each at-bat. The following code implements a function that determines the number of bases advanced by the runner on first (if any). It works by first checking whether there is anyone on first base at the start of the at-bat. If so, it checks whether that player is on first, second, or third base at the end of the at-bat. If not, it parses the at-bat description contained in des and looks for a sentence containing the player’s name. If that sentence contains the words “out” or “caught stealing”, it sets the number of bases advanced to 0. But, if the sentence contains the word “score”, it sets the number of bases advanced to 3, since the runner scored from first.
load("player2024_lookup.RData")

mvt_1b <- function(on_1b, Outs, bat_score,
                   end_on_1b, end_on_2b, end_on_3b,
                   end_Outs, end_bat_score, des){
  mvt <- NA
  if(!is.na(on_1b)){
    # there was someone on 1st base at the start of the at-bat
    if(!is.na(end_on_1b) & on_1b == end_on_1b) mvt <- 0 # runner remained on first, so they advanced 0 bases
    if(!is.na(end_on_2b) & on_1b == end_on_2b) mvt <- 1 # runner advanced 1 base (first to second)
    if(!is.na(end_on_3b) & on_1b == end_on_3b) mvt <- 2 # runner advanced 2 bases (first to third)
    if(is.na(mvt)){
      # either there are no baserunners at end of inning or
      # there are baserunners but none of them started on first
      # we need to parse the play
      # Start by grabbing the player name
      player_name <- player2024_lookup$Name[which(player2024_lookup$key_mlbam == on_1b)]
      # Then split the description into sentences
      play_split <- stringr::str_split_1(
        string = stringi::stri_trans_general(des, "Latin-ASCII"),
        pattern = "(?<=[[:punct:]])\\s(?=[A-Z])")
      check <- sapply(play_split, FUN = grepl, pattern = player_name)
      if(any(check)){
        # found something with player name in it
        play <- play_split[check]
        if(any(grepl(pattern = "out", x = play) |
               grepl(pattern = "caught stealing", x = play))) mvt <- 0 # player got out
        else if(any(grepl(pattern = "score", x = play))) mvt <- 3 # player scored from 1st
      } else{
        # player name is not present in play description; and they're not on base
        # if they got caught stealing in the middle of the at-bat this may not be recorded
        # check if Outs < end_Outs
        if(end_Outs == 3 | Outs < end_Outs & bat_score == end_bat_score) mvt <- 0
      }
    }
  }
  return(mvt)
}
We similarly define functions to track the number of bases advanced by the runners on second and third base and by the batter.

mvt_2b <- function(on_2b, Outs, bat_score,
                   end_on_2b, end_on_3b,
                   end_Outs, end_bat_score, des){
  mvt <- NA
  if(!is.na(on_2b)){
    # there was someone on 2nd base at the start of the at-bat
    if(!is.na(end_on_2b) & on_2b == end_on_2b) mvt <- 0 # runner remained on 2nd
    if(!is.na(end_on_3b) & on_2b == end_on_3b) mvt <- 1 # runner advanced to 3rd
    #if(end_Outs == 3) mvt <- 0 # inning ended ; there may be some edge cases here
    # e.g., in last at-bat there may be a wild pitch
    # https://www.espn.com/mlb/playbyplay/_/gameId/401568474 where runner scores and then batter gets out to end the inning
    if(is.na(mvt)){
      # either there are no baserunners at end of inning or
      # there are baserunners but none of them started on second
      # we need to parse the play
      # Start by grabbing the player name
      player_name <- player2024_lookup$Name[which(player2024_lookup$key_mlbam == on_2b)]
      # Then split the description into sentences
      play_split <- stringr::str_split_1(
        string = stringi::stri_trans_general(des, "Latin-ASCII"),
        pattern = "(?<=[[:punct:]])\\s(?=[A-Z])")
      check <- sapply(play_split, FUN = grepl, pattern = player_name)
      if(any(check)){
        # found something with player name in it
        play <- play_split[check]
        if(any(grepl(pattern = "out", x = play) |
               grepl(pattern = "caught stealing", x = play))) mvt <- 0 # player got out
        else if(any(grepl(pattern = "score", x = play))) mvt <- 2 # player scored from 2nd
      } else{
        # player name is not present in play description; and they're not on base
        # if they got caught stealing in the middle of the at-bat this may not be recorded
        # check if Outs < end_Outs
        if(end_Outs == 3 | Outs < end_Outs & bat_score == end_bat_score) mvt <- 0
      }
    }
  }
  return(mvt)
}

mvt_3b <- function(on_3b, Outs, bat_score,
                   end_on_3b,
                   end_Outs, end_bat_score, des){
  mvt <- NA
  if(!is.na(on_3b)){
    if(!is.na(end_on_3b) & on_3b == end_on_3b) mvt <- 0 # runner remained on 3rd
    if(is.na(mvt)){
      # either there are no baserunners at end of inning or
      # there are baserunners but none of them started on third
      # we need to parse the play
      # Start by grabbing the player name
      player_name <- player2024_lookup$Name[which(player2024_lookup$key_mlbam == on_3b)]
      play_split <- stringr::str_split_1(
        string = stringi::stri_trans_general(des, "Latin-ASCII"),
        pattern = "(?<=[[:punct:]])\\s(?=[A-Z])")
      check <- sapply(play_split, FUN = grepl, pattern = player_name)
      if(any(check)){
        # found something with player name in it
        play <- play_split[check]
        if(any(grepl(pattern = "out", x = play) |
               grepl(pattern = "caught stealing", x = play))) mvt <- 0 # player got out
        else if(any(grepl(pattern = "score", x = play))) mvt <- 1 # player scored from 3rd
      } else{
        # player name is not present in play description; and they're not on base
        # if they got caught stealing in the middle of the at-bat this may not be recorded
        # check if Outs < end_Outs
        if(end_Outs == 3 | Outs < end_Outs & bat_score == end_bat_score) mvt <- 0
      }
    }
  }
  return(mvt)
}

mvt_batter <- function(batter, Outs, bat_score,
                       end_on_1b, end_on_2b, end_on_3b,
                       end_Outs, end_bat_score, des){
  mvt <- NA
  if(!is.na(end_on_1b) & batter == end_on_1b) mvt <- 1 # batter advanced to 1st
  else if(!is.na(end_on_2b) & batter == end_on_2b) mvt <- 2 # batter advanced to 2nd
  else if(!is.na(end_on_3b) & batter == end_on_3b) mvt <- 3 # batter advanced to 3rd
  else{
    # batter is not on base
    # look up player name
    player_name <- player2024_lookup$Name[which(player2024_lookup$key_mlbam == batter)]
    play_split <- stringr::str_split_1(
      string = stringi::stri_trans_general(des, "Latin-ASCII"),
      pattern = "(?<=[[:punct:]])\\s(?=[A-Z])")
    check <- sapply(play_split, FUN = grepl, pattern = player_name)
    if(any(check)){
      # found something with player name in it
      play <- play_split[check]
      if(any(grepl(pattern = "out", x = play))) mvt <- 0 # player got out
      else if(any(grepl(pattern = "score", x = play) |
                  grepl(pattern = "home", x = play))) mvt <- 4 # batter scored
      else if(end_Outs == 3 | Outs < end_Outs & bat_score == end_bat_score) mvt <- 0
      else mvt <- NA
    }
  }
  return(mvt)
}
We can now apply these functions to every row of our data frame.
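One way to apply a scalar function like these to every row is mapply(), which walks several columns in parallel. The sketch below uses a deliberately simplified, hypothetical stand-in (mvt_1b_simple, which only compares player IDs and does no text parsing) and a made-up four-row data frame, so that it runs without the player lookup table.

```r
# Hypothetical, simplified version of mvt_1b: ID comparison only.
mvt_1b_simple <- function(on_1b, end_on_1b, end_on_2b, end_on_3b) {
  if (is.na(on_1b)) return(NA_real_)                        # nobody on first
  if (!is.na(end_on_1b) && on_1b == end_on_1b) return(0)    # stayed on first
  if (!is.na(end_on_2b) && on_1b == end_on_2b) return(1)    # first to second
  if (!is.na(end_on_3b) && on_1b == end_on_3b) return(2)    # first to third
  NA_real_                                                  # would need des
}

# Made-up at-bats; the integers play the role of MLBAM player IDs.
toy <- data.frame(
  on_1b     = c(11L, NA, 12L, 13L),
  end_on_1b = c(21L, 22L, 12L, NA),
  end_on_2b = c(11L, NA, NA, NA),
  end_on_3b = c(NA, NA, NA, 13L)
)

# Apply the function row by row; each argument is one column of the table.
toy$mvt_1b <- mapply(mvt_1b_simple,
                     toy$on_1b, toy$end_on_1b, toy$end_on_2b, toy$end_on_3b)
toy
```

The same pattern extends to the full functions by passing the additional columns (Outs, bat_score, end_Outs, end_bat_score, des) as further arguments to mapply().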
Now that we have computed the \(k_{ij}\)’s — that is, the number of bases each base runner advanced in each at-bat — we are ready to compute the cumulative base running probabilities \(\mathbb{P}(K \leq k \vert \textrm{e}).\) In the following code, we first group at-bats by the ending event and then compute the proportion of times that the baserunner advances at most \(k\) bases. We also set the cumulative probability to zero for situations when there isn’t a runner on a particular base.
The table br_1b_probs contains the cumulative base running probabilities for runners who start on first base broken down by ending event. We find that in about 64.7% of singles, the runner on first advances one base or fewer while in 95.7% of singles, the runner on first advances two bases or fewer.
# A tibble: 5 × 3
  end_events mvt_1b kappa_1b
  <chr>       <dbl>    <dbl>
1 single          0   0.0194
2 single          1   0.647
3 single          2   0.957
4 single          3   1
5 single         NA   0
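A table of this shape can be produced by computing, within each ending event, the proportion of runners who advanced at most \(k\) bases. The following base R sketch does this on made-up data (the values in toy are not from the 2024 season), using 0 for rows without a runner on first as described above.

```r
# Made-up movements of runners starting on first base.
toy <- data.frame(
  end_events = c(rep("single", 6), rep("double", 4)),
  mvt_1b     = c(0, 1, 1, 2, 3, NA, 2, 2, 3, 3),
  stringsAsFactors = FALSE
)

has_runner <- !is.na(toy$mvt_1b)

# Convention from the text: no runner on first means a probability of 0.
toy$kappa_1b <- 0

# Within each event, the share of runners advancing at most k bases.
toy$kappa_1b[has_runner] <- ave(
  toy$mvt_1b[has_runner],
  toy$end_events[has_runner],
  FUN = function(k) sapply(k, function(v) mean(k <= v))
)

# One row per (event, bases advanced) combination.
br_probs <- unique(toy[order(toy$end_events, toy$mvt_1b), ])
rownames(br_probs) <- NULL
br_probs
```

In this toy example, 3 of the 5 runners on first advanced one base or fewer on a single, so the corresponding cumulative probability is 0.6.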
Baserunning Runs Above Average
Now that we have the cumulative base running probabilities, we’re (finally) ready to compute \(\kappa_{ij}.\) To do so, we will use a sequence of inner_join() calls to append \(\kappa\) columns for the batter and for the runners on first, second, and third to our baserunning data table. Note that whenever there is no baserunner on first base (i.e., on_1b is NA), we set the corresponding \(\kappa\) to 0. Because we want to divide all of \(\eta_{i}\) amongst the base runners, we need to normalize the \(\kappa_{ij}\) values to sum to 1 within each at-bat. The columns norm_batter, norm_1b, norm_2b, and norm_3b contain these normalized weights.
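The normalization step amounts to dividing each \(\kappa_{ij}\) by the row total. A minimal sketch, assuming the four \(\kappa\) values are already available as numeric vectors (the helper normalize_kappa() and the numbers passed to it are hypothetical):

```r
# Hypothetical helper: turn per-runner kappa values into weights that sum
# to 1 within each at-bat (or to 0 when every kappa in the row is 0).
normalize_kappa <- function(kappa_batter, kappa_1b, kappa_2b, kappa_3b) {
  kappa <- cbind(kappa_batter, kappa_1b, kappa_2b, kappa_3b)
  colnames(kappa) <- c("norm_batter", "norm_1b", "norm_2b", "norm_3b")
  total <- rowSums(kappa)
  norm <- kappa / total        # divides each row by its own total
  norm[total == 0, ] <- 0      # avoid 0/0 when nobody has positive kappa
  as.data.frame(norm)
}

# Made-up kappa values for three at-bats.
w <- normalize_kappa(kappa_batter = c(0.90, 0.50, 0),
                     kappa_1b     = c(0.60, 0,    0),
                     kappa_2b     = c(0.93, 0,    0),
                     kappa_3b     = c(0,    0,    0))
round(w, 3)
rowSums(w)
```

In the first made-up at-bat the runner from second has the largest \(\kappa\) and therefore the largest weight; in the second, the batter is the only runner and receives all of the weight.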
To illustrate our calculations so far, let’s look at Ohtani’s at-bats from that game against the Padres. First, we see that Ohtani reached first base in all except his fourth at-bat. So, for these four at-bats, his mvt_batter value is 1.
# A tibble: 5 × 3
  at_bat_number mvt_batter des
          <int>      <dbl> <chr>
1             2          1 Shohei Ohtani grounds into a force out, shortstop Ha…
2            18          1 Shohei Ohtani singles on a sharp line drive to right…
3            37          1 Shohei Ohtani grounds into a force out, third basema…
4            52          0 Shohei Ohtani grounds out softly, pitcher Wandy Pera…
5            65          1 Shohei Ohtani singles on a line drive to left fielde…
In his second at-bat, Ohtani singled with no runners on. So, he should get credit for creating all the base running run value above average on that at-bat. In contrast, we argued earlier that when he drove in a run in his fifth at-bat, the runner who scored from second should get a bit more credit than Ohtani and the runner on first, who only advanced one base. Looking at the weights norm_1b, norm_2b, and norm_batter for this at-bat, we see that indeed, we’re assigning a bit more weight to the runner on second than the runner on first.
# A tibble: 5 × 9
  at_bat_number mvt_1b mvt_2b mvt_3b mvt_batter norm_1b norm_2b norm_3b
          <int>  <dbl>  <dbl>  <dbl>      <dbl>   <dbl>   <dbl>   <dbl>
1             2      0     NA     NA          1   0.491   0           0
2            18     NA     NA     NA          1   0       0           0
3            37      0     NA     NA          1   0.491   0           0
4            52     NA     NA     NA          0   0       0           0
5            65      1      2     NA          1   0.247   0.382       0
# ℹ 1 more variable: norm_batter <dbl>
Recall that \(\eta_{i}\) represents the run value above average generated in at-bat \(i\) due to base running. Whenever there is a runner on first at the start of the at-bat, the quantity \(\left(\kappa_{i,\textrm{1b}}/\sum_{j}{\kappa_{ij}}\right) \times \eta_{i}\) reflects the run value above average generated in the at-bat due to the base running of the runner initially on first. For each player, we can aggregate these values across all at-bats in which they are on first.
Finally, we can aggregate the total run value above average that each player creates from their base running, which we call \(\textrm{RAA}^{\textrm{br}}.\)
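The aggregation can be sketched by stacking each runner's weighted share of \(\eta_{i}\) into one long table and summing by player. The example below is a simplified illustration on made-up data: it tracks only the batter and the runner on first (the runners on second and third would be stacked in the same way), and the player IDs, weights, and \(\eta\) values are invented.

```r
# Made-up at-bats with normalized weights for the batter and runner on first.
toy <- data.frame(
  batter      = c(101, 102, 101),
  on_1b       = c(NA, 101, 103),
  eta         = c(0.10, -0.20, 0.30),
  norm_batter = c(1, 0.5, 0.4),
  norm_1b     = c(0, 0.5, 0.6)
)

# One row per (player, at-bat) holding that player's share of eta.
long <- rbind(
  data.frame(player = toy$batter, raa_br = toy$norm_batter * toy$eta),
  data.frame(player = toy$on_1b,  raa_br = toy$norm_1b * toy$eta)
)
long <- long[!is.na(long$player), ]   # drop rows with no runner on first

# Season total base running run value above average for each player.
raa_br <- aggregate(raa_br ~ player, data = long, FUN = sum)
raa_br

# Because the weights sum to 1 in each at-bat, nothing is lost or created:
sum(raa_br$raa_br)
sum(toy$eta)
```

The final two sums agree, which is the base running counterpart of the conservation idea: every unit of \(\eta_{i}\) is assigned to exactly one of the runners involved.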
We’ve distributed the run value \(\delta_{i}\) created in each at-bat between the batter and base runners and computed season-total run values above average based on batting, \(\textrm{RAA}^{\textrm{b}},\) and base running, \(\textrm{RAA}^{\textrm{br}}.\) Next lecture, we will distribute \(-\delta_{i}\) between the pitcher and fielders involved in each at-bat. So that we don’t have to repeat our earlier calculations, we will save raa_br and raa_b.
Baumer, Benjamin S., Shane T. Jensen, and Gregory J. Matthews. 2015. “openWAR: An Open Source System for Evaluating Overall Player Performance in Major League Baseball.” Journal of Quantitative Analysis in Sports 11 (2).
Footnotes
[1] Starting in 2023, Major League Baseball implemented a pitch timer. Batters who are not in the batter’s box and alert to the pitcher by the 8-second mark of the timer are penalized with an automatic strike.
[2] When this happens, Statcast usually records it as a truncated plate appearance (truncated_pa).