Lecture 7: Offensive Credit Allocation in Baseball

Overview

In Lecture 6, we computed the run value created by the offensive team in each at-bat of the 2024 MLB regular season. Run value is the sum of (i) the number of runs scored in the at-bat and (ii) the change in the number of runs the batting team is expected to score in the remainder of the half-inning. This change in expected runs is driven by the change in the combination of the number of outs and baserunner configuration. We then ranked players based on their run value totals, aggregating over all their at-bats. While the resulting rankings did appear to pass the “eye test” — both Aaron Judge and Shohei Ohtani created some of the largest run values — the metric implicitly gives batters all the credit for creating run value.

Over the next two lectures, we will develop our own version of wins above replacement. Our development largely follows that of Baumer, Jensen, and Matthews (2015) but with some important differences.

Conservation of Runs

The central idea — what (Baumer, Jensen, and Matthews 2015) call the “conservation of runs” framework — is that if the batting team gains \(\delta_{i}\) units of run value during an at-bat, the fielding team gains \(-\delta_{i}\) units of run value during that same at-bat. In this lecture, we will apportion \(\delta_{i}\) between the batters (Section 4) and the baserunners involved in at-bat \(i\) (Section 5) In Lecture 8, we will apportion \(-\delta_{i}\) between the pitcher and fielders involved in at-bat \(i.\)

Data Preparation

To divide up offensive run value, we need to create a data table whose rows correspond to individual at-bats. This data table must, at a minimum, contain the starting and ending outs and baserunner configurations as well as the identities of the baserunners at the start and end of the at-bat. We will also want to include the columns event and des, which record the events and a narrative description of what happened in the at-bat.

load("statcast2024.RData")
load("runValue2024.RData")
load("player2024_lookup.RData")
raw_atbat2024 <- 
  statcast2024 |>
  dplyr::group_by(game_pk, inning, inning_topbot) |>
  dplyr::arrange(at_bat_number, pitch_number) |>
  dplyr::mutate(
    next_Outs = dplyr::lead(Outs),
    next_BaseRunner = dplyr::lead(BaseRunner),
    next_on_1b = dplyr::lead(on_1b),
    next_on_2b = dplyr::lead(on_2b),
    next_on_3b = dplyr::lead(on_3b)) |>
  dplyr::ungroup() |>
  dplyr::group_by(game_pk, at_bat_number) |>
  dplyr::arrange(pitch_number) |>
  dplyr::mutate(
    end_Outs = dplyr::last(next_Outs),
    end_BaseRunner = dplyr::last(next_BaseRunner), # 
    end_on_1b = dplyr::last(next_on_1b),
    end_on_2b = dplyr::last(next_on_2b), 
    end_on_3b = dplyr::last(next_on_3b),  
    end_events = dplyr::last(events)) |>
  dplyr::ungroup() |>
  dplyr::filter(pitch_number == 1) |>
  dplyr::arrange(game_date, game_pk, at_bat_number, pitch_number) |>
  dplyr::mutate(end_bat_score = bat_score + RunsScored, end_fld_score = fld_score,
         end_Outs = ifelse(is.na(end_Outs), 3, end_Outs)) |>
  dplyr::select(game_date, game_pk, at_bat_number, inning, inning_topbot,
         Outs, BaseRunner, batter, on_1b, on_2b, on_3b, bat_score, fld_score, 
         end_Outs, end_BaseRunner, end_on_1b, end_on_2b, end_on_3b, end_bat_score, end_fld_score,
         end_events, des) |>
  dplyr::inner_join(y = runValue2024, by = c("game_pk", "at_bat_number"))
1
Divide by game and half-inning
2
Gets the value of several variables from the next pitch in the half-inning
3
Goes to last pitch of each at bat and gets next value of variable. That is, the starting value of the first pitch in the next at-bat.
4
For instance, this looks up who’s on first at end of the current at-bat/start of the next at-bat.
5
The variable events tells us what happened during the plate-appearance

Dealing with Missing Events

The column end_events in our data table raw_atbat2024 records what happened as a result of the at-bat. There are 308 rows with a missing entry.

table(raw_atbat2024$end_events)

                                     catcher_interf                    double 
                      308                        97                      7608 
              double_play               field_error                 field_out 
                      336                      1093                     72233 
          fielders_choice       fielders_choice_out                 force_out 
                      373                       306                      3408 
grounded_into_double_play              hit_by_pitch                  home_run 
                     3152                      1977                      5326 
                 sac_bunt                   sac_fly       sac_fly_double_play 
                      446                      1222                        13 
                   single                 strikeout     strikeout_double_play 
                    25363                     40145                       107 
                   triple               triple_play              truncated_pa 
                      685                         1                       304 
                     walk 
                    14029 

The column des includes a much more detailed description of what happened during the plate appearance. A cursory look through the values of des corresponding to rows with missing end_events reveals that several of these at-bats ended with a walk, involved an automatic strike1, or an inning-ending pick off2

raw_atbat2024 |>
  dplyr::filter(end_events == "") |>
  dplyr::slice_head(n = 15) |>
  dplyr::select(Outs, end_Outs, des)
# A tibble: 15 × 3
    Outs end_Outs des                                                           
   <int>    <dbl> <chr>                                                         
 1     0        0 Mookie Betts walks.                                           
 2     2        2 Freddie Freeman walks.                                        
 3     2        3 Xander Bogaerts strikes out on automatic strike.              
 4     2        2 Héctor Neris intentionally walks Wyatt Langford.              
 5     2        2 Logan Webb intentionally walks Ha-Seong Kim.                  
 6     2        2 Cole Ragans intentionally walks Carlos Santana.               
 7     1        2 Andrew Vaughn strikes out on automatic strike.                
 8     2        3 Oneil Cruz strikes out on automatic strike.                   
 9     2        3 Pitcher Bryce Miller picks off Wilyer Abreu at on throw to sh…
10     1        1 Yohan Ramírez intentionally walks Christian Yelich.           
11     1        2 Alex Kirilloff strikes out on automatic strike.               
12     1        1 Tony Kemp walks. James McCann to 2nd.                         
13     2        3 With Anthony Rendon batting, Zach Neto picked off and caught …
14     2        3 Miguel Sanó strikes out on automatic strike.                  
15     2        3 With Vinnie Pasquantino batting, Bobby Witt Jr. picked off an…

The following code manually corrects the missing values for end_events

atbat2024 <-
  raw_atbat2024 |>
  dplyr::mutate(
    end_events = dplyr::case_when(
      end_events == "" & grepl("walk", des) ~ "walk",
      end_events == "" & grepl("strikes out", des) ~ "strikeout",
      end_events == "" & end_Outs == 3 ~ "truncated_pa",
      end_events == "" & grepl("flies out", des) ~ "field_out",
      .default = end_events))
1
After accounting for the walks and strike outs on automatic strikes, all but one of the at-bats that still had a missing end_events value involved a pick-off that ended the inning
2
The remaining at-bat involved a fly out that was caught in foul territory.

Adjusted Run Values

We want to give credit to the batter and base runners for creating value over and above what would been expected given the game state and the actual outcome of the at-bat. More precisely, recall that \(\delta_{i}\) is the run value created in at-bat \(i.\) We will denote the game state at the beginning of the at-bat with \(\textrm{g}_{i}\) and the ending event with \(\textrm{e}_{i}.\) We form the game state variable \(\textrm{g}\) by concatenating the Outs and BaseRunners and separating them with a period so that \(\textrm{g} = "0.101"\) corresponds to a situation with no outs and runners on first and third base.

atbat2024 <-
  atbat2024 |>
  dplyr::mutate(GameState = paste(Outs, BaseRunner, sep = "."))

We will assume that the run value created in each at-bat beginning in state \(\textrm{g}\) and ending with event \(\textrm{e}\) is equal to the average run value created in all at-bats with the same beginning and end plus some mean-zero error. That is, for each at-bat \(i\), \[ \delta_{i} = \mathbb{E}[\delta \vert \textrm{g} = \textrm{g}_{i}, \textrm{e} = \textrm{e}_{i}] + \varepsilon_{i}, \] The average run value \(\mu:=\mathbb{E}[\delta \vert \textrm{g}, \textrm{e}]\) represents the average run value created in at-bats that begin in state \(\textrm{g}\) and end with the event \(\textrm{e}.\)

It is tempting to compute the expectation \(\mathbb{E}[\delta \vert \textrm{g}, \textrm{e}]\) using the “binning-and-averaging” approach we took when developing our initial XG models back in Lecture 2. Unfortunately, such a procedure is liable to yield extreme and erratic answers as the number of bins is quite large. To wit, there are 24 distinct game states (i.e., combinations of outs and base runners) and 21 different events.

The 2024 dataset contains only 373 of the 504 total combinations of game state and ending event. Of the observed combinations, there is a huge disparity in the relative frequencies. Some combinations (e.g., triples with no outs and runners on second and third) occurred just once while others (e.g., striking out with no outs and nobody on) occurred close to 10,000 times.

atbat2024 |>
  dplyr::count(Outs, BaseRunner, end_events) |>
  dplyr::arrange(n) |>
  dplyr::slice(c(1:5, (dplyr::n()-4):dplyr::n()))
# A tibble: 10 × 4
    Outs BaseRunner end_events                n
   <int> <chr>      <chr>                 <int>
 1     0 001        catcher_interf            1
 2     0 001        fielders_choice_out       1
 3     0 011        triple                    1
 4     0 101        strikeout_double_play     1
 5     0 101        triple_play               1
 6     1 000        strikeout              7490
 7     0 000        strikeout              9814
 8     2 000        field_out             11563
 9     1 000        field_out             14573
10     0 000        field_out             20148

Instead of “binning and averaging”, like we did with our distance-based XG models in Lecture 3, we will fit a statistical model. A natural starting model asserts that there are numbers \(\alpha_{0.000}, \ldots, \alpha_{2.111}\) and \(\alpha_{\textrm{catcher\_interf}}, \ldots, \alpha_{\textrm{walk}}\) such that for all game states \(\textrm{g}\) and ending events \(\textrm{e},\) \[ \mathbb{E}[\delta \vert \textrm{g}, \textrm{e}] = \alpha_{\textrm{g}} + \alpha_{\textrm{e}}. \]

Under the assumed model, the average run value created by hitting a single when there are two outs and no runners on is \(\alpha_{\textrm{2.000}} + \alpha_{\textrm{single}}\) while the average run value created by hitting a single when there are no outs and runners on first and second is \(\alpha_{\textrm{0.110}} + \alpha_{\textrm{single}}.\)

Because we do not know the exact values of the \(\alpha_{g}\)’s and \(\alpha_{e}\)’s, we need to estimate them using our data. Perhaps the simplest way is by solving a least squares minimization problem \[ \hat{\boldsymbol{\alpha}} = \textrm{argmin} \sum_{i = 1}^{n}{(\delta_{i} - \alpha_{g_{i}} - \alpha_{e_{i}})^2}, \] where \(g_{i}\) and \(e_{i}\) record the game state and event of at-bat \(i.\)

Solving this problem is equivalent to fitting a linear regression model without an intercept3. We can do this in R using the lm() function and including -1 in the formula argument4. In the following code, we create a temporary data frame that extracts just the run values \(\delta\), game states \(\textrm{g}\), and ending events \(\textrm{e}\) from atbats2024 and convert the game state and event variables into factors.

tmp_df <-
  atbat2024 |>
  dplyr::select(RunValue, GameState, end_events) |>
  dplyr::mutate(
    GameState = factor(GameState),
    end_events = factor(end_events))

state_event_fit <-
  lm(RunValue ~ -1 + GameState + end_events, data = tmp_df)

Using the summary() function, we can take a quick look at the \(\alpha\)’s.

summary(state_event_fit)

Call:
lm(formula = RunValue ~ -1 + GameState + end_events, data = tmp_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.65924 -0.10891  0.01957  0.08867  2.24020 

Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)    
GameState0.000                       0.344318   0.024078  14.300  < 2e-16 ***
GameState0.001                       0.240733   0.027701   8.690  < 2e-16 ***
GameState0.010                       0.392981   0.024448  16.074  < 2e-16 ***
GameState0.011                       0.272952   0.026280  10.386  < 2e-16 ***
GameState0.100                       0.391419   0.024160  16.201  < 2e-16 ***
GameState0.101                       0.237505   0.025347   9.370  < 2e-16 ***
GameState0.110                       0.407975   0.024500  16.652  < 2e-16 ***
GameState0.111                       0.369993   0.025731  14.379  < 2e-16 ***
GameState1.000                       0.348466   0.024094  14.463  < 2e-16 ***
GameState1.001                       0.253997   0.024902  10.200  < 2e-16 ***
GameState1.010                       0.354195   0.024327  14.560  < 2e-16 ***
GameState1.011                       0.265455   0.025049  10.597  < 2e-16 ***
GameState1.100                       0.396792   0.024121  16.450  < 2e-16 ***
GameState1.101                       0.332078   0.024655  13.469  < 2e-16 ***
GameState1.110                       0.412645   0.024311  16.974  < 2e-16 ***
GameState1.111                       0.359217   0.024860  14.449  < 2e-16 ***
GameState2.000                       0.356264   0.024096  14.785  < 2e-16 ***
GameState2.001                       0.346744   0.024549  14.124  < 2e-16 ***
GameState2.010                       0.318496   0.024256  13.131  < 2e-16 ***
GameState2.011                       0.321323   0.024888  12.911  < 2e-16 ***
GameState2.100                       0.351540   0.024132  14.567  < 2e-16 ***
GameState2.101                       0.363065   0.024474  14.835  < 2e-16 ***
GameState2.110                       0.354515   0.024269  14.608  < 2e-16 ***
GameState2.111                       0.338946   0.024651  13.750  < 2e-16 ***
end_eventsdouble                     0.398473   0.024206  16.461  < 2e-16 ***
end_eventsdouble_play               -1.293011   0.027320 -47.328  < 2e-16 ***
end_eventsfield_error                0.097087   0.025100   3.868  0.00011 ***
end_eventsfield_out                 -0.589994   0.024071 -24.510  < 2e-16 ***
end_eventsfielders_choice            0.342555   0.027015  12.680  < 2e-16 ***
end_eventsfielders_choice_out       -0.963020   0.027678 -34.794  < 2e-16 ***
end_eventsforce_out                 -0.713014   0.024403 -29.219  < 2e-16 ***
end_eventsgrounded_into_double_play -1.195243   0.024442 -48.901  < 2e-16 ***
end_eventshit_by_pitch              -0.001119   0.024636  -0.045  0.96376    
end_eventshome_run                   1.043303   0.024271  42.985  < 2e-16 ***
end_eventssac_bunt                  -0.443274   0.026600 -16.664  < 2e-16 ***
end_eventssac_fly                   -0.341349   0.025120 -13.589  < 2e-16 ***
end_eventssac_fly_double_play       -0.887624   0.070048 -12.672  < 2e-16 ***
end_eventssingle                     0.115000   0.024099   4.772 1.83e-06 ***
end_eventsstrikeout                 -0.618027   0.024083 -25.662  < 2e-16 ***
end_eventsstrikeout_double_play     -1.041349   0.033231 -31.337  < 2e-16 ***
end_eventstriple                     0.704276   0.025700  27.404  < 2e-16 ***
end_eventstriple_play               -2.133826   0.238222  -8.957  < 2e-16 ***
end_eventstruncated_pa              -0.625189   0.026806 -23.323  < 2e-16 ***
end_eventswalk                      -0.024163   0.024135  -1.001  0.31675    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2369 on 178488 degrees of freedom
Multiple R-squared:  0.7682,    Adjusted R-squared:  0.7682 
F-statistic: 1.344e+04 on 44 and 178488 DF,  p-value: < 2.2e-16

We estimate \(\hat{\alpha}_{2.000} \approx 0.356\) and \(\hat{\alpha}_{single} \approx 0.115.\) So, according to our fitted model the average run value created by singling when there are two outs and no runners on is about \(0.471.\)

Statistical Significance & Model Assumptions

You’ll notice that summary() returns a lot of inferential output (e.g., standard errors, p-values). These are computed under an additional assumption that the true errors \(\varepsilon_{i}\) are independent and following a mean-zero normal distribution with constant variance. Since our main interest is prediction, we’re really not interested in checking whether, say, \(\alpha_{0:010}\) is statistically significantly different than zero. So, we will not check whether the usual multiple linear model assumptions nor will we attempt. If you did want to make inferential statements about our model parameters, you would need to first check that the multiple linear multiple assumptions are not grossly violated.

Equipped with our estimated model parameters, for each at-bat \(i,\) let \(\hat{\mu}_{i} = \hat{\alpha}_{\textrm{g}_{i}} + \hat{\alpha}_{\textrm{e}_{i}}\) and let \(\eta_{i} = \delta_{i} - \hat{\mu}_{i}.\) In terms of dividing credit between the batter and the base runner, we will follow Baumer, Jensen, and Matthews (2015) and attribute \(\hat{\mu}_{i}\) to the batter’s hitting in at-bat \(i\) and \(\eta_{i}\) to the base running in that at-bat. We will add columns to atbat2024 holding the values of \(\hat{\mu}\) (mu) and \(\eta\) (eta).

atbat2024$eta <- state_event_fit$residuals
atbat2024$mu <- state_event_fit$fitted.values
save(atbat2024, file = "atbat2024.RData")

Baserunning Run Value

Ohtani’s second hit against the Padres on March 20, 2024 was a single with runners on first and second and no outs. While Ohtani and the runner originally on first advanced one base, the runner originally on second scored. Because this latter runner advanced more than what might have been otherwise expected, it makes sense to give him a larger share of the \(\eta_{i}\) than to the first two runners, who only advanced one base on a single. Following (Baumer, Jensen, and Matthews 2015, sec. 3.2), the amount of base running run value \(\eta_{i}\) that we assign to base runner \(j\) in at-bat \(i\) will be proportional to \(\kappa_{ij} = \mathbb{P}(K < k_{ij} \vert \textrm{e}_{i}),\) where \(k_{ij}\) is the number of bases actually advanced by the base runner.

Essentially, \(\kappa_{ij}\) is the probability that a typical base runner advanced at most the \(k_{ij}\) bases advanced by base runner \(j\) in at-bat \(i\) following event \(\textrm{e}_{i}.\) If the base runner does worse than expected (e.g., not advancing from second on a single), then \(\kappa_{ij}\) will be very small. But if the base runner does better than expected (e.g., scoring from second on a single), then \(\kappa_{ij}\) will be larger. When computing \(\kappa_{ij}\) it is crucial that we condition on the actual ending event \(\textrm{e}_{i}.\) After all, while we may want to penalize a runner for not advancing from second on a single, we definitely don’t want to penalize a runner for not advancing from second following a strike out!

Baserunner Advancement

Unfortunately, StatCast does not compute the number of bases that each runner advances during each at-bat. The following code implements a function that determines the number of bases advanced by the runner on first (if any). It works by first checking whether there is anyone on 1b at the start of the at-bat. If so, it checks whether that player is on first, second, or third base at the end of the at-bat. If not, it parses the at-bat description contained in des and looks for a sentence containing the player’s name. If that sentence contains the words “out” or “caught stealing”, it sets the number of bases advanced to 0. But, if the sentence contains the word “score”, it sets the number of bases advanced to 3, since the runner scored from first.

load("player2024_lookup.RData")
#| label: mvt-1b-function
mvt_1b <- function(on_1b, Outs, bat_score,
                   end_on_1b, end_on_2b, end_on_3b, end_Outs, end_bat_score,
                   des){
  mvt <- NA
  if(!is.na(on_1b)){
    # there was someone on 1st base at the start of the at-bat
    if(!is.na(end_on_1b) & on_1b == end_on_1b) mvt <- 0
    if(!is.na(end_on_2b) & on_1b == end_on_2b) mvt <- 1
    if(!is.na(end_on_3b) & on_1b == end_on_3b) mvt <- 2

    if(is.na(mvt)){
      # either there are no baserunners at end of inning or
      # there are baserunners but none of them started on first
      # we need to parse the play
      # Start by grabbing the player name
      player_name <- player2024_lookup$Name[which(player2024_lookup$key_mlbam == on_1b)]
      # Start by splitting it a string
      play_split <- 
        stringr::str_split_1(
          string = stringi::stri_trans_general(des, "Latin-ASCII"),
          pattern = "(?<=[[:punct:]])\\s(?=[A-Z])")
      
      check <- sapply(play_split, FUN = grepl, pattern = player_name)
      if(any(check)){
        # found something with player name in it
        play <- play_split[check]
        if( any(grepl(pattern = "out", x = play) | grepl(pattern = "caught stealing", x = play))) mvt <- 0 # player got out
        else if(any(grepl(pattern = "score", x = play))) mvt <- 3 # player scored from 1st
      } else{
        # player name is not present in play description; and they're not on base
        # if they got caught stealing in the middle of the at-bat this may not be recorded
        # check if Outs < end_Outs
        if(end_Outs == 3 | Outs < end_Outs & bat_score == end_bat_score) mvt <- 0
      } 
    }
  } 
  return(mvt)
}
1
Runner remained on first, so they advanced 0 bases
2
Runner advanced 1 base (first to second)
3
Runner advanced 2 bases (first to third)

We similarly define functions to track the number of bases advanced by the runners on second and third base and by the batter. For brevity, we have folded the code.

Show code for computing the number of bases advanced by the batter and the runners on 2nd and 3rd base.
mvt_2b <- function(on_2b, Outs, bat_score,
                   end_on_2b, end_on_3b, end_Outs, end_bat_score,
                   des){
  mvt <- NA
  if(!is.na(on_2b)){
    # there was someone on 2nd base at the start of the at-bat
    if(!is.na(end_on_2b) & on_2b == end_on_2b) mvt <- 0 # runner remained on 2nd
    if(!is.na(end_on_3b) & on_2b == end_on_3b) mvt <- 1 # runner advanced to 3rd
    
    #if(end_Outs == 3) mvt <- 0 # inning ended ; there may be some edge cases here
    # e.g., in last at-bat there may be a wild pitch
    # https://www.espn.com/mlb/playbyplay/_/gameId/401568474 where runner scores and then batter gets out to end the inning
    
    if(is.na(mvt)){
      # either there are no baserunners at end of inning or
      # there are baserunners but none of them started on second
      # we need to parse the play
      # Start by grabbing the player name
      player_name <- player2024_lookup$Name[which(player2024_lookup$key_mlbam == on_2b)]
      # Start by splitting it a string
      play_split <- 
        stringr::str_split_1(string = stringi::stri_trans_general(des, "Latin-ASCII"),
                    pattern = "(?<=[[:punct:]])\\s(?=[A-Z])")
      
      check <- sapply(play_split, FUN = grepl, pattern = player_name)
      if(any(check)){
        # found something with player name in it
        play <- play_split[check]
        if( any(grepl(pattern = "out", x = play) | grepl(pattern = "caught stealing", x = play))) mvt <- 0 # player got out
        else if(any(grepl(pattern = "score", x = play))) mvt <- 2 # player scored from 2nd
      } else{
        # player name is not present in play description; and they're not on base
        # if they got caught stealing in the middle of the at-bat this may not be recorded
        # check if Outs < end_Outs
        if(end_Outs == 3 | Outs < end_Outs & bat_score == end_bat_score) mvt <- 0
      } 
    }
  } 
  return(mvt)
}


mvt_3b <- function(on_3b, Outs, bat_score,
                   end_on_3b, end_Outs, end_bat_score,
                   des){
  mvt <- NA
  if(!is.na(on_3b)){
    if(!is.na(end_on_3b) & on_3b == end_on_3b) mvt <- 0 # runner remained on 3rd
    
    if(is.na(mvt)){
      # either there are no baserunners at end of inning or
      # there are baserunners but none of them started on second
      # we need to parse the play
      # Start by grabbing the player name
      player_name <- player2024_lookup$Name[which(player2024_lookup$key_mlbam == on_3b)]
      play_split <- 
        stringr::str_split_1(string = stringi::stri_trans_general(des, "Latin-ASCII"),
                    pattern = "(?<=[[:punct:]])\\s(?=[A-Z])")
      check <- sapply(play_split, FUN = grepl, pattern = player_name)
      if(any(check)){
        # found something with player name in it
        play <- play_split[check]
        if( any(grepl(pattern = "out", x = play) | grepl(pattern = "caught stealing", x = play))) mvt <- 0 # player got out
        else if(any(grepl(pattern = "score", x = play))) mvt <- 1 # player scored from 3rd
      } else{
        # player name is not present in play description; and they're not on base
        # if they got caught stealing in the middle of the at-bat this may not be recorded
        # check if Outs < end_Outs
        if(end_Outs == 3 | Outs < end_Outs & bat_score == end_bat_score) mvt <- 0
      }
    }
  } 
  return(mvt)
}


mvt_batter <- function(batter, Outs, bat_score, end_on_1b, end_on_2b, end_on_3b, end_Outs, end_bat_score, des)
{
  mvt <- NA
  if(!is.na(end_on_1b) & batter == end_on_1b) mvt <- 1 # batter advanced to 1st
  else if(!is.na(end_on_2b) & batter == end_on_2b) mvt <- 2 # batter advanced to 2nd
  else if(!is.na(end_on_3b) & batter == end_on_3b) mvt <- 3 # batter advanced to 3rd
  else{
    # batter is not on base
    # look up player name
    player_name <- player2024_lookup$Name[which(player2024_lookup$key_mlbam == batter)]
    
    play_split <-
      stringr::str_split_1(string = stringi::stri_trans_general(des, "Latin-ASCII"),
                           pattern = "(?<=[[:punct:]])\\s(?=[A-Z])")
    
    check <- sapply(play_split, FUN = grepl, pattern = player_name)
    if(any(check)){
      # found something with player name in it
      play <- play_split[check]
      if( any(grepl(pattern = "out", x = play))) mvt <- 0 # player got out
      else if(any(grepl(pattern = "score", x = play) | grepl(pattern = "home", x = play))) mvt <- 4 # batter scored
      else if(end_Outs == 3 | Outs < end_Outs & bat_score == end_bat_score) mvt <- 0
      else mvt <- NA
    }
  }
  return(mvt)
}

We can now apply these functions to every row of our data frame.

Warning

The following code takes a few minutes to run.

baserunning <-
  atbat2024 |>
  dplyr::rowwise() |>
  dplyr::mutate(
    mvt_batter = mvt_batter(batter, Outs, bat_score, end_on_1b, end_on_2b, end_on_3b, end_Outs, end_bat_score, des),
    mvt_1b = mvt_1b(on_1b, Outs, bat_score, end_on_1b, end_on_2b, end_on_3b,end_Outs, end_bat_score, des),
    mvt_2b = mvt_2b(on_2b, Outs, bat_score, end_on_2b, end_on_3b, end_Outs, end_bat_score, des),
    mvt_3b = mvt_3b(on_3b, Outs, bat_score, end_on_3b, end_Outs, end_bat_score, des)) |>
  dplyr::ungroup() 

Cumulative Baserunning Probabilities

Now that we have computed the \(k_{ij}\)’s — that is, the number of bases each base runner advanced in each at-bat — we are ready to compute the cumulative base running probabilities \(\mathbb{P}(K \leq k \vert \textrm{e}).\) In the following code, we first group at-bats by the ending event and then compute the proportion of times that the baserunner advances at most \(k\) bases. We also set the cumulative probability to zero for situations when there isn’t a runner on a particular base.

br_batter_probs <-
  baserunning |>
  dplyr::group_by(end_events) |>
  dplyr::summarize(
    kappa_0 = mean(mvt_batter <= 0, na.rm = TRUE),
    kappa_1 = mean(mvt_batter <= 1, na.rm = TRUE),
    kappa_2 = mean(mvt_batter <= 2, na.rm = TRUE),
    kappa_3 = mean(mvt_batter <= 3, na.rm = TRUE),
    kappa_4 = mean(mvt_batter <= 4, na.rm = TRUE),
    kappa_NA = 0) |>
  tidyr::pivot_longer(cols = tidyr::starts_with("kappa_"),
                      names_to = "mvt_batter",
                      names_prefix = "kappa_",
                      values_to = "kappa_batter") |>
  dplyr::mutate(
    mvt_batter = ifelse(mvt_batter == "NA", NA, mvt_batter),
    mvt_batter = as.numeric(mvt_batter))
br_1b_probs <-
  baserunning |>
  dplyr::filter(!is.na(on_1b)) |>
  dplyr::group_by(end_events) |>
  dplyr::summarize(
    kappa_0 = mean(mvt_1b <= 0, na.rm = TRUE),
    kappa_1 = mean(mvt_1b <= 1, na.rm = TRUE),
    kappa_2 = mean(mvt_1b <= 2, na.rm = TRUE),
    kappa_3 = mean(mvt_1b <= 3, na.rm = TRUE),
    kappa_NA = 0) |>
  tidyr::pivot_longer(cols = tidyr::starts_with("kappa_"),
                      names_to = "mvt_1b",
                      names_prefix = "kappa_",
                      values_to = "kappa_1b") |>
  dplyr::mutate(mvt_1b = ifelse(mvt_1b == "NA", NA, mvt_1b),
                mvt_1b = as.numeric(mvt_1b))

br_2b_probs <-
  baserunning |>
  dplyr::filter(!is.na(on_2b)) |>
  dplyr::group_by(end_events) |>
  dplyr::summarize(
    kappa_0 = mean(mvt_2b <= 0, na.rm = TRUE),
    kappa_1 = mean(mvt_2b <= 1, na.rm = TRUE),
    kappa_2 = mean(mvt_2b <= 2, na.rm = TRUE),
    kappa_NA = 0) |>
  tidyr::pivot_longer(cols = tidyr::starts_with("kappa_"),
                      names_to = "mvt_2b",
                      names_prefix = "kappa_",
                      values_to = "kappa_2b") |>
  dplyr::mutate(mvt_2b = ifelse(mvt_2b == "NA", NA, mvt_2b),
                mvt_2b = as.numeric(mvt_2b))

br_3b_probs <-
  baserunning |>
  dplyr::filter(!is.na(on_3b)) |>
  dplyr::group_by(end_events) |>
  dplyr::summarize(
    kappa_0 = mean(mvt_3b <= 0, na.rm = TRUE),
    kappa_1 = mean(mvt_3b <= 1, na.rm = TRUE),
    kappa_NA = 0) |>
  tidyr::pivot_longer(cols = tidyr::starts_with("kappa_"),
                      names_to = "mvt_3b",
                      names_prefix = "kappa_",
                      values_to = "kappa_3b") |>
  dplyr::mutate(mvt_3b = ifelse(mvt_3b == "NA", NA, mvt_3b),
                mvt_3b = as.numeric(mvt_3b))

The table br_1b_probs contains the cumulative base running probabilities for runners who start on first base broken down by ending event. We find that in about 64.7% of singles, the runner on first advances one base or fewer while in 95.7% of singles, the runner on first advances two bases or fewer.

br_1b_probs |>
  dplyr::filter(end_events == "single")
# A tibble: 5 × 3
  end_events mvt_1b kappa_1b
  <chr>       <dbl>    <dbl>
1 single          0   0.0194
2 single          1   0.647 
3 single          2   0.957 
4 single          3   1     
5 single         NA   0     

Baserunning Runs Above Average

Now that we have the cumulative base running probabilities, we’re (finally) ready to compute \(\kappa_{ij}.\) To do so, we will use inner_join()’s to add columns to our baserunning data table with columns for the batter and runners on first, second, and third. Note, whenever there is no baserunner on first base (i.e., on_1b = NA), we will set the corresponding \(\kappa\) to 0. Because we want to divide all of \(\eta_{i}\) amongst the base runners, we need to normalize the \(\kappa_{ij}\) values to sum to 1 within each at-bat. The columns norm_batter, norm_1b, norm_2b, and norm_3b contain these normalized weights.

baserunning <-
  baserunning |>
  dplyr::inner_join(y = br_batter_probs, by = c("end_events", "mvt_batter")) |>
  dplyr::inner_join(y = br_1b_probs, by = c("end_events", "mvt_1b")) |>
  dplyr::inner_join(y = br_2b_probs, by = c("end_events", "mvt_2b")) |>
  dplyr::inner_join(y = br_3b_probs, by = c("end_events", "mvt_3b")) |>
  dplyr::mutate(
    total_kappa = kappa_batter + kappa_1b + kappa_2b + kappa_3b,
    norm_batter = kappa_batter/total_kappa,
    norm_1b = kappa_1b/total_kappa,
    norm_2b = kappa_2b/total_kappa,
    norm_3b = kappa_3b/total_kappa)

To illustrate our calculations so far, let’s look at Ohtani’s at-bats from that game against the Padres. First, we see that Ohtani reached first base in all except his fourth at-bat. So, for these four at-bats, his mvt_batter value is 1.

load("player2024_lookup.RData")
ohtani_id <- 
  player2024_lookup |>
  dplyr::filter(FullName == "Shohei Ohtani") |>
  dplyr::pull(key_mlbam)

baserunning |>
  dplyr::filter(game_pk == 745444 & batter == ohtani_id) |>
  dplyr::select(at_bat_number, mvt_batter, des)
# A tibble: 5 × 3
  at_bat_number mvt_batter des                                                  
          <int>      <dbl> <chr>                                                
1             2          1 Shohei Ohtani grounds into a force out, shortstop Ha…
2            18          1 Shohei Ohtani singles on a sharp line drive to right…
3            37          1 Shohei Ohtani grounds into a force out, third basema…
4            52          0 Shohei Ohtani grounds out softly, pitcher Wandy Pera…
5            65          1 Shohei Ohtani singles on a line drive to left fielde…

In his second at-bat, Ohtani singled with no runners on. So, he should get credit for creating all the base running run value above average on that at-bat. In contrast, we argued earlier that when he drove in a run in his fifth at-bat, the runner who scored from second should get a bit more credit than Ohtani and the runner on first, who only advanced one base. Looking at the weights norm_1b, norm_2b, and norm_batter for this at-bat, we see that indeed, we’re assigning a bit more weight to the runner on second than the runner on first.

baserunning |>
  dplyr::filter(game_pk == 745444 & batter == ohtani_id) |>
  dplyr::select(at_bat_number, mvt_1b, mvt_2b, mvt_3b, mvt_batter, norm_1b, norm_2b, norm_3b, norm_batter)
# A tibble: 5 × 9
  at_bat_number mvt_1b mvt_2b mvt_3b mvt_batter norm_1b norm_2b norm_3b
          <int>  <dbl>  <dbl>  <dbl>      <dbl>   <dbl>   <dbl>   <dbl>
1             2      0     NA     NA          1   0.491   0           0
2            18     NA     NA     NA          1   0       0           0
3            37      0     NA     NA          1   0.491   0           0
4            52     NA     NA     NA          0   0       0           0
5            65      1      2     NA          1   0.247   0.382       0
# ℹ 1 more variable: norm_batter <dbl>

Recall that \(\eta_{i}\) represents the run value above average generated in at-bat \(i\) due to base running. Whenever there is a runner on first at the start of the at-bat, the quantity \(\kappa_{i,\textrm{1b}}/\sum_{j}{\kappa_{ij}} \times \eta_{i}\) reflects the run value above average generated in the at-bat due to the base running of the runner initially on first. For each player, we can aggregate these values across all at-bats in which they are on first.

raa_1b <-
  baserunning |>
  dplyr::filter(!is.na(on_1b)) |>
  dplyr::mutate(RAA_1b = norm_1b * eta) |>
  dplyr::group_by(on_1b) |>
  dplyr::summarise(RAA_1b = sum(RAA_1b)) |>
  dplyr::rename(key_mlbam = on_1b)

We can similarly compute the run value above average created by the runners on second base and third base as well as by the batter.

Compute the baserunning runs above average created by batter and the runners on second and third bases
raa_2b <-
  baserunning |>
  dplyr::filter(!is.na(on_2b)) |>
  dplyr::mutate(RAA_2b = norm_2b * eta) |>
  dplyr::group_by(on_2b) |>
  dplyr::summarise(RAA_2b = sum(RAA_2b)) |>
  dplyr::rename(key_mlbam = on_2b)

raa_3b <-
  baserunning |>
  dplyr::filter(!is.na(on_3b)) |>
  dplyr::mutate(RAA_3b = norm_3b * eta) |>
  dplyr::group_by(on_3b) |>
  dplyr::summarise(RAA_3b = sum(RAA_3b)) |>
  dplyr::rename(key_mlbam = on_3b)

raa_batter <-
  baserunning |>
  dplyr::mutate(RAA_batter = norm_batter * eta) |>
  dplyr::group_by(batter) |>
  dplyr::summarise(RAA_batter = sum(RAA_batter)) |>
  dplyr::rename(key_mlbam = batter)

Finally, we can aggregate the total run value above average that each player creates from their base running, which we can \(\textrm{RAA}^{\textrm{br}}.\)

raa_br <-
  raa_batter |>
  dplyr::full_join(y = raa_1b, by = "key_mlbam") |>
  dplyr::full_join(y = raa_2b, by = "key_mlbam") |>
  dplyr::full_join(y = raa_3b, by = "key_mlbam") |>
  tidyr::replace_na(list(RAA_batter = 0, RAA_1b = 0, RAA_2b = 0, RAA_3b = 0)) |>
  dplyr::mutate(RAA_br = RAA_batter + RAA_1b + RAA_2b + RAA_3b) |>
  dplyr::inner_join(y = player2024_lookup, by = "key_mlbam") |>
  dplyr::select(Name, key_mlbam, RAA_br, RAA_batter, RAA_1b, RAA_2b, RAA_3b)

Interestingly, the player with the largest \(\textrm{RAA}^{\textrm{br}}\), Jose Ramirez, is known for his aggressive baserunning5.

Crediting Batters

We will credit \(\hat{\mu}_{i}\) to each batter. The following code block totals the \(\hat{\mu}_{i}\) for each batter.

batting <- 
  atbat2024 |>
  dplyr::select(batter, mu) |>
  dplyr::rename(key_mlbam = batter)

raa_b <-
  batting |>
  dplyr::group_by(key_mlbam) |>
  dplyr::summarise(RAA_b = sum(mu)) |>
  dplyr::left_join(y = player2024_lookup, by = "key_mlbam") |>
  dplyr::select(Name, key_mlbam, RAA_b)

We find a lot of very good batters among the players with highest \(\textrm{RAA}^{\textrm{b}}\)’s

raa_b |>
  dplyr::arrange(dplyr::desc(RAA_b)) |>
  dplyr::slice_head(n=10)
# A tibble: 10 × 3
   Name              key_mlbam RAA_b
   <chr>                 <int> <dbl>
 1 Aaron Judge          592450  85.1
 2 Shohei Ohtani        660271  70.8
 3 Juan Soto            665742  67.7
 4 Bobby Witt           677951  61.1
 5 Vladimir Guerrero    665489  47.3
 6 Gunnar Henderson     683002  44.8
 7 Brent Rooker         667670  44.3
 8 Yordan Alvarez       670541  42.1
 9 Marcell Ozuna        542303  36.3
10 Ketel Marte          606466  35.2

Looking Ahead

We’ve distributed the run value \(\delta_{i}\) created in each at-bat between the batter and base runners and computed season total runs values above average based on batting \(\textrm{RAA}^{\textrm{b}}\) and base running \(\textrm{RAA}^{\textrm{br}}.\) Next lecture, we will distribute \(-\delta_{i}\) between the pitcher and fielders involved in each at-bat. So that we don’t have to repeat our earlier calculations, we will save raa_br and raa_b

save(raa_br, raa_b, file = "raa_offensive2024.RData")

References

Baumer, Benjamin S., Shane T. Jensen, and Gregory J. Matthews. 2015. openWAR: An Open Source System for Evaluating Overall Player Performance in Major League Baseball.” Journal of Quantitative Analysis in Sports 11 (2).

Footnotes

  1. Starting in 2023, Major League Baseball implemented a pitch timer. Batters who were not in the batter’s box and alert to the pitcher by the 8-second mark of the timer are penalized with an automatic strike. See the rules here.↩︎

  2. When this happens, Statcast usually records it as a truncated plate appearance (truncated_pa).↩︎

  3. Try proving this mathematically!↩︎

  4. Check out the documentation for R’s formula interface here. Specifically, look at the bullet point about the - operations under “Details”↩︎

  5. See this article about his base running from earlier this year.↩︎