STAT 479 Lecture 7

Offensive Credit Allocation

Overview

Recap

Expected runs: avg. runs scored in remainder of half-inning from game state
- Game state characterized by Outs & Baserunner configuration
- Baserunner strings: "101" means runner on 1st & 3rd, nobody on 2nd
- Estimated \(\rho(\textrm{o}, \textrm{br})\) using all at-bats in 2024
During at-bat game state changes from \((\textrm{o}_{\text{start}}, \textrm{br}_{\text{start}})\) to \((\textrm{o}_{\text{end}}, \textrm{br}_{\text{end}})\)

\[ \textrm{RunValue} = \textrm{RunsScored} +\rho(\textrm{o}_{\text{end}}, \textrm{br}_{\text{end}}) - \rho(\textrm{o}_{\text{start}}, \textrm{br}_{\text{start}}) \]

Several top batters created the most run value
Do batters deserve all the credit? How much credit to baserunners?

Road to WAR

Today: divide \(\textrm{RunValue}\) b/w batter & baserunner
Conservation of runs: if batting team creates \(\delta\) run value then fielding team creates \(-\delta\)

Ultimately, we’ll sum each player’s across across each phase
- Offensive: Batting & Baserunning (today)
- Defensive: Pitching & Fielding (Lecture 8)
We’ll then develop a version of wins above replacement (WAR)
- Introduce a roster-based definition of “replacement level”
- Estimate the performance of a replacement-level “shadow” for each player
- Convert runs to wins

Data Preparation (Overview)

Create a table with rows for each at-bat and columns for
- Starting & ending Outs and BaseRunner, RunValue
- batter, on_1b, on_2b, on_3b
- end_event & des: narrative description of what happened in at-bat

Some entries in end_events are missing


                                     catcher_interf                    double 
                      308                        97                      7608 
              double_play               field_error                 field_out 
                      336                      1093                     72233 
          fielders_choice       fielders_choice_out                 force_out 
                      373                       306                      3408 
grounded_into_double_play              hit_by_pitch                  home_run 
                     3152                      1977                      5326 
                 sac_bunt                   sac_fly       sac_fly_double_play 
                      446                      1222                        13 
                   single                 strikeout     strikeout_double_play 
                    25363                     40145                       107 
                   triple               triple_play              truncated_pa 
                      685                         1                       304 
                     walk 
                    14029

Additional Prep

See lecture notes for full code
grepl(pattern, x) return TRUE if pattern found in the string x

Missing Events
Manual Correction
GameState

# A tibble: 4 × 3
   Outs end_Outs des                                                            
  <int>    <dbl> <chr>                                                          
1     0        0 Mookie Betts walks.                                            
2     2        3 Xander Bogaerts strikes out on automatic strike.               
3     2        2 Logan Webb intentionally walks Ha-Seong Kim.                   
4     2        3 With Vinnie Pasquantino batting, Bobby Witt Jr. picked off and…

[1] "With Vinnie Pasquantino batting, Bobby Witt Jr. picked off and caught stealing 2nd base, first baseman Ryan Mountcastle."

atbat2024 <-
  raw_atbat2024 |>
  dplyr::mutate(
    end_events = dplyr::case_when(
      end_events == "" & grepl("walk", des) ~ "walk",
      end_events == "" & grepl("strikes out", des) ~ "strikeout",
      end_events == "" & end_Outs == 3 ~ "truncated_pa",
      end_events == "" & grepl("flies out", des) ~ "field_out",
      .default = end_events))

Concatenate Outs and Baserunner: e.g. `“0.101”: 0 outs, runners on 1st & 3rd

atbat2024 <-
  atbat2024 |>
  dplyr::mutate(GameState = paste(Outs, BaseRunner, sep = "."))

Preliminary Computations

Adjusted Run Value

For at-bat starting in state \(g\) and ending in \(e\), how much run value should we expect batting team to create?
- Expect more run value w/ fewer outs
- At-bats ending w/ home runs have more run value than those ending w/ strike outs
How do players perform over and above expectations?

\(\delta_{i} = \mathbb{E}[\delta \vert \textrm{g} = \textrm{g}_{i}, \textrm{e} = \textrm{e}_{i}] + \epsilon_{i}\)
Let \(\mu = \mathbb{E}[\delta \vert \textrm{g}, \textrm{e}]\):

Game State & Event Combinations

Can’t use simple grouped summary / “binning and averaging”

# A tibble: 10 × 3
   GameState end_events                n
   <chr>     <chr>                 <int>
 1 0.001     catcher_interf            1
 2 0.001     fielders_choice_out       1
 3 0.011     triple                    1
 4 0.101     strikeout_double_play     1
 5 0.101     triple_play               1
 6 1.000     strikeout              7490
 7 0.000     strikeout              9814
 8 2.000     field_out             11563
 9 1.000     field_out             14573
10 0.000     field_out             20148

A Model for \(\mu\)

\(\mathbb{E}[\delta \vert \textrm{g}, \textrm{e}] = \alpha_{\textrm{g}} + \alpha_{\textrm{e}}.\)
Introduce
- Game-state parameters: \(\alpha_{0.000}, \ldots, \alpha_{2.111}\)
- Event parameters: \(\alpha_{\textrm{catcher\_interf}}, \ldots, \alpha_{\textrm{walk}}\)
Avg. run value for hitting single w/ 2 outs and no runners on: \(\alpha_{\textrm{2.000}} + \alpha_{\textrm{single}}\)

We must estimate \(\alpha_{\textrm{g}}\)’s and \(\alpha_{\textrm{e}}\)’s

Estimating \(\alpha_{g}\) and \(\alpha_{e}\)

\[ \hat{\boldsymbol{\alpha}} = \textrm{argmin} \sum_{i = 1}^{n}{(\delta_{i} - \alpha_{g_{i}} - \alpha_{e_{i}})^2}, \]

Linear regression without an intercept

First step: data frame w/ \(\delta, \textrm{g}\) and \(\textrm{e}\)
Must convert \(\textrm{g}\) and \(\textrm{e}\) to factor variables

tmp_df <-
  atbat2024 |>
  dplyr::select(RunValue, GameState, end_events) |>
  dplyr::mutate(
    GameState = factor(GameState),
    end_events = factor(end_events))

Model Fitting

state_event_fit <- lm(RunValue ~ -1 + GameState + end_events, 
                      data = tmp_df)

Access estimates \(\hat{\alpha}_{g}\) and \(\hat{\alpha}_{e}\)’s w/ coef

alpha_hat <- coef(state_event_fit)
c(alpha_hat["GameState2.000"], alpha_hat["end_eventssingle"])

  GameState2.000 end_eventssingle 
       0.3562635        0.1150004

On average, a single with 2 outs and no runners on creates 0.471 run value

Baserunning Credit

Decomposing \(\delta\)

Let \(\hat{\mu}_{i} = \hat{\alpha}_{\textrm{g}_{i}} + \hat{\alpha}_{\textrm{e}_{i}}\)
- Estimated exp. run value for at-bat \(i\) based on starting state and ending event
Let \(\eta_{i} = \delta_{i} - \hat{\mu}_{i}\)

Following Baumer et al. (2015):
- Attribute \(\hat{\mu}_{i}\) to batter
- Divide \(\eta_{i}\) b/w base runners

atbat2024$eta <- state_event_fit$residuals
atbat2024$mu <- state_event_fit$fitted.values

Baserunning Expectations

On a single, we would expect:
- Batter to reach 1st
- Other base runners to advance 1 base
March 20, 2024: Ohtani singled in 0.110 against Padres:
- Ohtani reaches 1st, runner on 1st advances to 2nd
- Runner on 2nd scored
Instead of dividing \(\eta_{i}\) equally, we should give more to runner originally on 2nd

Weighted Allocation

Consider runner on base \(j\) in at-bat \(i\)
\(k_{ij}\): number of bases advanced by runner \(j\) in at-bat \(i\)
\(\textrm{e}_{i}\): ending event of at-bat \(i\)

Let \(\kappa_{ij} = \mathbb{P}(K < k_{ij} \vert \textrm{e}_{i})\)
Assign \(\eta_{i} \times \kappa_{ij}/\sum_{j'}{\kappa_{ij'}}\) to baserunner \(j\)
Why condition on ending event??

Computing \(k_{ij} (Overview)\)

See lecture notes for full code
Wrote functions to determine how many bases each runner (including batter) advanced
Compared starting and ending on_1b, on_2b, and on_3b
Also checked des to see if any player scored or was run out

Computing \(\kappa_{ij}\) (Code)

baserunning <-
  atbat2024 |>
  dplyr::rowwise() |>
  dplyr::mutate(
    mvt_batter = 
      mvt_batter(batter, Outs, bat_score, 
                 end_on_1b, end_on_2b, end_on_3b, 
                 end_Outs, end_bat_score, des),
    mvt_1b = 
      mvt_1b(on_1b, Outs, bat_score, 
             end_on_1b, end_on_2b, end_on_3b,
             end_Outs, end_bat_score, des),
    mvt_2b = 
      mvt_2b(on_2b, Outs, bat_score, 
             end_on_2b, end_on_3b, 
             end_Outs, end_bat_score, des),
    mvt_3b = 
      mvt_3b(on_3b, Outs, bat_score, 
             end_on_3b, end_Outs, end_bat_score, des)) |>
  dplyr::ungroup()

Cumulative Baserunning Probabilities

We have \(k_{ij}\) for all baserunners involved in at-bat \(i\)
Need to compute \(\kappa_{ij} = \mathbb{P}(K < k_{ij} \vert \textrm{e}_{i})\)

br_1b_probs <-
  baserunning |>
  dplyr::filter(!is.na(on_1b)) |>
  dplyr::group_by(end_events) |>
  dplyr::summarize(
    kappa_0 = mean(mvt_1b <= 0, na.rm = TRUE),
    kappa_1 = mean(mvt_1b <= 1, na.rm = TRUE),
    kappa_2 = mean(mvt_1b <= 2, na.rm = TRUE),
    kappa_3 = mean(mvt_1b <= 3, na.rm = TRUE),
    kappa_NA = 0) |>
  tidyr::pivot_longer(cols = tidyr::starts_with("kappa_"),
                      names_to = "mvt_1b",
                      names_prefix = "kappa_",
                      values_to = "kappa_1b") |>
  dplyr::mutate(mvt_1b = ifelse(mvt_1b == "NA", NA, mvt_1b),
                mvt_1b = as.numeric(mvt_1b))

After a single, runner on first advances
- 0 bases w/ prob 2%; 1 base w/ prob 63% (0.647-0.0194)
- 2 bases w/ prob 31%; 3 bases w/ prob 4%

br_1b_probs |> dplyr::filter(end_events == "single")

# A tibble: 5 × 3
  end_events mvt_1b kappa_1b
  <chr>       <dbl>    <dbl>
1 single          0   0.0194
2 single          1   0.647 
3 single          2   0.957 
4 single          3   1     
5 single         NA   0

After a strikeout, runner on first advances
- 0 bases w/ prob 93%; 1 base w/ prob 6.4%
- 2 bases w/ prob 0.6%; 3 bases w/ prob 0%

br_1b_probs |> dplyr::filter(end_events == "strikeout")

# A tibble: 5 × 3
  end_events mvt_1b kappa_1b
  <chr>       <dbl>    <dbl>
1 strikeout       0    0.930
2 strikeout       1    0.994
3 strikeout       2    1.000
4 strikeout       3    1    
5 strikeout      NA    0

Baserunning Weights

Remember: assign \(\eta_{i} \times \frac{\kappa_{ij}}{\sum_{j'}{\kappa_{ij'}}}\) to runner \(j\) in at-bat \(i\)
Add columns for \(\kappa\) to baserunning

baserunning <-
  baserunning |>
  dplyr::inner_join(y = br_batter_probs, by = c("end_events", "mvt_batter")) |>
  dplyr::inner_join(y = br_1b_probs, by = c("end_events", "mvt_1b")) |>
  dplyr::inner_join(y = br_2b_probs, by = c("end_events", "mvt_2b")) |>
  dplyr::inner_join(y = br_3b_probs, by = c("end_events", "mvt_3b")) |>
  dplyr::mutate(
    total_kappa = kappa_batter + kappa_1b + kappa_2b + kappa_3b,
    norm_batter = kappa_batter/total_kappa,
    norm_1b = kappa_1b/total_kappa,
    norm_2b = kappa_2b/total_kappa,
    norm_3b = kappa_3b/total_kappa)

Illustration

Ohtani’s At-Bats
\(\kappa\)’s
Weights

1st, 2nd, 3rd, & 5th at-bat: Ohtani reaches 1st (mvt_batter = 1)
5th at-bat: runner on second scored & should get more credit

# A tibble: 5 × 3
  mvt_batter end_events des                                                     
       <dbl> <chr>      <chr>                                                   
1          1 force_out  Shohei Ohtani grounds into a force out, shortstop Ha-Se…
2          1 single     Shohei Ohtani singles on a sharp line drive to right fi…
3          1 force_out  Shohei Ohtani grounds into a force out, third baseman T…
4          0 field_out  Shohei Ohtani grounds out softly, pitcher Wandy Peralta…
5          1 single     Shohei Ohtani singles on a line drive to left fielder J…

[1] "Shohei Ohtani grounds into a force out, shortstop Ha-Seong Kim to second baseman Xander Bogaerts. Mookie Betts out at 2nd. Shohei Ohtani to 1st."
[2] "Shohei Ohtani grounds into a force out, third baseman Tyler Wade to shortstop Ha-Seong Kim. Mookie Betts out at 2nd. Shohei Ohtani to 1st."      
[3] "Shohei Ohtani singles on a line drive to left fielder José Azocar. Gavin Lux scores. Mookie Betts to 2nd."

1st & 3rd at-bat: runner on 1st forced out at 2nd
- On force outs, runner on 1st advances 0 bases about 95% of the time
5th at-bat: runner scores from 2nd (i.e., advances 2 bases)
- kappa_2b = 1: runner on 2nd always advances \(<= 2\) bases!

# A tibble: 5 × 8
  mvt_1b mvt_2b mvt_3b mvt_batter kappa_1b kappa_2b kappa_3b kappa_batter
   <dbl>  <dbl>  <dbl>      <dbl>    <dbl>    <dbl>    <dbl>        <dbl>
1      0     NA     NA          1    0.955        0        0        0.988
2     NA     NA     NA          1    0            0        0        0.974
3      0     NA     NA          1    0.955        0        0        0.988
4     NA     NA     NA          0    0            0        0        1    
5      1      2     NA          1    0.647        1        0        0.974

1st & 3rd at-bat: runner on first out at 2nd
- mvt_1b = 0 and norm_1b is less than norm_batter
5th at-bat: runner scores from 2nd & should get more credit
- norm_2b is larger than norm_1b and norm_batter

# A tibble: 5 × 4
  norm_1b norm_2b norm_3b norm_batter
    <dbl>   <dbl>   <dbl>       <dbl>
1   0.491   0           0       0.509
2   0       0           0       1    
3   0.491   0           0       0.509
4   0       0           0       1    
5   0.247   0.382       0       0.372

Baserunning Run Value

\(\eta_{i} \times \frac{\kappa_{ij}}{\sum_{j'}{\kappa_{ij'}}}\): run value created by baserunner \(j\) in at-bat \(i\)
For each player and base, sum contributions
\(\textrm{RAA}_{1b}\): run value created by baserunning from 1st base

raa_1b <-
  baserunning |>
  dplyr::filter(!is.na(on_1b)) |>
  dplyr::mutate(RAA_1b = norm_1b * eta) |>
  dplyr::group_by(on_1b) |>
  dplyr::summarise(RAA_1b = sum(RAA_1b)) |>
  dplyr::rename(key_mlbam = on_1b)

Computing \(\textrm{RAA}^{(\textrm{br})}\)

Summing over the four baserunning positions gives us \(\textrm{RAA}^{(\textrm{br})}\)
Baumer et al. (2015): run value created above average through baserunning

Code
Top-10

raa_br <-
  raa_batter |>
  dplyr::full_join(y = raa_1b, by = "key_mlbam") |>
  dplyr::full_join(y = raa_2b, by = "key_mlbam") |>
  dplyr::full_join(y = raa_3b, by = "key_mlbam") |>
  tidyr::replace_na(list(RAA_batter = 0, RAA_1b = 0, RAA_2b = 0, RAA_3b = 0)) |>
  dplyr::mutate(RAA_br = RAA_batter + RAA_1b + RAA_2b + RAA_3b) |>
  dplyr::inner_join(y = player2024_lookup, by = "key_mlbam") |>
  dplyr::select(Name, key_mlbam, RAA_br, RAA_batter, RAA_1b, RAA_2b, RAA_3b)

# A tibble: 10 × 6
   Name                RAA_br RAA_batter RAA_1b RAA_2b  RAA_3b
   <chr>                <dbl>      <dbl>  <dbl>  <dbl>   <dbl>
 1 Maikel Garcia        10.8    -0.126    5.80   2.42   2.68  
 2 Corbin Carroll        9.64    0.00504  5.89   2.64   1.11  
 3 Bobby Witt            9.13    0        4.51   3.63   0.986 
 4 Jake McCarthy         8.47    5.70     0      0.937  1.83  
 5 Dansby Swanson        8.27    3.83     3.28   0.551  0.619 
 6 Marcus Semien         8.00    2.87     4.88  -0.201  0.449 
 7 Pete Crow-Armstrong   7.55    5.06     3.33   1.47  -2.31  
 8 Javier Baez           7.27    8.17    -0.650 -0.625  0.378 
 9 Jose Ramirez          7.23   -2.03     3.55   2.53   3.18  
10 Anthony Volpe         7.22    3.42     2.33   1.57  -0.0997

Batting Run Value

\(\textrm{RAA}^{(b)}\)

Remember, we decomposed \(\delta_{i} = \hat{\mu}_{i} + \eta_{i}\)
Distributed \(\eta_{i}\) among all base runners involve in at-bat \(i\)

Code
Top-10

batting <- 
  atbat2024 |>
  dplyr::select(batter, mu) |>
  dplyr::rename(key_mlbam = batter)

raa_b <-
  batting |>
  dplyr::group_by(key_mlbam) |>
  dplyr::summarise(RAA_b = sum(mu)) |>
  dplyr::left_join(y = player2024_lookup, by = "key_mlbam") |>
  dplyr::select(Name, key_mlbam, RAA_b)

# A tibble: 10 × 3
   Name              key_mlbam RAA_b
   <chr>                 <int> <dbl>
 1 Aaron Judge          592450  85.1
 2 Shohei Ohtani        660271  70.8
 3 Juan Soto            665742  67.7
 4 Bobby Witt           677951  61.1
 5 Vladimir Guerrero    665489  47.3
 6 Gunnar Henderson     683002  44.8
 7 Brent Rooker         667670  44.3
 8 Yordan Alvarez       670541  42.1
 9 Marcell Ozuna        542303  36.3
10 Ketel Marte          606466  35.2

Looking Ahead

Divided run value \(\delta_{i}\) b/w batting & baserunning
Next time: divide \(-\delta_{i}\) b/w pitching & fielding