STAT 479: Lecture 6

Run Expectancy

Motivation

  • March 20, 2024 Dodgers vs Padres game
  • Shohei Ohtani hit two singles
    • 3rd inning, 2 outs and no runners on base, Ohtani singled into right field.
    • 8th inning, with 1 out and runners on first and second base, Ohtani singled into left center field, driving in one run
  • Which single was more valuable?
  • Second single scored a run
  • But first single put a runner on base with 2 outs
  • This lecture: a “currency” for evaluating plays that accounts for
    • Actual runs scored
    • Potential to score more runs
  • Lectures 7 & 8: apportioning offensive & defensive credit + WAR

History of Tracking Data in Baseball

  • 2006: Sportvision debuts camera system for tracking pitch trajectory
  • 2008: Sportvision releases PITCHf/x data to MLB to power GameDay app
  • 2008-2012(?): people realize GameDay API is publicly accessible & start scraping
  • 2017: PITCHf/x phased out in favor of radar-based Trackman
    • Trackman originally developed for golf
  • Statcast: ball & player tracking
  • Made available through BaseballSavant (also scrapable)

Statcast Data in R

  • Bill Petti’s baseballr package: download development version from GitHub
devtools::install_github(repo = "BillPetti/baseballr")
  • See course website for function annual_statcast_query()
    • Scrapes whole season’s Statcast data
    • Takes 30-45 minutes per season: run once & save
    • Modifies Petti’s original code to account for API changes (# fields, names of variables, etc.)
raw_statcast2024 <- annual_statcast_query(2024)
save(raw_statcast2024, file = "raw_statcast2024.RData")
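Because the scrape is slow, it can help to guard it so that it only runs when no saved copy exists. A minimal sketch, assuming a hypothetical helper load_or_build() (not part of baseballr or the course code) that stores a single object with saveRDS(); the demo uses a cheap stand-in function in place of the scraper so the snippet runs on its own.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise build it once and cache the result.
load_or_build <- function(path, build_fn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  obj <- build_fn()
  saveRDS(obj, file = path)
  obj
}

# Stand-in for the slow scraper so this sketch is self-contained.
n_calls <- 0
slow_scrape <- function() {
  n_calls <<- n_calls + 1
  data.frame(game_pk = c(1L, 1L, 2L), pitch_number = c(1L, 2L, 1L))
}

cache_path <- tempfile(fileext = ".rds")
first  <- load_or_build(cache_path, slow_scrape)  # builds and saves
second <- load_or_build(cache_path, slow_scrape)  # reads the saved copy
```

In the lecture's setting one would pass something like function() annual_statcast_query(2024) as build_fn.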

Exploring Statcast Data

Extent of Data

  • annual_statcast_query() loops over weeks in a year and picks up games from
    • Exhibition (E) & Spring Training (S)
    • Post-season: Wild Card (F), Division Series (D), League Championship Series (L), & World Series (W)
    • Regular season (R)
table(raw_statcast2024$game_type, useNA = 'always')

     D      F      L      R      S      W   <NA> 
  5182   2488   3540 695136  77056   1576      0 

Contextual Variables

  • Each row corresponds to a pitch
  • Numeric id for games (game_pk), at-bats w/in games (at_bat_number), & pitches w/in at-bats (pitch_number)
  • inning and inning_topbot
  • balls, strikes, outs_when_up
  • batter, on_1b, on_2b, on_3b: offensive player IDs
  • pitcher, fielder_2, …, fielder_9: defensive player IDs

Extracting Regular Season Data

  • We’ll focus on regular season games
  • Also remove pitches w/ obvious mistakes
statcast2024 <-
  raw_statcast2024 |> 
  dplyr::filter(game_type == "R") |>
  dplyr::filter(
    strikes >= 0 & strikes < 3 & 
      balls >= 0 & balls < 4 & 
      outs_when_up >= 0 & outs_when_up < 3) |>
  dplyr::arrange(game_date, game_pk, at_bat_number, pitch_number)

Pitch Outcome Variables

  • type: ball (B), strike (S), or contact (X)
  • description: pitch-level outcome
                         type
description                    B      S      X
  ball                    231032      0      0
  blocked_ball             14717      0      0
  bunt_foul_tip                0     15      0
  called_strike                0 113912      0
  foul                         0 126012      0
  foul_bunt                    0   1208      0
  foul_tip                     0   7218      0
  hit_by_pitch              1979      0      0
  hit_into_play                0      7 121744
  missed_bunt                  0    196      0
  pitchout                    52      0      0
  swinging_strike              0  73208      1
  swinging_strike_blocked      0   3834      0
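A two-way table like the one above can be built with base R's table(). A minimal sketch on a made-up handful of pitches; the toy data frame is an assumption, and the slides do not show which call produced the original table.

```r
# Toy pitch-level data (made up for illustration)
toy_pitches <- data.frame(
  description = c("ball", "called_strike", "foul", "hit_into_play", "ball"),
  type        = c("B", "S", "S", "X", "B")
)

# Naming the arguments labels the rows and columns of the table
outcome_tab <- table(description = toy_pitches$description,
                     type = toy_pitches$type)
outcome_tab
```

Cross-tabulations like this are a cheap way to spot oddities, such as the 7 hit_into_play pitches typed as S in the full-season table.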

Data Snapshot

# A tibble: 20 × 6
   at_bat_number pitch_number balls strikes type  des                           
           <int>        <int> <int>   <int> <chr> <chr>                         
 1             1            1     0       0 B     Mookie Betts walks.           
 2             1            2     1       0 S     Mookie Betts walks.           
 3             1            3     1       1 B     Mookie Betts walks.           
 4             1            4     2       1 B     Mookie Betts walks.           
 5             2            1     0       0 B     Shohei Ohtani grounds into a …
 6             2            2     1       0 S     Shohei Ohtani grounds into a …
 7             2            3     1       1 B     Shohei Ohtani grounds into a …
 8             2            4     2       1 X     Shohei Ohtani grounds into a …
 9             3            1     0       0 S     Freddie Freeman called out on…
10             3            2     0       1 S     Freddie Freeman called out on…
11             3            3     0       2 B     Freddie Freeman called out on…
12             3            4     1       2 S     Freddie Freeman called out on…
13             4            1     0       0 B     Will Smith flies out to left …
14             4            2     1       0 X     Will Smith flies out to left …
15             5            1     0       0 B     Xander Bogaerts flies out to …
16             5            2     1       0 S     Xander Bogaerts flies out to …
17             5            3     1       1 B     Xander Bogaerts flies out to …
18             5            4     2       1 S     Xander Bogaerts flies out to …
19             5            5     2       2 B     Xander Bogaerts flies out to …
20             5            6     3       2 X     Xander Bogaerts flies out to …
# A tibble: 20 × 6
   batter outs_when_up  on_1b on_2b on_3b des                                   
    <int>        <int>  <dbl> <dbl> <dbl> <chr>                                 
 1 605141            0     NA    NA    NA Mookie Betts walks.                   
 2 605141            0     NA    NA    NA Mookie Betts walks.                   
 3 605141            0     NA    NA    NA Mookie Betts walks.                   
 4 605141            0     NA    NA    NA Mookie Betts walks.                   
 5 660271            0 605141    NA    NA Shohei Ohtani grounds into a force ou…
 6 660271            0 605141    NA    NA Shohei Ohtani grounds into a force ou…
 7 660271            0 605141    NA    NA Shohei Ohtani grounds into a force ou…
 8 660271            0 605141    NA    NA Shohei Ohtani grounds into a force ou…
 9 518692            1 660271    NA    NA Freddie Freeman called out on strikes.
10 518692            1 660271    NA    NA Freddie Freeman called out on strikes.
11 518692            1 660271    NA    NA Freddie Freeman called out on strikes.
12 518692            1 660271    NA    NA Freddie Freeman called out on strikes.
13 669257            2 660271    NA    NA Will Smith flies out to left fielder …
14 669257            2 660271    NA    NA Will Smith flies out to left fielder …
15 593428            0     NA    NA    NA Xander Bogaerts flies out to right fi…
16 593428            0     NA    NA    NA Xander Bogaerts flies out to right fi…
17 593428            0     NA    NA    NA Xander Bogaerts flies out to right fi…
18 593428            0     NA    NA    NA Xander Bogaerts flies out to right fi…
19 593428            0     NA    NA    NA Xander Bogaerts flies out to right fi…
20 593428            0     NA    NA    NA Xander Bogaerts flies out to right fi…

Player IDs

  • Statcast uses MLB Advanced Media ID number for players
  • Will be useful to look up player names using IDs (and vice versa)
  • Chadwick Register maintains a database
chadwick_players <- baseballr::chadwick_player_lu()
save(chadwick_players, file = "chadwick_players.RData")
player2024_id <- 
  unique(
    c(statcast2024$batter, statcast2024$pitcher,
      statcast2024$on_1b, statcast2024$on_2b, statcast2024$on_3b,
      statcast2024$fielder_2, statcast2024$fielder_3,
      statcast2024$fielder_4, statcast2024$fielder_5,
      statcast2024$fielder_6, statcast2024$fielder_7,
      statcast2024$fielder_8, statcast2024$fielder_9))
player2024_lookup <-
  chadwick_players |>
  dplyr::filter(!is.na(key_mlbam) & key_mlbam %in% player2024_id) |>
  dplyr::mutate(
    FullName = paste(name_first, name_last), 
    Name = stringi::stri_trans_general(FullName, "Latin-ASCII")) 
save(player2024_lookup, file = "player2024_lookup.RData")
player2024_lookup |> 
  dplyr::filter(Name == "Shohei Ohtani") |>
  dplyr::pull(key_mlbam)
[1] 660271
player2024_lookup |>
  dplyr::filter(key_mlbam == 605141) |>
  dplyr::pull(Name)
[1] "Mookie Betts"

Batting Order

  • baseballr::mlb_batting_orders(): retrieves batting order for every game
baseballr::mlb_batting_orders(game_pk = 745444)
# A tibble: 18 × 8
       id fullName         abbreviation batting_order batting_position_num team 
    <int> <chr>            <chr>        <chr>         <chr>                <chr>
 1 605141 Mookie Betts     SS           1             0                    away 
 2 660271 Shohei Ohtani    DH           2             0                    away 
 3 518692 Freddie Freeman  1B           3             0                    away 
 4 669257 Will Smith       C            4             0                    away 
 5 571970 Max Muncy        3B           5             0                    away 
 6 606192 Teoscar Hernánd… RF           6             0                    away 
 7 681546 James Outman     CF           7             0                    away 
 8 518792 Jason Heyward    RF           8             0                    away 
 9 666158 Gavin Lux        2B           9             0                    away 
10 593428 Xander Bogaerts  2B           1             0                    home 
11 665487 Fernando Tatis … RF           2             0                    home 
12 630105 Jake Cronenworth 1B           3             0                    home 
13 592518 Manny Machado    DH           4             0                    home 
14 673490 Ha-Seong Kim     SS           5             0                    home 
15 595777 Jurickson Profar LF           6             0                    home 
16 669134 Luis Campusano   C            7             0                    home 
17 642180 Tyler Wade       3B           8             0                    home 
18 701538 Jackson Merrill  CF           9             0                    home 
# ℹ 2 more variables: teamName <chr>, teamID <int>

Player Positions

  • Lecture 7: compare batter’s performance to position-level average
  • Scrape batting orders & determine each player’s most frequent position
get_lineup <- function(game_pk){
  lineup <- baseballr::mlb_batting_orders(game_pk = game_pk)
  lineup <-
    lineup |>
    dplyr::mutate(game_pk = game_pk) |>
    dplyr::rename(key_mlbam = id, position = abbreviation) |>
    dplyr::select(game_pk, key_mlbam, position)
  return(lineup)
}
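# Naive approach: one request per game. A single failed request stops the loop.
all_lineups <- list()
unik_game_pk <- unique(statcast2024$game_pk)
for(i in seq_along(unik_game_pk)){
  all_lineups[[i]] <- get_lineup(game_pk = unik_game_pk[i])
}

# More robust approach: a failed request returns NULL instead of an error,
# and games are scraped in blocks of 500.
poss_get_lineup <- purrr::possibly(.f = get_lineup, otherwise = NULL) 
unik_game_pk <- unique(statcast2024$game_pk)

block_starts <- seq(1, length(unik_game_pk), by = 500)
# Each block ends just before the next one starts, so no game is requested twice
block_ends <- c(block_starts[-1] - 1, length(unik_game_pk))

all_lineups <- list()
for(b in seq_along(block_starts)){
  tmp <-
    purrr::map(.x = unik_game_pk[block_starts[b]:block_ends[b]], 
               .f = poss_get_lineup, 
               .progress = TRUE)
  all_lineups <- c(all_lineups, tmp)
}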

lineups2024 <- 
  dplyr::bind_rows(all_lineups) |>
  unique()
save(lineups2024, file = "lineups2024.RData")
positions2024 <-
  lineups2024 |>
  dplyr::group_by(key_mlbam, position) |>
  dplyr::summarise(n = dplyr::n(), .groups = "drop_last") |>
  dplyr::slice_max(order_by = n, n = 1, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::select(key_mlbam, position)
save(positions2024, file = "positions2024.RData")
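To see what the most-frequent-position step does, here is a minimal sketch on a made-up lineup table (the player IDs 10 and 20 and the toy games are assumptions for illustration). It uses dplyr::count() in place of group_by() plus summarise(), which should give the same counts.

```r
# Toy lineups: player 10 starts at SS twice and 2B once;
# player 20 starts at DH twice and RF once.
toy_lineups <- data.frame(
  game_pk   = c(1L, 2L, 3L, 1L, 2L, 3L),
  key_mlbam = c(10L, 10L, 10L, 20L, 20L, 20L),
  position  = c("SS", "SS", "2B", "RF", "DH", "DH")
)

toy_positions <-
  toy_lineups |>
  dplyr::count(key_mlbam, position, name = "n") |>   # games per player-position
  dplyr::group_by(key_mlbam) |>
  dplyr::slice_max(order_by = n, n = 1, with_ties = FALSE) |>  # keep the top one
  dplyr::ungroup() |>
  dplyr::select(key_mlbam, position)
toy_positions
```

With with_ties = FALSE, a player who started equally often at two positions is assigned just one of them, so ties are resolved arbitrarily by row order.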

Baserunner Configuration

  • on_1b, on_2b, and on_3b: tells us who is on base
  • Useful to encode configuration w/ 3 binary digits
    • 1st digit for first base, 2nd for second base, 3rd for third base
    • 101 for runners on 1st and 3rd
  • Also useful to rename outs_when_up
statcast2024 <-
  statcast2024 |>
  dplyr::mutate(
    BaseRunner = 
      paste0(1*(!is.na(on_1b)),1*(!is.na(on_2b)),1*(!is.na(on_3b)))) |>
  dplyr::rename(Outs = outs_when_up)
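A quick check of the encoding on a made-up set of four base states (the toy data frame is an assumption; the IDs are the ones that appear in the data snapshot):

```r
# Each row is one pitch; NA means the base is empty.
toy_bases <- data.frame(
  on_1b = c(NA, 605141, 605141, NA),
  on_2b = c(NA, NA, NA, 660271),
  on_3b = c(NA, NA, 518692, 518692)
)

toy_bases <-
  toy_bases |>
  dplyr::mutate(
    BaseRunner =
      paste0(1*(!is.na(on_1b)), 1*(!is.na(on_2b)), 1*(!is.na(on_3b))))

toy_bases$BaseRunner
# Bases empty, runner on 1st, runners on 1st & 3rd, runners on 2nd & 3rd
```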

Expected Runs

Definition

  • For each pitch, let \(R\) be the number of runs scored in the remainder of the half-inning
  • Let \(\textrm{o} \in \{0,1,2\}\) be the number of outs
  • Let \(\textrm{br} \in \{"000", "100", "010", "001", "110", "101", "011", "111"\}\) be the baserunner configuration
  • \(\rho(\textrm{o}, \textrm{br}) = \mathbb{E}[R \vert \textrm{o}, \textrm{br}]\)
  • Avg. number of runs team expects to score based on current game-state
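In practice the expectation is unknown and is estimated by a sample average. As a sketch (the symbols \(\mathcal{A}(\textrm{o}, \textrm{br})\) and \(R_{a}\) are notation introduced here, not taken from the slides): let \(\mathcal{A}(\textrm{o}, \textrm{br})\) be the set of at-bats that begin with \(\textrm{o}\) outs and baserunner configuration \(\textrm{br}\), and let \(R_{a}\) be the number of runs scored from the start of at-bat \(a\) to the end of its half-inning. Then
\[ \hat{\rho}(\textrm{o}, \textrm{br}) = \frac{1}{\lvert \mathcal{A}(\textrm{o}, \textrm{br}) \rvert} \sum_{a \in \mathcal{A}(\textrm{o}, \textrm{br})} R_{a} \]
This is the grouped mean that the later code computes from the first pitch of each at-bat.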

Runs Scored in Half-Inning

  • Suppose there are \(n_{a}\) pitches in at-bat \(a\)
  • \(R_{i,a}\): number of runs scored in half-inning after pitch \(i\) in at-bat \(a\)
  • Step 1: append a column to statcast2024 containing \(R_{i,a}\) values
  • Will utilize the following Statcast variables
    • bat_score: batting team score before the pitch is thrown
    • post_bat_score: batting team score after pitch is thrown

Illustration: Dodgers’ 8th Inning

Play-by-play

rbind(bat_score = dodgers_inning$bat_score, post_bat_score = dodgers_inning$post_bat_score)
               [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
bat_score         1    1    1    1    1    1    1    1    1     1     1     1
post_bat_score    1    1    1    1    1    1    1    1    1     1     1     1
               [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
bat_score          1     1     2     3     3     3     4     5     5     5
post_bat_score     1     2     3     3     3     4     5     5     5     5
               [,23] [,24] [,25]
bat_score          5     5     5
post_bat_score     5     5     5
dodgers_inning$des[c(14,15, 18, 19)]
[1] "Enrique Hernández out on a sacrifice fly to left fielder José Azocar. Max Muncy scores."                                                                                             
[2] "Gavin Lux reaches on a fielder's choice, fielded by first baseman Jake Cronenworth. Teoscar Hernández scores. James Outman to 2nd. Fielding error by first baseman Jake Cronenworth."
[3] "Mookie Betts singles on a ground ball to left fielder José Azocar. James Outman scores. Gavin Lux to 2nd."                                                                           
[4] "Shohei Ohtani singles on a line drive to left fielder José Azocar. Gavin Lux scores. Mookie Betts to 2nd."                                                                           
dplyr::last(dodgers_inning$post_bat_score) - dodgers_inning$bat_score
 [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 0 0 0 0 0 0
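The slides do not show how dodgers_inning was built. One plausible way, sketched here with a hypothetical helper get_half_inning() and made-up toy pitches, is to filter to a single half-inning and sort into playing order. The value "Top" for inning_topbot is an assumption about how Statcast codes the half-inning.

```r
# Hypothetical helper: pull the pitches of one half-inning, in playing order.
get_half_inning <- function(pitches, target_game, target_inning, target_half) {
  pitches |>
    dplyr::filter(game_pk == target_game,
                  inning == target_inning,
                  inning_topbot == target_half) |>
    dplyr::arrange(at_bat_number, pitch_number)
}

# Toy pitches (made up), deliberately out of order
toy_pitches <- data.frame(
  game_pk        = c(1L, 1L, 1L, 1L, 1L),
  inning         = c(8L, 8L, 8L, 8L, 9L),
  inning_topbot  = c("Top", "Bot", "Top", "Top", "Top"),
  at_bat_number  = c(61L, 66L, 60L, 61L, 70L),
  pitch_number   = c(2L, 1L, 1L, 1L, 1L),
  bat_score      = c(1L, 2L, 1L, 1L, 3L),
  post_bat_score = c(2L, 2L, 1L, 1L, 3L)
)

toy_inning <- get_half_inning(toy_pitches, 1L, 8L, "Top")
toy_inning
```

For the game in the lecture, the call would be along the lines of get_half_inning(statcast2024, 745444, 8, "Top"), since the Dodgers batted as the away team.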

Computing All \(R_{i,a}\)’s

  • Split data table based on half-inning
    • group_by(game_pk, inning, inning_topbot)
  • Get the last value of post_bat_score w/in half-inning
  • \(R_{i,a}\): dplyr::last(post_bat_score) - bat_score
statcast2024 <-
  statcast2024 |>
  dplyr::group_by(game_pk, inning, inning_topbot) |> 
  dplyr::arrange(at_bat_number, pitch_number) |> 
  dplyr::mutate(RunsRemaining = dplyr::last(post_bat_score) - bat_score) |>
  dplyr::ungroup()
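The same grouped computation on a made-up inning with two half-innings shows that each half-inning gets its own final score (the toy data frame is an assumption for illustration):

```r
# Top of the 1st: 3 runs score in total; bottom of the 1st: none.
toy_scores <- data.frame(
  game_pk        = 1L,
  inning         = c(1L, 1L, 1L, 1L, 1L),
  inning_topbot  = c("Top", "Top", "Top", "Bot", "Bot"),
  at_bat_number  = c(1L, 1L, 2L, 3L, 4L),
  pitch_number   = c(1L, 2L, 1L, 1L, 1L),
  bat_score      = c(0L, 0L, 1L, 0L, 0L),
  post_bat_score = c(0L, 1L, 3L, 0L, 0L)
)

toy_scores <-
  toy_scores |>
  dplyr::group_by(game_pk, inning, inning_topbot) |>
  dplyr::arrange(at_bat_number, pitch_number) |>
  # last() is evaluated within each half-inning because of group_by()
  dplyr::mutate(RunsRemaining = dplyr::last(post_bat_score) - bat_score) |>
  dplyr::ungroup()

toy_scores$RunsRemaining
```

The first two pitches see all 3 runs still to come, the third pitch sees 2, and both pitches in the bottom half see 0.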

Computing Expected Runs

  • Recall definition: \(\rho(\textrm{o}, \textrm{br}) = \mathbb{E}[R \vert \textrm{o}, \textrm{br}]\)
  • Group pitches by Outs and BaseRunner and average RunsRemaining
expected_runs <-
  statcast2024 |>
  dplyr::filter(pitch_number == 1) |>
  dplyr::select(Outs, BaseRunner, RunsRemaining) |>
  dplyr::group_by(Outs, BaseRunner) |>
  dplyr::summarize(rho = mean(RunsRemaining), .groups = "drop")
  • Useful to create 25th state for end of inning
expected_runs <-
  expected_runs |>
  tibble::add_row(Outs=3, BaseRunner="000", rho = 0)
# A tibble: 8 × 4
  BaseRunner `Outs: 0` `Outs: 1` `Outs: 2`
  <chr>          <dbl>     <dbl>     <dbl>
1 000            0.488     0.262    0.0980
2 001            1.43      0.972    0.352 
3 010            1.07      0.672    0.347 
4 011            2.03      1.44     0.612 
5 100            0.897     0.529    0.228 
6 101            1.90      1.22     0.502 
7 110            1.49      0.926    0.449 
8 111            2.31      1.58     0.815 
  • First single: Ohtani’s team increased run expectancy by about 0.13
    • Starting: Outs=2, BaseRunner = '000': \(\rho \approx 0.098\)
    • Ending: Outs=2, BaseRunner= '100': \(\rho \approx 0.228\)
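The wide 8-row display can be produced from the long expected_runs table by pivoting. The slides do not show this step, so the following is only a sketch, assuming tidyr::pivot_wider() and a toy table holding a few of the values shown above:

```r
# A few rows of the long table, including the artificial end-of-inning state
toy_expected_runs <- data.frame(
  Outs       = c(0, 1, 2, 0, 1, 2, 3),
  BaseRunner = c("000", "000", "000", "100", "100", "100", "000"),
  rho        = c(0.488, 0.262, 0.098, 0.897, 0.529, 0.228, 0)
)

rho_wide <-
  toy_expected_runs |>
  dplyr::filter(Outs < 3) |>   # drop the end-of-inning state before pivoting
  tidyr::pivot_wider(names_from = Outs, values_from = rho,
                     names_prefix = "Outs: ")
rho_wide
```

Dropping the Outs = 3 row first avoids an extra, mostly empty column in the wide table.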

Run Value

  • \(\textrm{RunsScored}\): number of runs scored in each at-bat
\[ \textrm{RunValue} = \textrm{RunsScored} + \rho(\textrm{o}_{\text{end}}, \textrm{br}_{\text{end}}) - \rho(\textrm{o}_{\text{start}}, \textrm{br}_{\text{start}}) \]
  • Run value tracks actual number of runs scored and change in expectancy
  • To compute \(\textrm{RunValue}\) we must
    1. Compute \(\textrm{RunsScored}\)
    2. Determine starting and ending game-states (i.e., Outs and BaseRunner)
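As a worked example, assuming the run-expectancy estimates tabulated above (rounded to three decimals), the two Ohtani singles from the motivating game come out as follows. The first single started with 2 outs and the bases empty and ended with 2 outs and a runner on first, with no runs scored. The second started and ended with 1 out and runners on first and second, with one run scored.
\[ \textrm{RunValue}_{\text{first}} = 0 + \rho(2, "100") - \rho(2, "000") \approx 0 + 0.228 - 0.098 = 0.130 \]
\[ \textrm{RunValue}_{\text{second}} = 1 + \rho(1, "110") - \rho(1, "110") = 1 + 0.926 - 0.926 = 1 \]
By this measure the second single was worth roughly 0.87 more runs than the first.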

Computing \(\textrm{RunsScored}\)

  • Computing \(\textrm{RunsScored}\) involves
    1. Sort pitches by at-bat number & pitch-number
    2. Subtract first bat_score from last post_bat_score in each at-bat
statcast2024 <-
  statcast2024 |>
  dplyr::group_by(game_pk, at_bat_number) |> 
  dplyr::arrange(pitch_number) |> 
  dplyr::mutate(RunsScored = dplyr::last(post_bat_score) - dplyr::first(bat_score)) |> 
  dplyr::ungroup() |>
  dplyr::arrange(game_date, game_pk, at_bat_number, pitch_number)
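The score can change in the middle of an at-bat (for example on a wild pitch), which is why the first bat_score and the last post_bat_score are used rather than a single pitch. A minimal sketch on made-up data, where one run scores on the second pitch of an at-bat and another on the third:

```r
# Toy data: at-bat 1 has three pitches and two runs; at-bat 2 has none.
toy_atbats <- data.frame(
  game_pk        = 1L,
  at_bat_number  = c(1L, 1L, 1L, 2L, 2L),
  pitch_number   = c(1L, 2L, 3L, 1L, 2L),
  bat_score      = c(0L, 0L, 1L, 2L, 2L),
  post_bat_score = c(0L, 1L, 2L, 2L, 2L)
)

toy_atbats <-
  toy_atbats |>
  dplyr::group_by(game_pk, at_bat_number) |>
  dplyr::arrange(pitch_number) |>
  dplyr::mutate(
    RunsScored = dplyr::last(post_bat_score) - dplyr::first(bat_score)) |>
  dplyr::ungroup() |>
  dplyr::arrange(game_pk, at_bat_number, pitch_number)

toy_atbats$RunsScored
```

Every pitch of at-bat 1 carries RunsScored = 2, so under this definition both runs are attributed to that at-bat regardless of how they scored.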

Starting & Ending States

  • Starting state of pitch \(i+1\) is ending state of pitch \(i\)
  • Use dplyr::lead() to get next value (next_Outs, next_BaseRunner)
runValue2024 <- 
  statcast2024 |>
  dplyr::group_by(game_pk, inning, inning_topbot) |> 
  dplyr::arrange(at_bat_number, pitch_number) |>
  dplyr::mutate(
    next_Outs = dplyr::lead(Outs), 
    next_BaseRunner = dplyr::lead(BaseRunner)) |>
  dplyr::ungroup() |>
  # Last value of next_Outs and next_BaseRunner gives at-bat's ending state
  # Compute this w/in every at-bat in every game
  dplyr::group_by(game_pk, at_bat_number) |>
  dplyr::arrange(pitch_number) |>
  dplyr::mutate(
    end_Outs = dplyr::last(next_Outs), 
    end_BaseRunner = dplyr::last(next_BaseRunner)) |> 
  dplyr::ungroup() |>
  # Must convert from pitch- to at-bat-level
  # First pitch in at-bat gives starting state
  dplyr::arrange(game_date, game_pk, at_bat_number, pitch_number) |>
  dplyr::filter(pitch_number == 1) |> 
  dplyr::select(
    game_pk, at_bat_number, 
    inning, inning_topbot, 
    Outs, BaseRunner, 
    RunsScored, RunsRemaining, 
    end_Outs, end_BaseRunner)

Computing \(\textrm{RunValue}\)

  • dplyr::lead() produces NA’s at the end of half-inning
runValue2024 <-
  runValue2024 |>
  dplyr::mutate(
    end_Outs = ifelse(is.na(end_Outs), 3, end_Outs),
    end_BaseRunner = ifelse(is.na(end_BaseRunner), '000', end_BaseRunner))
end_expected_runs <- 
  expected_runs |>
  dplyr::rename(
    end_Outs = Outs,
    end_BaseRunner = BaseRunner,
    end_rho = rho)

runValue2024 <-
  runValue2024 |>
  dplyr::left_join(y = expected_runs, by = c("Outs", "BaseRunner")) |>
  dplyr::left_join(y = end_expected_runs, by = c("end_Outs", "end_BaseRunner")) |>
  dplyr::mutate(RunValue = RunsScored + end_rho - rho) |>
  dplyr::select(game_pk, at_bat_number, RunValue)
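The whole pipeline can be checked by hand on a tiny example. The sketch below is a simplification that assumes one row per at-bat (so lead() is applied directly at the at-bat level) and a made-up half-inning with two at-bats: a two-out single with the bases empty, followed by an inning-ending out. The three rho values are taken from the table above; everything else is toy data.

```r
# Run expectancy for the three states that occur in the toy half-inning
toy_rho <- data.frame(
  Outs       = c(2, 2, 3),
  BaseRunner = c("000", "100", "000"),
  rho        = c(0.098, 0.228, 0)
)

# One row per at-bat, giving the starting state and runs scored
toy_ab <- data.frame(
  game_pk = 1L, inning = 3L, inning_topbot = "Top",
  at_bat_number = c(20L, 21L),
  Outs          = c(2, 2),
  BaseRunner    = c("000", "100"),
  RunsScored    = c(0, 0)
)

toy_rv <-
  toy_ab |>
  dplyr::group_by(game_pk, inning, inning_topbot) |>
  dplyr::arrange(at_bat_number) |>
  dplyr::mutate(
    end_Outs = dplyr::lead(Outs),
    end_BaseRunner = dplyr::lead(BaseRunner)) |>
  dplyr::ungroup() |>
  # The last at-bat of the half-inning has no successor: use the end state
  dplyr::mutate(
    end_Outs = ifelse(is.na(end_Outs), 3, end_Outs),
    end_BaseRunner = ifelse(is.na(end_BaseRunner), "000", end_BaseRunner)) |>
  dplyr::left_join(y = toy_rho, by = c("Outs", "BaseRunner")) |>
  dplyr::left_join(
    y = dplyr::rename(toy_rho, end_Outs = Outs,
                      end_BaseRunner = BaseRunner, end_rho = rho),
    by = c("end_Outs", "end_BaseRunner")) |>
  dplyr::mutate(RunValue = RunsScored + end_rho - rho)

toy_rv$RunValue
```

The single is worth about +0.130 runs and the inning-ending out about -0.228, so the two at-bats together sum to the -0.098 lost from the starting state.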

Assessing Batter Performance

Ohtani’s Single-Game Performance

  • Extract every Ohtani at-bat from game against Padres
  • Append run value by joining using game_pk and at_bat_number
ohtani_id <- 
  player2024_lookup |>
  dplyr::filter(FullName == "Shohei Ohtani") |>
  dplyr::pull(key_mlbam)

ohtani_ab <-
  statcast2024 |>
  dplyr::filter(game_pk == 745444) |>
  dplyr::filter(pitch_number == 1 & batter == ohtani_id) |>
  dplyr::select(game_pk, at_bat_number, inning, des) |>
  dplyr::inner_join(y = runValue2024, by = c("game_pk", "at_bat_number")) |>
  dplyr::select(inning, RunValue, des)
ohtani_ab
# A tibble: 5 × 3
  inning RunValue des                                                           
   <int>    <dbl> <chr>                                                         
1      1   -0.367 Shohei Ohtani grounds into a force out, shortstop Ha-Seong Ki…
2      3    0.130 Shohei Ohtani singles on a sharp line drive to right fielder …
3      5   -0.367 Shohei Ohtani grounds into a force out, third baseman Tyler W…
4      7   -0.164 Shohei Ohtani grounds out softly, pitcher Wandy Peralta to fi…
5      8    1     Shohei Ohtani singles on a line drive to left fielder José Az…

Season Leaders

  • Use batter in statcast2024 to look up player name in player2024_lookup
  • Problem: player2024_lookup doesn’t have column batter
  • Solution: temporary look-up table renaming key_mlbam to batter
tmp_lookup <-
  player2024_lookup |>
  dplyr::select(key_mlbam, Name) |>
  dplyr::rename(batter = key_mlbam)
  • For each player, sum RunValue across all at-bats (RE24)
re24 <-
  statcast2024 |> 
  dplyr::filter(pitch_number == 1) |>
  dplyr::select(game_pk, at_bat_number, batter) |>
  dplyr::inner_join(y = runValue2024, by = c("game_pk", "at_bat_number")) |>
  dplyr::group_by(batter) |>
  dplyr::summarise(RE24 = sum(RunValue), N = dplyr::n()) |>
  dplyr::inner_join(y = tmp_lookup, by = "batter") |>
  dplyr::select(Name, RE24, N)
  • Aaron Judge appears to have created the most run value for his team in 2024
re24 |> 
  dplyr::arrange(dplyr::desc(RE24)) |>
  dplyr::slice_head(n=10)
# A tibble: 10 × 3
   Name               RE24     N
   <chr>             <dbl> <int>
 1 Aaron Judge        89.0   675
 2 Juan Soto          74.4   693
 3 Shohei Ohtani      73.1   708
 4 Bobby Witt         65.9   694
 5 Brent Rooker       47.9   599
 6 Vladimir Guerrero  45.0   671
 7 Ketel Marte        41.2   562
 8 Kyle Schwarber     40.9   672
 9 Joc Pederson       39.2   433
10 Jose Ramirez       39.1   657

Looking Ahead

  • Suppose batter hits a single and baserunner advances from 1st to 3rd
    • Run Value reflects change in run expectancy (BaseRunner: '100' \(\rightarrow\) '101')
    • Is it fair to give batter all the credit for creating run value?
  • Lecture 7: Dividing run value b/w baserunner and batter
  • Lecture 8: Dividing negative run value b/w pitcher and fielder

Projects

  • Project Check-in: by Friday 26 September email me a short overview of your project
    • Precise problem statement & overview of data and analysis plan
    • Happy to help brainstorm & narrow down analysis in Office Hours (or by appt.)
    • Inspiration: exercises in lecture notes & project information page
  • Office Hours this week:
    • Tuesday (today): 4pm - 5:45pm (MH 5586)
    • Wednesday: 3pm - 4pm (MH 5586)
    • Friday: 1pm - 3pm (MH closed after 3pm)