STAT 479 Lecture 1

Course Overview

Logistics

Lectures & Office Hours

  • Lectures: Tuesdays & Thursdays 11am-12:15pm (Morgridge Hall 1524)
  • Instructor: Sameer Deshpande
    • Mondays: 11am - 12pm (Morgridge Hall 5586)
    • Wednesdays: 3pm - 4pm (Morgridge Hall 5586)
    • Fridays: 3pm - 4:30pm (Morgridge Hall 5618)
  • TA: Zhexuan Liu
    • Tuesdays & Thursdays 9:15am-10:45am (Morgridge Hall 2515)

What This Course is NOT

  • NOT a statistical methods course w/ sports applications
  • Methods introduced as needed
    • But will not cover all technical details
    • I’ll point to relevant resources & classes where appropriate
    • Main goal: answering substantive sports questions
  • NOT a course on sports betting or fantasy sports
    • Some material may be relevant …
    • … but we won’t explicitly discuss these topics
  • NOT a way to discuss last night’s game for credit
    • This is a serious statistics & data science course
  • Goal: practice skills needed to work w/ sports data (e.g. for a team)
    • Course design reflects input from leaders in the industry

Course Learning Outcomes

  1. Implement appropriate statistical methods to assess player and team performance
  2. Work with play-by-play and high-resolution tracking data
  3. Provide constructive and actionable feedback on your peers’ analytic reports
  4. Build a personal portfolio of sports data analyses
  • Asking and answering substantive sports questions w/ data
    • We’ll go well beyond “who’s the best at…”

What is Sports Analytics?

Statistics vs Analytics

  • Sports statistics refer to counts and rates that summarize performance:
    • Number of rebounds, wins, games played
    • Free throw percentage, batting average
  • Analytics refers to the use of statistical modeling to help gain a competitive advantage
  • Analytics makes use of basic statistics & creates new ones
  • Analytics uses data to answer substantive sports questions
  • Many questions can be framed in terms of prediction

“Best” NBA Shooting Performances

  • Who is the best shooter in NBA history?
    • What do we mean by “best”?
    • It’s not clear that this can be answered w/ data
  • A more precise question: which player made the most shots in a single season?
    • Can be answered by data
    • We’ll assess whether “best” = “made most shots” is satisfactory later
  • Data: box scores from each game since 2001-02 (labeled season 2002 in the data)
  • Available from the hoopR package
raw_box <- hoopR::load_nba_player_box(seasons = 2002:(hoopR::most_recent_nba_season()))

Data Snapshot

 [1] "game_id"                           "season"                           
 [3] "season_type"                       "game_date"                        
 [5] "game_date_time"                    "athlete_id"                       
 [7] "athlete_display_name"              "team_id"                          
 [9] "team_name"                         "team_location"                    
[11] "team_short_display_name"           "minutes"                          
[13] "field_goals_made"                  "field_goals_attempted"            
[15] "three_point_field_goals_made"      "three_point_field_goals_attempted"
[17] "free_throws_made"                  "free_throws_attempted"            
[19] "offensive_rebounds"                "defensive_rebounds"               
[21] "rebounds"                          "assists"                          
[23] "steals"                            "blocks"                           
[25] "turnovers"                         "fouls"                            
[27] "plus_minus"                        "points"                           
[29] "starter"                           "ejected"                          
[31] "did_not_play"                      "active"                           
[33] "athlete_jersey"                    "athlete_short_name"               
[35] "athlete_headshot_href"             "athlete_position_name"            
[37] "athlete_position_abbreviation"     "team_display_name"                
[39] "team_uid"                          "team_slug"                        
[41] "team_logo"                         "team_abbreviation"                
[43] "team_color"                        "team_alternate_color"             
[45] "home_away"                         "team_winner"                      
[47] "team_score"                        "opponent_team_id"                 
[49] "opponent_team_name"                "opponent_team_location"           
[51] "opponent_team_display_name"        "opponent_team_abbreviation"       
[53] "opponent_team_logo"                "opponent_team_color"              
[55] "opponent_team_alternate_color"     "opponent_team_score"              
[57] "reason"                           
raw_box |> dplyr::filter(game_date == "2011-06-12") |>
  dplyr::select(athlete_display_name, 
         field_goals_made, field_goals_attempted)
# A tibble: 30 × 3
   athlete_display_name field_goals_made field_goals_attempted
   <chr>                           <int>                 <int>
 1 Dirk Nowitzki                       9                    27
 2 Tyson Chandler                      2                     4
 3 Jason Kidd                          2                     4
 4 Shawn Marion                        4                    10
 5 J.J. Barea                          7                    12
 6 Brian Cardinal                      1                     1
 7 Caron Butler                       NA                    NA
 8 Ian Mahinmi                         2                     3
 9 Rodrigue Beaubois                  NA                    NA
10 DeShawn Stevenson                   3                     5
# ℹ 20 more rows

Extract Regular Season Data

allstar_dates <- lubridate::date(c("2002-02-10", "2003-02-09", 
    "2004-02-15","2005-02-20", "2006-02-19", "2007-02-18", 
    "2008-02-17", "2009-02-15", "2010-02-14","2011-02-20", 
    "2012-02-26", "2013-02-17", "2014-02-16", "2015-02-15", 
    "2016-02-14","2017-02-19", "2018-02-18", "2019-02-17",
    "2020-02-16", "2021-03-07", "2022-02-20","2023-02-19", 
    "2024-02-18", "2025-02-16"))
# Keep regular-season rows for players who actually played, excluding All-Star dates
reg_box <- raw_box |>
  dplyr::filter(season_type == 2 & !did_not_play & !(game_date %in% allstar_dates))

Preprocessing

  • Rename columns & set defaults for players with no minutes
reg_box <-
  reg_box |>
  dplyr::rename(
    Player = athlete_display_name, 
    FGM = field_goals_made, FGA = field_goals_attempted,
    TPM = three_point_field_goals_made,TPA = three_point_field_goals_attempted,
    FTM = free_throws_made, FTA = free_throws_attempted) |>
  dplyr::mutate(
    # Rows with missing minutes get shot counts of 0 instead of NA
    FGM = ifelse(is.na(minutes), 0, FGM), FGA = ifelse(is.na(minutes), 0, FGA),
    TPM = ifelse(is.na(minutes), 0, TPM), TPA = ifelse(is.na(minutes), 0, TPA),
    FTM = ifelse(is.na(minutes), 0, FTM), FTA = ifelse(is.na(minutes), 0, FTA)) |>
  tidyr::replace_na(list(minutes = 0))         

Season Totals

season_box <-
  reg_box |>
  dplyr::group_by(Player, season) |>
  dplyr::summarise(FGM = sum(FGM), FGA = sum(FGA),
    TPM = sum(TPM), TPA = sum(TPA), FTM = sum(FTM), FTA = sum(FTA),
    minutes = sum(minutes), n_games = dplyr::n(), .groups = "drop")
season_box |> dplyr::filter(Player == "Dirk Nowitzki") 
# A tibble: 18 × 10
   Player        season   FGM   FGA   TPM   TPA   FTM   FTA minutes n_games
   <chr>          <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>   <int>
 1 Dirk Nowitzki   2002   600  1258   139   350   440   516    2891      76
 2 Dirk Nowitzki   2003   690  1489   148   390   483   548    3117      80
 3 Dirk Nowitzki   2004   605  1310    99   290   371   423    2915      77
 4 Dirk Nowitzki   2005   663  1445    91   228   615   708    3020      78
 5 Dirk Nowitzki   2006   751  1564   110   271   539   598    3086      81
 6 Dirk Nowitzki   2007   673  1341    72   173   498   551    2819      80
 7 Dirk Nowitzki   2008   630  1314    79   220   478   544    2769      81
 8 Dirk Nowitzki   2009   774  1616    61   170   485   545    3051      82
 9 Dirk Nowitzki   2010   720  1496    51   121   536   586    3041      82
10 Dirk Nowitzki   2011   610  1179    66   168   395   443    2505      82
11 Dirk Nowitzki   2012   473  1034    78   212   318   355    2078      66
12 Dirk Nowitzki   2013   332   707    63   151   164   191    1628      52
13 Dirk Nowitzki   2014   633  1273   131   329   338   376    2625      80
14 Dirk Nowitzki   2015   487  1062   104   274   255   289    2285      77
15 Dirk Nowitzki   2016   498  1112   126   342   250   280    2362      75
16 Dirk Nowitzki   2017   296   678    79   209    98   112    1421      54
17 Dirk Nowitzki   2018   346   758   138   337    97   108    1901      77
18 Dirk Nowitzki   2019   135   376    64   205    39    50     794      51

Who Made the Most Shots?

season_box |>
  dplyr::arrange(dplyr::desc(FGM)) |>
  dplyr::select(Player, season, FGM, FGA, minutes, n_games) |> 
  dplyr::slice_head(n=5)
# A tibble: 5 × 6
  Player                  season   FGM   FGA minutes n_games
  <chr>                    <int> <dbl> <dbl>   <dbl>   <int>
1 Kobe Bryant               2006   949  2109    3184      78
2 LeBron James              2006   875  1823    3361      82
3 Kobe Bryant               2003   868  1924    3401      82
4 Shai Gilgeous-Alexander   2025   868  1680    2633      77
5 LeBron James              2018   857  1580    3024      82
  • Is Kobe’s 2002-03 really the same as SGA’s 2024-25???
  • Arguably “best” should account for efficiency
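
To make the efficiency point concrete, the two tied seasons in the table above can be compared by makes per attempt. This is a minimal base-R sketch that uses only the FGM and FGA values printed above; the data frame name `tied` and the column `extra_attempts` are illustrative and not part of the lecture code.

```r
# FGM and FGA copied from the leaderboard above (both seasons have 868 makes)
tied <- data.frame(
  Player = c("Kobe Bryant", "Shai Gilgeous-Alexander"),
  season = c(2003, 2025),
  FGM = c(868, 868),
  FGA = c(1924, 1680)
)

# Makes per attempt: same number of makes, different number of attempts
tied$FGP <- tied$FGM / tied$FGA

# How many more attempts each season took than the lower-volume one
tied$extra_attempts <- tied$FGA - min(tied$FGA)

print(tied)
```

Bryant needed 244 more attempts to reach the same total, which is exactly the kind of gap an efficiency measure should capture.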

Field Goal Percentage

season_box <-
  season_box |>
  dplyr::mutate(FGP = ifelse(FGA > 0, FGM/FGA, NA_real_))
season_box |> 
  dplyr::arrange(dplyr::desc(FGP)) |>
  dplyr::slice_head(n=5) |>
  dplyr::select(Player, season, FGP)
# A tibble: 5 × 3
  Player           season   FGP
  <chr>             <int> <dbl>
1 Ahmad Caver        2022     1
2 Alondes Williams   2025     1
3 Andris Biedrins    2014     1
4 Anthony Brown      2018     1
5 Braxton Key        2023     1
season_box |> 
  dplyr::arrange(dplyr::desc(FGP)) |>
  dplyr::slice_head(n=5) |>
  dplyr::select(Player, season, FGP, FGA)
# A tibble: 5 × 4
  Player           season   FGP   FGA
  <chr>             <int> <dbl> <dbl>
1 Ahmad Caver        2022     1     1
2 Alondes Williams   2025     1     2
3 Andris Biedrins    2014     1     1
4 Anthony Brown      2018     1     1
5 Braxton Key        2023     1     1
season_box |> 
  dplyr::filter(FGA >= 400) |> 
  dplyr::arrange(dplyr::desc(FGP)) |>
  dplyr::select(Player, season, FGP, FGA) |>
  dplyr::slice_head(n = 5)
# A tibble: 5 × 4
  Player         season   FGP   FGA
  <chr>           <int> <dbl> <dbl>
1 Daniel Gafford   2024 0.725   480
2 Walker Kessler   2023 0.720   414
3 DeAndre Jordan   2017 0.714   577
4 Rudy Gobert      2022 0.713   508
5 DeAndre Jordan   2015 0.710   534

Effective Field Goal Percentage

  • FGP is arguably a better measure of skill than FGM, but:
    • Extreme values for players w/ few attempts
    • Doesn’t distinguish 2- and 3-point shots
  • \(\textrm{eFGP} = (\textrm{FGM} + 0.5 \times \textrm{TPM})/\textrm{FGA}\)
season_box <- 
  season_box |> 
  dplyr::mutate(eFGP = ifelse(FGA > 0, (FGM + 0.5 * TPM)/FGA, NA_real_))
  • eFGP leaders are mostly centers who don’t shoot 3’s
# A tibble: 5 × 6
  Player         season  eFGP   FGP   TPA n_games
  <chr>           <int> <dbl> <dbl> <dbl>   <int>
1 Daniel Gafford   2024 0.725 0.725     0      74
2 Walker Kessler   2023 0.721 0.720     3      74
3 DeAndre Jordan   2017 0.714 0.714     2      81
4 Rudy Gobert      2022 0.713 0.713     4      66
5 DeAndre Jordan   2015 0.711 0.710     4      82
  • We can restrict to players with \(\textrm{FGA} > 400\) and \(\textrm{TPA} > 100\)
# A tibble: 5 × 6
  Player          season  eFGP   FGP   TPA n_games
  <chr>            <int> <dbl> <dbl> <dbl>   <int>
1 Kyle Korver       2015 0.671 0.487   449      75
2 Duncan Robinson   2020 0.667 0.470   606      73
3 Obi Toppin        2024 0.660 0.571   260      83
4 Nikola Jokic      2023 0.660 0.632   149      69
5 Joe Harris        2021 0.655 0.502   427      69
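
As a check on the formula, eFGP can be computed by hand from the season totals shown earlier. The sketch below uses three of Dirk Nowitzki's rows from that table and applies the same cut-offs quoted on this slide. It is written in base R so it runs without extra packages, and the object names (`dirk`, `qualified`) are illustrative rather than taken from the lecture code.

```r
# Season totals copied from the Dirk Nowitzki table shown earlier
dirk <- data.frame(
  season = c(2006, 2011, 2019),
  FGM = c(751, 610, 135),
  FGA = c(1564, 1179, 376),
  TPM = c(110, 66, 64),
  TPA = c(271, 168, 205)
)

# FGP counts every make equally; eFGP adds an extra half make per made three
dirk$FGP  <- dirk$FGM / dirk$FGA
dirk$eFGP <- (dirk$FGM + 0.5 * dirk$TPM) / dirk$FGA

# Apply the volume cut-offs, then sort from highest to lowest eFGP
qualified <- dirk[dirk$FGA > 400 & dirk$TPA > 100, ]
qualified <- qualified[order(-qualified$eFGP), ]

print(qualified[, c("season", "FGP", "eFGP", "TPA")])
```

The 2019 row drops out because its 376 attempts fall below the FGA cut-off, which shows how a fixed threshold discards some seasons outright.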

Reflection

  • eFGP is arguably better than FGP and FGM
  • But it is still highly variable
  • Refined question: instead of “who’s the best”, we can ask

“what is the probability that a player makes a shot”

  • Later: methods to estimate these probs. that
    • Account for contextual factors (e.g., shot location)
    • Produce stable estimates even w/ small sample sizes
    • Avoid imposing arbitrary cut-offs (e.g., \(\textrm{FGA} > 400\))
    • Calibrate performance against transparent baselines (e.g., “above replacement”)
  • Analytics involves motivating and justifying choices
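
One way to see why percentage-based measures are unstable for low-volume shooters is the binomial standard error. The sketch below assumes every attempt is an independent make with the same probability, a simplification that ignores the contextual factors listed above. The value 0.45 is an arbitrary illustration, not an estimate from the data.

```r
# Assumed (hypothetical) probability of making any single shot
p_make <- 0.45

# Attempt counts ranging from a single shot to a high-volume season
n_attempts <- c(1, 10, 100, 400, 1500)

# Standard error of the observed make percentage under a binomial model
se_fgp <- sqrt(p_make * (1 - p_make) / n_attempts)

print(data.frame(n_attempts = n_attempts, se_fgp = round(se_fgp, 3)))
```

Under this assumption the standard error shrinks in proportion to 1/sqrt(n), so going from 100 to 400 attempts halves it.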

More Course Logistics

Requisites

  • STAT 333 or 340
    • Random variables, expectations, probability
    • Fitting linear models & interpreting outputs
  • Prior experience with R
    • Assignment, scripting, loops, control flow
    • Saving objects, installing & loading packages
    • Data manipulation with dplyr and other tidyverse packages
    • Creating visualizations

Course Website

Also Canvas & Piazza

Assignments & Grading

  • 3 Group Projects (900 pts): Due on 10/10, 11/7, and 12/5
    • Written Report (100 pts) & Presentation (100 pts)
    • Peer Reviews (60 pts)
    • Team Accountability Survey (40 pts)
  • Participation (100 pts): Assessed holistically
    • Regularly attend lectures & office hours
    • Contribute to Piazza discussions
  • Final grade is based on how many of the 1000 points you earn

Projects

  • Work in groups of up to 4
    • Same groups for Projects 1 & 2; can change for Project 3
    • Form groups by Friday September 12
    • Use Piazza to find teammates & sign up on Canvas
  • Course projects can
    • Modify or extend analysis from lecture (e.g., to new sport)
    • Answer a new question not covered in lecture
  • Ideally jump-start a portfolio that you can show teams

Project Report

  • Executive Summary:
    • Non-technical overview of goals, methods, and results
    • Audience: front office executive, coach, or player
  • Technical Report:
    • Include all code needed to reproduce findings
    • Tightly integrate code & output w/ written exposition
    • Audience: fellow data scientists

I highly recommend using Quarto or RMarkdown

Project Presentation

  • Record & upload an 8–10 minute presentation
  • Every group member must speak
  • I’ll select 5-6 groups to present in-class on the last day (12/9)
    • Details to be announced
    • Presenting on the last day is unrelated to course grades
    • But there’ll be prizes

Project Peer Review

  • Provide feedback on 3 presentations & 3 reports
    • Rubrics will be available
    • Fill out rubric + leave constructive comments
  • Due 1 week after projects: 10/17, 11/14, and 12/12

Team Accountability Survey

  • Assign score from 0 (least) to 10 (most) for
    • Participation, Preparation, and Respectfulness
  • Rate every group member (including yourself)
  • Ratings and comments will be kept anonymous
  • Course staff may give warnings for low peer scores

Topic List

Unit 1: Quantifying Performances

  • Expected vs actual performance
  • Value of a game state
  • Performance above “replacement” level
  • Case Studies:
    • Expected goals in soccer (Lectures 2 & 3)
    • Adjusted plus/minus in the NBA (Lectures 4 & 5)
    • Run expectancy & WAR in baseball (Lectures 6–8)
    • WAR for the NFL (Lectures 9 & 10)
    • Pitch Framing in baseball (Lecture 11)
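
The "expected vs actual" idea can be sketched in a few lines. Suppose a model has already assigned each shot a probability of scoring. The probabilities and outcomes below are made up for illustration and do not come from any course data set or fitted model.

```r
# Hypothetical model-based scoring probabilities for six shots
xg <- c(0.05, 0.30, 0.10, 0.75, 0.02, 0.20)

# Hypothetical observed outcomes (1 = goal, 0 = no goal)
goal <- c(0, 1, 0, 1, 0, 1)

expected_goals <- sum(xg)    # goals the model expects from these shots
actual_goals <- sum(goal)    # goals actually scored
goals_above_expected <- actual_goals - expected_goals

print(c(expected = expected_goals, actual = actual_goals,
        above_expected = goals_above_expected))
```

A positive difference means the shooter scored more than the model expected from those chances; whether that reflects skill or luck is a separate question.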

Project 1

  • Use box-score or play-by-play data
  • Introduce a new metric & show it has favorable properties
    • Season-to-season stability
    • Ability to predict match- or season-level outcomes
    • Reveal new insight about player valuation
  • Evaluate individual or team performance in a new sport
    • WAR for volleyball or college football?
    • Expected goals in hockey?
  • Extend case study from lecture
    • Construct analogs for another sport
    • Examine season-to-season variability
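
One common way to check season-to-season stability is to pair each player's value in one season with the same player's value in the next season and correlate the two. The sketch below does this in base R. The players, seasons, and metric values are hypothetical placeholders, and correlation is only one of several reasonable stability summaries.

```r
# Hypothetical metric values for four players over three seasons
metric <- data.frame(
  Player = rep(c("A", "B", "C", "D"), each = 3),
  season = rep(2023:2025, times = 4),
  value  = c(0.52, 0.55, 0.53,
             0.47, 0.45, 0.48,
             0.60, 0.58, 0.61,
             0.50, 0.49, 0.46)
)

# Relabel each value with the previous season so that it lines up
# with that season's row for the same player
next_season <- transform(metric, season = season - 1L)
names(next_season)[names(next_season) == "value"] <- "value_next"

# Keep only player-seasons that have a following season
paired <- merge(metric, next_season, by = c("Player", "season"))

# Correlation between consecutive seasons: closer to 1 means more stable
stability <- cor(paired$value, paired$value_next)

print(nrow(paired))
print(stability)
```

Each player contributes two consecutive-season pairs here, so the correlation is computed over eight pairs.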

Unit 2: Rankings & Simulation

  • Use models to simulate games, tournaments, drafts, etc.
  • Case Studies:
    • NCAA Volleyball & Hockey Tournaments (Lectures 12 & 13)
    • Markov chain simulations (Lectures 14 & 15)
    • Building a consensus mock draft (Lecture 16)
    • Estimating impact of a rule change (Lecture 17)

Project 2

  • Fit a model to estimate latent team or player strength
  • Use model estimates to simulate plays, games, tournaments, drafts, etc.
  • Compute probabilities based on simulation
    • Prob. of winning tournament or making it past 1st round
    • Prob. football drive ends in a touchdown
    • Prob. of winning a cricket test
    • Chance player available in 2nd round of draft
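
As a minimal illustration of computing a probability by simulation, the sketch below estimates the chance that a team wins a best-of-seven series. The per-game win probability of 0.55 is a hypothetical stand-in for a model estimate, games are assumed independent, and the function name `simulate_series` is illustrative. Because this toy setting has an exact answer, the simulation can be checked against it.

```r
set.seed(479)

# Simulate one best-of-seven series; TRUE if the team wins at least 4 games.
# Playing out all 7 games gives the same winner as stopping at 4 wins.
simulate_series <- function(p_win, n_games = 7) {
  wins <- rbinom(1, size = n_games, prob = p_win)
  wins >= ceiling(n_games / 2)
}

p_win <- 0.55      # hypothetical per-game win probability
n_sims <- 100000   # number of simulated series

series_won <- replicate(n_sims, simulate_series(p_win))
sim_prob <- mean(series_won)

# Exact probability of at least 4 wins in 7 independent games
exact_prob <- 1 - pbinom(3, size = 7, prob = p_win)

print(c(simulated = sim_prob, exact = exact_prob))
```

In realistic settings (tournament brackets, drives, drafts) no closed form is available, which is where simulation earns its keep.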

Unit 3: Tracking Data

  • Currently the hottest area of sports analytics
    • Trackman + Statcast in baseball
    • NFL: player positions 10 times per second
    • Hawkeye for tennis, Hudl for soccer, …
  • Tracking data opens up many possibilities
    • Incorporating spatial info. into prediction models
    • Creating new metrics using tracking data
    • Space ownership & predicting trajectories
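
With positions recorded 10 times per second, quantities such as speed follow from differences between consecutive frames. The coordinates below are invented for illustration, with units assumed to be yards. Real tracking feeds often need smoothing before differencing, which this sketch omits.

```r
# Hypothetical (x, y) positions in yards for one player over six frames
x <- c(10.0, 10.5, 11.1, 11.8, 12.6, 13.5)
y <- c(20.0, 20.0, 20.1, 20.3, 20.6, 21.0)

frame_rate <- 10        # frames per second, as quoted for NFL tracking
dt <- 1 / frame_rate    # seconds between consecutive frames

# Distance covered between consecutive frames, then speed in yards per second
step_dist <- sqrt(diff(x)^2 + diff(y)^2)
speed <- step_dist / dt

print(round(speed, 2))
```

Six frames yield five speed values, one per pair of consecutive frames.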

Project 3

  • Analyze tracking data

  • NFL’s Big Data Bowl is a great opportunity

    • Best way into the field (even for non-football sports)
    • Cash prizes & chance to present at Combine
  • I highly encourage turning Project 3 into a BDB submission

    • More info will be posted on Piazza
    • I’ll provide additional feedback/input for teams that do submit

Final Reminders

Course Expectations

  • Respect diverse backgrounds

    • Not everyone may know as much about a sport/method as you
    • You may not know as much about a sport/method as someone else
  • Don’t hesitate to ask for and to provide help!

  • Take care of yourself & each other

Generative AI Expectations 1

  • You have the right to the full benefit of my expertise and engagement in this course.
  • I will therefore never use AI to
    • Provide feedback on assignments
    • Prepare any course content (e.g., slides, code, assignment flavor-text, etc.)
    • Mediate or assist any communications with you
  • Everything you see in the course was created by me without the aid of generative AI

Generative AI Expectations 2

  • A core theme of this class is practice
    • It is OK to make mistakes or not know an answer
    • Process is more important than results
  • Generative AI short-circuits the intended learning process
  • Its use is expressly prohibited

Peer Review

Respect your classmates enough to review their projects yourself. Uploading someone else’s project report or presentation to a generative AI tool (e.g., for creating summaries) is forbidden and will result in a failing grade.

Looking Ahead

  • Lectures 2 & 3: Expected Goals in Soccer
  • Will use public data provided by Hudl
  • Be sure to install the StatsBombR & ranger packages
devtools::install_github("statsbomb/StatsBombR")
install.packages("ranger")