Lecture 0: Boxscore Metrics

Motivation: The Best Shooting Season in the NBA?

Who is the best shooter in the NBA? How do we determine this using data?

In this note, we will practice using functions from the tidyverse suite of packages (especially dplyr) to manipulate tables of NBA box score data. Hopefully, much of the functionality we encounter in this lecture will be familiar to you. But, if you need a high-level refresher, I highly recommend the following resources:

Chapter 3 and Chapter 5 of R for Data Science.
Section 1.9 and Chapter 3 of Data Science: A First Introduction.

Basic Box Score Statistics

We will use the package hoopR to scrape NBA box score data. You should install the package using the following code:

install.packages("hoopR")

The function hoopR::load_nba_player_box() loads player box score data for the requested seasons:

raw_box <-
  hoopR::load_nba_player_box(seasons = 2002:(hoopR::most_recent_nba_season()))

The data table raw_box contains 813246 rows and 57 columns. Checking the column names, we see that there are columns for the numbers of field goals, three point shots, and free throws made and attempted.

colnames(raw_box)

 [1] "game_id"                           "season"                           
 [3] "season_type"                       "game_date"                        
 [5] "game_date_time"                    "athlete_id"                       
 [7] "athlete_display_name"              "team_id"                          
 [9] "team_name"                         "team_location"                    
[11] "team_short_display_name"           "minutes"                          
[13] "field_goals_made"                  "field_goals_attempted"            
[15] "three_point_field_goals_made"      "three_point_field_goals_attempted"
[17] "free_throws_made"                  "free_throws_attempted"            
[19] "offensive_rebounds"                "defensive_rebounds"               
[21] "rebounds"                          "assists"                          
[23] "steals"                            "blocks"                           
[25] "turnovers"                         "fouls"                            
[27] "plus_minus"                        "points"                           
[29] "starter"                           "ejected"                          
[31] "did_not_play"                      "active"                           
[33] "athlete_jersey"                    "athlete_short_name"               
[35] "athlete_headshot_href"             "athlete_position_name"            
[37] "athlete_position_abbreviation"     "team_display_name"                
[39] "team_uid"                          "team_slug"                        
[41] "team_logo"                         "team_abbreviation"                
[43] "team_color"                        "team_alternate_color"             
[45] "home_away"                         "team_winner"                      
[47] "team_score"                        "opponent_team_id"                 
[49] "opponent_team_name"                "opponent_team_location"           
[51] "opponent_team_display_name"        "opponent_team_abbreviation"       
[53] "opponent_team_logo"                "opponent_team_color"              
[55] "opponent_team_alternate_color"     "opponent_team_score"              
[57] "reason"

Notice as well that there are columns for the game date (game_date), game id (game_id), and player (e.g., athlete_display_name). This suggests that each row corresponds to a unique combination of game and player and records the player’s individual statistics in that game.
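One way to check this kind of claim directly is to count rows per game-player pair. The sketch below uses a small made-up table (toy_box is hypothetical, not real data) with one deliberately duplicated row. On the real data, the same count() and filter() calls could be applied to raw_box, assuming that game_id and athlete_id together identify a player's appearance in a game.

```r
# Toy stand-in for raw_box (made-up values); the last row repeats a game-player pair on purpose
toy_box <- tibble::tibble(
  game_id    = c(1L, 1L, 2L, 2L, 2L),
  athlete_id = c(10L, 11L, 10L, 11L, 11L),
  points     = c(20L, 8L, 15L, 12L, 12L)
)

# Count rows per (game_id, athlete_id) pair and keep only the pairs that occur more than once
dup_rows <-
  toy_box |>
  dplyr::count(game_id, athlete_id) |>
  dplyr::filter(n > 1)

dup_rows
```

If every row really is a unique game-player combination, the result of this check has zero rows; here it flags the one pair that was duplicated.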

For instance, here are the box score statistics for several players from a single game in 2011.

raw_box |>
  dplyr::filter(game_date == "2011-06-12") |>
  dplyr::select(athlete_display_name, 
         field_goals_made, field_goals_attempted,
         three_point_field_goals_made, three_point_field_goals_attempted,
         free_throws_made, free_throws_attempted)

── ESPN NBA Player Boxscores from hoopR data repository ───────── hoopR 2.1.0 ──

ℹ Data updated: 2025-07-31 06:39:25 CDT

# A tibble: 30 × 7
   athlete_display_name field_goals_made field_goals_attempted
   <chr>                           <int>                 <int>
 1 Dirk Nowitzki                       9                    27
 2 Tyson Chandler                      2                     4
 3 Jason Kidd                          2                     4
 4 Shawn Marion                        4                    10
 5 J.J. Barea                          7                    12
 6 Brian Cardinal                      1                     1
 7 Caron Butler                       NA                    NA
 8 Ian Mahinmi                         2                     3
 9 Rodrigue Beaubois                  NA                    NA
10 DeShawn Stevenson                   3                     5
# ℹ 20 more rows
# ℹ 4 more variables: three_point_field_goals_made <int>,
#   three_point_field_goals_attempted <int>, free_throws_made <int>,
#   free_throws_attempted <int>

As a sanity check, we can cross-reference the data in our table with the box score from ESPN. Luckily, these numbers match up!

It turns out that raw_box contains much more data than we need. Specifically, it includes statistics from play-in and play-off games as well as data from some (but not all) All-Star games. Since we’re ultimately interested in identifying the best player-seasons in terms of shooting performance, we need to remove all play-off, play-in, and All-Star games from the dataset. Additionally, the column did_not_play contains a Boolean (i.e., logical) variable that is TRUE if the player did not play in the game and FALSE if the player did play, so we also drop the rows in which it is TRUE.

allstar_dates <-
  lubridate::date(c("2002-02-10", "2003-02-09", "2004-02-15",
    "2005-02-20", "2006-02-19", "2007-02-18", 
    "2008-02-17", "2009-02-15", "2010-02-14",
    "2011-02-20", "2012-02-26", "2013-02-17", 
    "2014-02-16", "2015-02-15", "2016-02-14",
    "2017-02-19", "2018-02-18", "2019-02-17",
    "2020-02-16", "2021-03-07", "2022-02-20",
    "2023-02-19", "2024-02-18", "2025-02-16"))
reg_box <-
  raw_box |>
  dplyr::filter(
    season_type == 2 & 
      !did_not_play & 
      !game_date %in% allstar_dates)
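The condition !game_date %in% allstar_dates relies on R's operator precedence: %in% binds more tightly than !, so the expression negates the whole membership test. A minimal sketch with made-up dates (the vectors below are illustrative, not taken from the data):

```r
# Three hypothetical game dates, one of which falls on an All-Star date
game_date <- as.Date(c("2011-02-19", "2011-02-20", "2011-02-21"))
allstar   <- as.Date("2011-02-20")

# `%in%` is evaluated before `!`, so both forms keep the non-All-Star dates
keep_implicit <- !game_date %in% allstar
keep_explicit <- !(game_date %in% allstar)

identical(keep_implicit, keep_explicit)
```

Writing the parenthesized form is a matter of taste; it makes the intent explicit for readers who do not remember the precedence rules.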

Looking at the data table reg_box, we see that in about 9% of rows, the number of minutes played is missing. These likely correspond to players who were active but did not play or logged only a few seconds (generally at the end of games). We will replace these NA values with 0’s and, while doing so, rename some of the columns in reg_box.

reg_box <-
  reg_box |>
  dplyr::rename(
    Player = athlete_display_name,
    FGM = field_goals_made,
    FGA = field_goals_attempted,
    TPM = three_point_field_goals_made,
    TPA = three_point_field_goals_attempted,
    FTM = free_throws_made, 
    FTA = free_throws_attempted) |>
  dplyr::mutate(
    FGM = ifelse(is.na(minutes), 0, FGM),
    FGA = ifelse(is.na(minutes), 0, FGA),
    TPM = ifelse(is.na(minutes), 0, TPM),
    TPA = ifelse(is.na(minutes), 0, TPA),
    FTM = ifelse(is.na(minutes), 0, FTM),
    FTA = ifelse(is.na(minutes), 0, FTA)) |>
  tidyr::replace_na(list(minutes = 0))

1: Rename several variables
2: For those rows where minutes is NA, set the numbers of makes and attempts to 0
3: Replace missing minutes values with 0
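The order of steps 2 and 3 matters: the shooting columns are zeroed out while minutes is still NA, and only afterwards is minutes itself filled in. A small sketch on a made-up three-row table (toy is hypothetical and only two shooting columns are shown) illustrates the effect:

```r
# Made-up rows: player B has missing minutes and missing shooting totals
toy <- tibble::tibble(
  Player  = c("A", "B", "C"),
  minutes = c(32, NA, 5),
  FGM     = c(7L, NA, 0L),
  FGA     = c(15L, NA, 2L)
)

cleaned <-
  toy |>
  dplyr::mutate(
    FGM = ifelse(is.na(minutes), 0, FGM),  # uses minutes while it is still NA
    FGA = ifelse(is.na(minutes), 0, FGA)) |>
  tidyr::replace_na(list(minutes = 0))     # only now fill in minutes

cleaned
```

If replace_na() were applied first, is.na(minutes) would be FALSE everywhere and the missing makes and attempts would stay NA.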

At this point, every row of reg_box corresponds to a player-game combination. We ultimately wish to sum up the number of makes and attempts of each shot type across an entire season for each player. Conceptually, we can accomplish this by first dividing the full data table into several smaller tables, one for each combination of player and season. Then, we can sum the number of field goals, three point shots, and free throws attempted and made by each player in each of their seasons. This is an example of the split-apply-combine strategy in which you “break up a big problem into manageable pieces, operate on each piece independently, and then put all the pieces back together” (Wickham 2011). This functionality is implemented using dplyr::group_by() together with dplyr::summarise(). To illustrate the result, we will focus on Dirk Nowitzki, whose seasons include 2006-07, when he won the league MVP award.

season_box <-
  reg_box |>
  dplyr::group_by(Player, season) |>
  dplyr::summarise(
    FGM = sum(FGM),
    FGA = sum(FGA),
    TPM = sum(TPM),
    TPA = sum(TPA),
    FTM = sum(FTM),
    FTA = sum(FTA),
    minutes = sum(minutes),
    n_games = dplyr::n(),
    .groups = "drop")
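To make the split-apply-combine idea concrete, the sketch below performs the three steps by hand with base R and compares the result to a grouped summary. The table toy and its values are made up for illustration.

```r
# Made-up player-game rows: player A has two games in 2007, player B has two in 2007 and one in 2008
toy <- tibble::tibble(
  Player = c("A", "A", "B", "B", "B"),
  season = c(2007L, 2007L, 2007L, 2007L, 2008L),
  FGM    = c(5, 7, 2, 4, 9)
)

# Split: one vector of FGM values per observed player-season
pieces <- split(toy$FGM, list(toy$Player, toy$season), drop = TRUE)
# Apply and combine: sum each piece and collect the results in a named vector
manual <- sapply(pieces, sum)

# The same computation expressed with group_by() and summarise()
grouped <-
  toy |>
  dplyr::group_by(Player, season) |>
  dplyr::summarise(FGM = sum(FGM), n_games = dplyr::n(), .groups = "drop")

manual
grouped
```

Both approaches give one total per player-season; the grouped version also keeps the grouping variables as columns, which is more convenient for the later steps.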

The data table season_box contains 11920 rows, each of which corresponds to a single player-season combination. Here is a quick snapshot of some of the data for Dirk Nowitzki.

season_box |>
  dplyr::filter(Player == "Dirk Nowitzki") |>
  dplyr::select(season, FGM, FGA, TPM, TPA, FTM, FTA)

# A tibble: 18 × 7
   season   FGM   FGA   TPM   TPA   FTM   FTA
    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1   2002   600  1258   139   350   440   516
 2   2003   690  1489   148   390   483   548
 3   2004   605  1310    99   290   371   423
 4   2005   663  1445    91   228   615   708
 5   2006   751  1564   110   271   539   598
 6   2007   673  1341    72   173   498   551
 7   2008   630  1314    79   220   478   544
 8   2009   774  1616    61   170   485   545
 9   2010   720  1496    51   121   536   586
10   2011   610  1179    66   168   395   443
11   2012   473  1034    78   212   318   355
12   2013   332   707    63   151   164   191
13   2014   633  1273   131   329   338   376
14   2015   487  1062   104   274   255   289
15   2016   498  1112   126   342   250   280
16   2017   296   678    79   209    98   112
17   2018   346   758   138   337    97   108
18   2019   135   376    64   205    39    50

From Totals to Percentages

In order to determine which player-season was the best in terms of shooting, we need to first define “best”. Perhaps the simplest definition is to find the player-season with the most made shots. We can identify this player-season by sorting the data in season_box by FGM in descending order with the dplyr::arrange() function.

season_box |>
  dplyr::arrange(dplyr::desc(FGM))

# A tibble: 11,920 × 10
   Player             season   FGM   FGA   TPM   TPA   FTM   FTA minutes n_games
   <chr>               <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>   <int>
 1 Kobe Bryant          2006   949  2109   179   506   675   788    3184      78
 2 LeBron James         2006   875  1823   127   379   601   814    3361      82
 3 Kobe Bryant          2003   868  1924   124   324   601   713    3401      82
 4 Shai Gilgeous-Ale…   2025   868  1680   165   444   604   673    2633      77
 5 LeBron James         2018   857  1580   149   406   388   531    3024      82
 6 Dwyane Wade          2009   854  1739    88   278   590   771    3048      82
 7 Kevin Durant         2014   849  1688   192   491   703   805    3118      81
 8 James Harden         2019   843  1909   378  1028   754   858    2870      78
 9 Giannis Antetokou…   2024   837  1369    34   124   514   782    2573      73
10 Tracy McGrady        2003   829  1813   173   448   576   726    2954      75
# ℹ 11,910 more rows

When we look at the ten “best” shooting seasons, we immediately recognize a lot of superstar players! On this basis, we might be satisfied evaluating shooting performances based only on the total number of made shots. But taking a closer look, should we really consider Kobe Bryant’s 2002-03 and Shai Gilgeous-Alexander’s 2024-25 seasons to be equally impressive when Kobe attempted 244 more shots than Shai (1924 versus 1680) in order to make the same 868 shots? Arguably, Shai’s 2024-25 season should rank higher than Kobe’s 2002-03 season because Shai was more efficient.
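The efficiency gap can be quantified with a quick calculation, using the makes and attempts for these two player-seasons from the sorted table above:

```r
# Totals copied from the sorted season_box output (rows 3 and 4)
makes    <- c(Kobe_2003 = 868, Shai_2025 = 868)
attempts <- c(Kobe_2003 = 1924, Shai_2025 = 1680)

# Fraction of attempts that went in
fgp <- makes / attempts
round(fgp, 3)
```

Kobe converted about 45.1% of his attempts and Shai about 51.7%, even though the two seasons tie on made field goals.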

This motivates us to refine our definition of “best” by focusing on the percentage of field goals made rather than total number of field goals made.

season_box <-
  season_box |>
  dplyr::mutate(FGP = ifelse(FGA > 0, FGM/FGA, NA_real_))
season_box |> 
  dplyr::arrange(dplyr::desc(FGP)) |>
  dplyr::select(Player, season, FGP)

1: For players who attempted no field goals (i.e., FGA = 0), their field goal percentage is undefined.

# A tibble: 11,920 × 3
   Player           season   FGP
   <chr>             <int> <dbl>
 1 Ahmad Caver        2022     1
 2 Alondes Williams   2025     1
 3 Andris Biedrins    2014     1
 4 Anthony Brown      2018     1
 5 Braxton Key        2023     1
 6 Chris Silva        2023     1
 7 Dajuan Wagner      2007     1
 8 DeAndre Liggins    2014     1
 9 Donnell Harvey     2005     1
10 Eddy Curry         2009     1
# ℹ 11,910 more rows

Sorting the players by their \(\textrm{FGP},\) we find that several players made 100% of their field goals. But very few of these players are immediately recognizable — and, indeed, none of them have been in the MVP conversation, despite the fact that they made all their shots!

To understand what’s going on, let’s take a look at the number of attempts.

season_box |> 
  dplyr::arrange(dplyr::desc(FGP)) |>
  dplyr::select(Player, season, FGP, FGA)

# A tibble: 11,920 × 4
   Player           season   FGP   FGA
   <chr>             <int> <dbl> <dbl>
 1 Ahmad Caver        2022     1     1
 2 Alondes Williams   2025     1     2
 3 Andris Biedrins    2014     1     1
 4 Anthony Brown      2018     1     1
 5 Braxton Key        2023     1     1
 6 Chris Silva        2023     1     1
 7 Dajuan Wagner      2007     1     1
 8 DeAndre Liggins    2014     1     1
 9 Donnell Harvey     2005     1     2
10 Eddy Curry         2009     1     2
# ℹ 11,910 more rows

Given the very low number of shots attempted in each of these player-seasons, claiming that any of them is among the best ever would strain credulity! So, in order to determine the best shooting performance, we will need to restrict our data to players who took a minimum number of shots. For simplicity, let’s focus our attention on those players who attempted at least 400 field goals in a season (i.e., roughly 5 shots per game over an 82-game season).
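To see why a perfect percentage on one or two attempts says little, consider a rough sketch that treats every shot as an independent make with a fixed probability. Both the independence assumption and the 45% figure are illustrative assumptions, not estimates from the data.

```r
p_make <- 0.45  # hypothetical make probability, chosen only for illustration

# Probability of making every shot in n attempts, assuming independent shots
p_perfect_2   <- dbinom(2, size = 2, prob = p_make)
p_perfect_400 <- dbinom(400, size = 400, prob = p_make)

c(two_attempts = p_perfect_2, four_hundred_attempts = p_perfect_400)
```

Under these assumptions, going 2-for-2 happens about one time in five, while a perfect season on 400 attempts is essentially impossible. A minimum-attempts threshold keeps such small-sample flukes out of the ranking.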

season_box |>
  dplyr::filter(FGA >= 400) |>
  dplyr::arrange(dplyr::desc(FGP)) |>
  dplyr::select(Player, season, FGP,FGA)

# A tibble: 4,987 × 4
   Player         season   FGP   FGA
   <chr>           <int> <dbl> <dbl>
 1 Daniel Gafford   2024 0.725   480
 2 Walker Kessler   2023 0.720   414
 3 DeAndre Jordan   2017 0.714   577
 4 Rudy Gobert      2022 0.713   508
 5 DeAndre Jordan   2015 0.710   534
 6 Jarrett Allen    2025 0.706   640
 7 Nic Claxton      2023 0.705   587
 8 DeAndre Jordan   2016 0.703   508
 9 Daniel Gafford   2025 0.702   403
10 Daniel Gafford   2022 0.693   411
# ℹ 4,977 more rows

Do we really believe that these performances, all of which were made by centers who mostly shoot at or near the rim, represent some of the best shooting performances of all time?

Effective Field Goal Percentage

A major limitation of FGP is that it treats 2-point shots the same as 3-point shots. As a result, the league-leader in FGP every season is usually a center whose shots mostly come from near the rim. Effective Field Goal Percentage (eFGP) adjusts FGP to account for the fact that a made 3-point shot is worth 50% more than a made 2-point shot. The formula for eFGP is \[ \textrm{eFGP} = \frac{\textrm{FGM} + 0.5 \times \textrm{TPM}}{\textrm{FGA}} \]
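As a worked illustration, the formula can be rewritten in terms of points from field goals: since FGM counts both two-point and three-point makes, there are FGM − TPM made two-pointers, so eFGP equals field-goal points divided by twice the attempts. The numerical example plugs in Dirk Nowitzki's 2007 totals from the season_box snapshot shown earlier (FGM = 673, FGA = 1341, TPM = 72).

```latex
\[
\textrm{eFGP}
  = \frac{\textrm{FGM} + 0.5 \times \textrm{TPM}}{\textrm{FGA}}
  = \frac{2 \times (\textrm{FGM} - \textrm{TPM}) + 3 \times \textrm{TPM}}{2 \times \textrm{FGA}}
\]
\[
\textrm{eFGP} = \frac{673 + 0.5 \times 72}{1341} = \frac{709}{1341} \approx 0.529,
\qquad
\textrm{FGP} = \frac{673}{1341} \approx 0.502
\]
```

The 72 made three-pointers lift his effective percentage almost three points above his raw field goal percentage.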

season_box <-
  season_box |>
  dplyr::mutate(
    TPP = ifelse(TPA > 0, TPM/TPA,NA_real_),
    eFGP = (FGM + 0.5 * TPM)/FGA) 
season_box |>
  dplyr::filter(FGA >= 400) |>
  dplyr::arrange(dplyr::desc(eFGP), dplyr::desc(FGP)) |>
  dplyr::select(Player, season, eFGP, FGP, TPP, TPA, n_games)

# A tibble: 4,987 × 7
   Player         season  eFGP   FGP    TPP   TPA n_games
   <chr>           <int> <dbl> <dbl>  <dbl> <dbl>   <int>
 1 Daniel Gafford   2024 0.725 0.725 NA         0      74
 2 Walker Kessler   2023 0.721 0.720  0.333     3      74
 3 DeAndre Jordan   2017 0.714 0.714  0         2      81
 4 Rudy Gobert      2022 0.713 0.713  0         4      66
 5 DeAndre Jordan   2015 0.711 0.710  0.25      4      82
 6 Jarrett Allen    2025 0.706 0.706  0         5      82
 7 Nic Claxton      2023 0.705 0.705  0         2      76
 8 DeAndre Jordan   2016 0.703 0.703  0         1      77
 9 Daniel Gafford   2025 0.702 0.702 NA         0      57
10 Daniel Gafford   2022 0.693 0.693  0         1      72
# ℹ 4,977 more rows

We see again that some of the best seasons, according to eFGP, were from centers, many of whom attempt very few three point shots. When we keep only players who attempted at least 100 three point shots, we start to see other positions in the top-10.

season_box |>
  dplyr::filter(FGA >= 400 & TPA >= 100) |>
  dplyr::arrange(dplyr::desc(eFGP), dplyr::desc(FGP)) |>
  dplyr::select(Player, season, eFGP, FGP, TPP, TPA, n_games)

# A tibble: 3,658 × 7
   Player             season  eFGP   FGP   TPP   TPA n_games
   <chr>               <int> <dbl> <dbl> <dbl> <dbl>   <int>
 1 Kyle Korver          2015 0.671 0.487 0.492   449      75
 2 Duncan Robinson      2020 0.667 0.470 0.446   606      73
 3 Obi Toppin           2024 0.660 0.571 0.404   260      83
 4 Nikola Jokic         2023 0.660 0.632 0.383   149      69
 5 Joe Harris           2021 0.655 0.502 0.478   427      69
 6 Joe Ingles           2021 0.652 0.489 0.451   406      67
 7 Grayson Allen        2024 0.649 0.499 0.461   445      75
 8 Michael Porter Jr.   2021 0.646 0.541 0.447   374      61
 9 Mikal Bridges        2021 0.643 0.543 0.425   315      72
10 Al Horford           2024 0.640 0.511 0.419   258      65
# ℹ 3,648 more rows

True Shooting Percentage

Both FGP and eFGP totally ignore free throws. Intuitively, we should expect the best shooter to be proficient at making two- and three-point shots as well as their free throws. One metric that accounts for field goals, three pointers, and free throws together is true shooting percentage (\(\textrm{TSP}\)), whose formula is given by \[ \textrm{TSP} = \frac{\textrm{PTS}}{2 \times \left(\textrm{FGA} + (0.44 \times \textrm{FTA})\right)}, \] where \(\textrm{PTS} = \textrm{FTM} + 2 \times \textrm{FGM} + \textrm{TPM}\) is the total number of points scored.
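As a quick check of the formula, the sketch below evaluates it by hand for a single player-season, using Dirk Nowitzki's 2007 totals from the season_box snapshot shown earlier. The variable names are chosen for this illustration only.

```r
# Dirk Nowitzki's 2007 regular-season totals, copied from the season_box snapshot
FGM <- 673; FGA <- 1341; TPM <- 72; FTM <- 498; FTA <- 551

# Every made field goal is worth 2 points, with 1 extra point per made three
PTS <- FTM + 2 * FGM + TPM
TSP <- PTS / (2 * (FGA + 0.44 * FTA))
FTP <- FTM / FTA  # free throw percentage, for context

round(c(PTS = PTS, TSP = TSP, FTP = FTP), 3)
```

With 1916 points, the formula gives a TSP of about 0.605, helped by a free throw percentage of roughly 0.904.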

season_box <-
  season_box |>
  dplyr::mutate(
    PTS = FTM + 2 * FGM + TPM,
    TSP = PTS/(2 * (FGA + 0.44 * FTA)))
season_box |>
  dplyr::filter(FGA >= 400 & TPA >= 100) |>
  dplyr::arrange(dplyr::desc(TSP), dplyr::desc(eFGP), dplyr::desc(FGP)) |>
  dplyr::select(Player, season, TSP, eFGP, FGP, TPP, n_games)

# A tibble: 3,658 × 7
   Player          season   TSP  eFGP   FGP   TPP n_games
   <chr>            <int> <dbl> <dbl> <dbl> <dbl>   <int>
 1 Nikola Jokic      2023 0.701 0.660 0.632 0.383      69
 2 Kyle Korver       2015 0.699 0.671 0.487 0.492      75
 3 Austin Reaves     2023 0.687 0.616 0.529 0.398      64
 4 Duncan Robinson   2020 0.684 0.667 0.470 0.446      73
 5 Dwight Powell     2019 0.682 0.637 0.597 0.307      77
 6 Grayson Allen     2024 0.679 0.649 0.499 0.461      75
 7 Kevin Durant      2023 0.677 0.614 0.560 0.404      47
 8 Moritz Wagner     2024 0.676 0.636 0.601 0.330      80
 9 Stephen Curry     2018 0.675 0.618 0.495 0.423      51
10 Obi Toppin        2024 0.675 0.660 0.571 0.404      83
# ℹ 3,648 more rows

References

Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. https://doi.org/10.18637/jss.v040.i01.