Lecture 0: Boxscore Metrics
Motivation: The Best Shooting Season in the NBA?
Who is the best shooter in the NBA? How do we determine this using data?
In this note, we will practice using functions from the tidyverse suite of packages (especially dplyr) to manipulate tables of NBA box score data. Hopefully, much of the functionality we encounter in this lecture will be familiar to you. But, if you need a high-level refresher, I highly recommend the following resources:
Basic Box Score Statistics
We will use the package hoopR to scrape NBA boxscore data. You should install the package using the code
install.packages("hoopR")
The function hoopR::load_nba_player_box() loads player box-score data for the requested seasons:
raw_box <-
  hoopR::load_nba_player_box(seasons = 2002:(hoopR::most_recent_nba_season()))
The data table raw_box contains 813246 rows and 57 columns. Checking the column names, we see that there are columns for the numbers of field goals, three-point shots, and free throws made and attempted.
colnames(raw_box)
 [1] "game_id" "season"
[3] "season_type" "game_date"
[5] "game_date_time" "athlete_id"
[7] "athlete_display_name" "team_id"
[9] "team_name" "team_location"
[11] "team_short_display_name" "minutes"
[13] "field_goals_made" "field_goals_attempted"
[15] "three_point_field_goals_made" "three_point_field_goals_attempted"
[17] "free_throws_made" "free_throws_attempted"
[19] "offensive_rebounds" "defensive_rebounds"
[21] "rebounds" "assists"
[23] "steals" "blocks"
[25] "turnovers" "fouls"
[27] "plus_minus" "points"
[29] "starter" "ejected"
[31] "did_not_play" "active"
[33] "athlete_jersey" "athlete_short_name"
[35] "athlete_headshot_href" "athlete_position_name"
[37] "athlete_position_abbreviation" "team_display_name"
[39] "team_uid" "team_slug"
[41] "team_logo" "team_abbreviation"
[43] "team_color" "team_alternate_color"
[45] "home_away" "team_winner"
[47] "team_score" "opponent_team_id"
[49] "opponent_team_name" "opponent_team_location"
[51] "opponent_team_display_name" "opponent_team_abbreviation"
[53] "opponent_team_logo" "opponent_team_color"
[55] "opponent_team_alternate_color" "opponent_team_score"
[57] "reason"
Notice as well that there are columns for the game date (game_date), game id (game_id), and player (e.g., athlete_display_name). This suggests that each row corresponds to a unique combination of game and player and records the player's individual statistics in that game.
For instance, here are the box score statistics for several players from a single game in 2011.
raw_box |>
  dplyr::filter(game_date == "2011-06-12") |>
  dplyr::select(athlete_display_name,
                field_goals_made, field_goals_attempted,
                three_point_field_goals_made, three_point_field_goals_attempted,
                free_throws_made, free_throws_attempted)
── ESPN NBA Player Boxscores from hoopR data repository ───────── hoopR 2.1.0 ──
ℹ Data updated: 2025-07-31 06:39:25 CDT
# A tibble: 30 × 7
athlete_display_name field_goals_made field_goals_attempted
<chr> <int> <int>
1 Dirk Nowitzki 9 27
2 Tyson Chandler 2 4
3 Jason Kidd 2 4
4 Shawn Marion 4 10
5 J.J. Barea 7 12
6 Brian Cardinal 1 1
7 Caron Butler NA NA
8 Ian Mahinmi 2 3
9 Rodrigue Beaubois NA NA
10 DeShawn Stevenson 3 5
# ℹ 20 more rows
# ℹ 4 more variables: three_point_field_goals_made <int>,
# three_point_field_goals_attempted <int>, free_throws_made <int>,
# free_throws_attempted <int>
As a sanity check, we can cross-reference the data in our table with the box score from ESPN. Luckily, these numbers match up!
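The claim that each row is a distinct player-game combination can also be checked directly. The sketch below uses a tiny made-up table (toy_box, with hypothetical ids) in place of raw_box, and assumes that game_id and athlete_id together identify a row.

```r
# Toy stand-in for raw_box (hypothetical ids and values, not real data)
toy_box <- data.frame(
  game_id    = c(1L, 1L, 2L, 2L),
  athlete_id = c(10L, 11L, 10L, 11L),
  points     = c(21L, 8L, 30L, 12L)
)

# Count rows per (game, player) pair and keep any pair that appears more than once
dupes <-
  toy_box |>
  dplyr::count(game_id, athlete_id) |>
  dplyr::filter(n > 1)

nrow(dupes)  # 0 means every row is a distinct player-game combination
```

Running the same two dplyr steps on raw_box should, under that assumption, also return zero rows.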
It turns out that raw_box contains much more data than we need. Specifically, it includes statistics from play-in and play-off games as well as data from some (but not all) All-Star games. Since we’re ultimately interested in identifying the best player-seasons in terms of shooting performance, we need to remove all play-off, play-in, and All-Star games from the dataset. Additionally, the column did_not_play contains a Boolean (i.e., logical) variable that is TRUE if the player did not play in the game and FALSE if the player did play, so we also drop the rows where it is TRUE.
allstar_dates <-
  lubridate::date(c("2002-02-10", "2003-02-09", "2004-02-15",
                    "2005-02-20", "2006-02-19", "2007-02-18",
                    "2008-02-17", "2009-02-15", "2010-02-14",
                    "2011-02-20", "2012-02-26", "2013-02-17",
                    "2014-02-16", "2015-02-15", "2016-02-14",
                    "2017-02-19", "2018-02-18", "2019-02-17",
                    "2020-02-16", "2021-03-07", "2022-02-20",
                    "2023-02-19", "2024-02-18", "2025-02-16"))
reg_box <-
  raw_box |>
  dplyr::filter(
    season_type == 2 &
      !did_not_play &
      !game_date %in% allstar_dates)
Looking at the data table reg_box, we see that in about 9% of rows, the number of minutes played is missing. These likely correspond to players who were active but did not play or logged only a few seconds (generally at the end of games). We will replace these NA values with 0’s and, while doing so, rename some of the columns in reg_box.
reg_box <-
  reg_box |>
  # 1. Rename several variables
  dplyr::rename(
    Player = athlete_display_name,
    FGM = field_goals_made,
    FGA = field_goals_attempted,
    TPM = three_point_field_goals_made,
    TPA = three_point_field_goals_attempted,
    FTM = free_throws_made,
    FTA = free_throws_attempted) |>
  # 2. For those rows where minutes is NA, set the numbers of makes and attempts to 0
  dplyr::mutate(
    FGM = ifelse(is.na(minutes), 0, FGM),
    FGA = ifelse(is.na(minutes), 0, FGA),
    TPM = ifelse(is.na(minutes), 0, TPM),
    TPA = ifelse(is.na(minutes), 0, TPA),
    FTM = ifelse(is.na(minutes), 0, FTM),
    FTA = ifelse(is.na(minutes), 0, FTA)) |>
  # 3. Replace missing minutes values with 0
  tidyr::replace_na(list(minutes = 0))
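The six nearly identical ifelse() calls can be written more compactly with dplyr::across(). The following is only a sketch of that alternative on a made-up three-row table (toy_reg is hypothetical); it is not the code used in the rest of this note, and it swaps tidyr::replace_na() for dplyr::coalesce() to stay within dplyr.

```r
# Toy stand-in for reg_box after renaming (hypothetical values)
toy_reg <- data.frame(
  Player  = c("A", "B", "C"),
  minutes = c(35, NA, 12),
  FGM     = c(9L, NA, 2L),
  FGA     = c(27L, NA, 4L)
)

cleaned <-
  toy_reg |>
  dplyr::mutate(
    # Apply the same rule to every shot column: 0 whenever minutes is missing
    dplyr::across(c(FGM, FGA), ~ ifelse(is.na(minutes), 0, .x)),
    # Only afterwards fill in the missing minutes themselves
    minutes = dplyr::coalesce(minutes, 0))

cleaned
```

The order inside mutate() matters here: the shot columns have to be fixed while minutes still contains its NA values.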
At this point, every row of reg_box corresponds to a player-game combination. We ultimately wish to sum up the number of makes and attempts of each shot type across an entire season for each player. Conceptually, we can accomplish this by first dividing the full data table into several smaller tables, one for each combination of player and season. Then, we can sum the number of field goals, three-point shots, and free throws attempted and made by each player in each of their seasons. This is an example of the split-apply-combine strategy in which you “break up a big problem into manageable pieces, operate on each piece independently, and then put all the pieces back together” (Wickham 2011). In dplyr, this strategy is implemented by combining dplyr::group_by() with dplyr::summarise(). To illustrate the result, we will look at Dirk Nowitzki, who won the league MVP award in the 2006-07 season.
season_box <-
  reg_box |>
  dplyr::group_by(Player, season) |>
  dplyr::summarise(
    FGM = sum(FGM),
    FGA = sum(FGA),
    TPM = sum(TPM),
    TPA = sum(TPA),
    FTM = sum(FTM),
    FTA = sum(FTA),
    minutes = sum(minutes),
    n_games = dplyr::n(),
    .groups = "drop")
The data table season_box contains 11920 rows, each of which corresponds to a single player-season combination. Here is a quick snapshot of some of the data for Dirk Nowitzki:
season_box |>
  dplyr::filter(Player == "Dirk Nowitzki") |>
  dplyr::select(season, FGM, FGA, TPM, TPA, FTM, FTA)
# A tibble: 18 × 7
season FGM FGA TPM TPA FTM FTA
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2002 600 1258 139 350 440 516
2 2003 690 1489 148 390 483 548
3 2004 605 1310 99 290 371 423
4 2005 663 1445 91 228 615 708
5 2006 751 1564 110 271 539 598
6 2007 673 1341 72 173 498 551
7 2008 630 1314 79 220 478 544
8 2009 774 1616 61 170 485 545
9 2010 720 1496 51 121 536 586
10 2011 610 1179 66 168 395 443
11 2012 473 1034 78 212 318 355
12 2013 332 707 63 151 164 191
13 2014 633 1273 131 329 338 376
14 2015 487 1062 104 274 255 289
15 2016 498 1112 126 342 250 280
16 2017 296 678 79 209 98 112
17 2018 346 758 138 337 97 108
18 2019 135 376 64 205 39 50
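To see what the split-apply-combine steps are doing, here is a minimal sketch on a made-up table (toy, with hypothetical players "A" and "B"). It first spells out the three steps in base R and then repeats them with dplyr::group_by() and dplyr::summarise(); the two should agree.

```r
# Toy player-game table (hypothetical values)
toy <- data.frame(
  Player = c("A", "A", "A", "B", "B"),
  season = c(2007L, 2007L, 2008L, 2007L, 2007L),
  FGM    = c(5, 7, 4, 2, 3),
  FGA    = c(10, 15, 9, 6, 5)
)

# Split: one small table per (Player, season) pair; drop empty combinations
pieces <- split(toy, list(toy$Player, toy$season), drop = TRUE)

# Apply: total the makes and attempts within each piece
totals <- lapply(pieces, function(d) {
  data.frame(Player = d$Player[1], season = d$season[1],
             FGM = sum(d$FGM), FGA = sum(d$FGA))
})

# Combine: stack the per-piece results back into one table
by_hand <- do.call(rbind, totals)

# The same computation with dplyr
with_dplyr <-
  toy |>
  dplyr::group_by(Player, season) |>
  dplyr::summarise(FGM = sum(FGM), FGA = sum(FGA),
                   n_games = dplyr::n(), .groups = "drop")

with_dplyr
```

Both versions produce one row per player-season; the dplyr version is shorter and also returns the rows sorted by the grouping variables.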
From Totals to Percentages
In order to determine which player-season was the best in terms of shooting, we need to first define “best”. Perhaps the simplest definition is to find the player-season with the most made shots. We can identify this player-season by sorting the data in season_box by FGM in descending order with the dplyr::arrange() function.
season_box |>
  dplyr::arrange(dplyr::desc(FGM))
# A tibble: 11,920 × 10
Player season FGM FGA TPM TPA FTM FTA minutes n_games
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Kobe Bryant 2006 949 2109 179 506 675 788 3184 78
2 LeBron James 2006 875 1823 127 379 601 814 3361 82
3 Kobe Bryant 2003 868 1924 124 324 601 713 3401 82
4 Shai Gilgeous-Ale… 2025 868 1680 165 444 604 673 2633 77
5 LeBron James 2018 857 1580 149 406 388 531 3024 82
6 Dwyane Wade 2009 854 1739 88 278 590 771 3048 82
7 Kevin Durant 2014 849 1688 192 491 703 805 3118 81
8 James Harden 2019 843 1909 378 1028 754 858 2870 78
9 Giannis Antetokou… 2024 837 1369 34 124 514 782 2573 73
10 Tracy McGrady 2003 829 1813 173 448 576 726 2954 75
# ℹ 11,910 more rows
When we look at the ten “best” shooting seasons, we immediately recognize a lot of superstar players! On this basis, we might be satisfied evaluating shooting performances based only on the total number of made shots. But taking a closer look, should we really consider Kobe Bryant’s 2002-03 and Shai Gilgeous-Alexander’s 2024-25 seasons to be equally impressive when Kobe attempted 244 more shots than Shai in order to make 868 shots? Arguably, Shai’s 2024-25 season should rank higher than Kobe’s 2002-03 season because Shai was more efficient.
This motivates us to refine our definition of “best” by focusing on the percentage of field goals made rather than total number of field goals made.
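To make the Kobe and Shai comparison concrete, the short calculation below plugs in the totals from the table above (868 makes each, on 1924 and 1680 attempts). The variable names are only for illustration.

```r
# Totals copied from the season_box output above
fgm      <- 868    # field goals made by both players
kobe_fga <- 1924   # Kobe Bryant, season 2003
shai_fga <- 1680   # Shai Gilgeous-Alexander, season 2025

kobe_fga - shai_fga        # 244 extra attempts for the same number of makes

round(fgm / kobe_fga, 3)   # 0.451
round(fgm / shai_fga, 3)   # 0.517
```

The same number of makes corresponds to a gap of more than six percentage points once attempts are taken into account.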
season_box <-
  season_box |>
  # For players who attempted no field goals (i.e., FGA = 0), their field goal percentage is undefined.
  dplyr::mutate(FGP = ifelse(FGA > 0, FGM/FGA, NA_real_))
season_box |>
  dplyr::arrange(dplyr::desc(FGP)) |>
  dplyr::select(Player, season, FGP)
# A tibble: 11,920 × 3
Player season FGP
<chr> <int> <dbl>
1 Ahmad Caver 2022 1
2 Alondes Williams 2025 1
3 Andris Biedrins 2014 1
4 Anthony Brown 2018 1
5 Braxton Key 2023 1
6 Chris Silva 2023 1
7 Dajuan Wagner 2007 1
8 DeAndre Liggins 2014 1
9 Donnell Harvey 2005 1
10 Eddy Curry 2009 1
# ℹ 11,910 more rows
Sorting the players by their \(\textrm{FGP},\) we find that several players made 100% of their field goals. But very few of these players are immediately recognizable — and, indeed, none of them have been in the MVP conversation, despite the fact that they made all their shots!
To understand what’s going on, let’s take a look at the number of attempts.
season_box |>
  dplyr::arrange(dplyr::desc(FGP)) |>
  dplyr::select(Player, season, FGP, FGA)
# A tibble: 11,920 × 4
Player season FGP FGA
<chr> <int> <dbl> <dbl>
1 Ahmad Caver 2022 1 1
2 Alondes Williams 2025 1 2
3 Andris Biedrins 2014 1 1
4 Anthony Brown 2018 1 1
5 Braxton Key 2023 1 1
6 Chris Silva 2023 1 1
7 Dajuan Wagner 2007 1 1
8 DeAndre Liggins 2014 1 1
9 Donnell Harvey 2005 1 2
10 Eddy Curry 2009 1 2
# ℹ 11,910 more rows
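A rough calculation shows why a perfect percentage on one or two attempts says very little. The sketch below assumes a hypothetical shooter who makes each shot independently with probability 0.45 (an illustrative value, not an estimate from the data) and computes the chance of making every attempt.

```r
# Hypothetical shooter: each attempt is made independently with probability 0.45
p_make   <- 0.45
attempts <- c(1, 2, 10, 400)

# Probability of making every single attempt
p_perfect <- p_make^attempts

data.frame(attempts, p_perfect)
```

Under this assumption, a perfect 1-for-1 happens 45% of the time and a perfect 2-for-2 about 20% of the time, while a perfect season on 400 attempts is essentially impossible.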
Given the very low number of shots attempted in each of these player-seasons, claiming that any of them is among the best ever would strain credulity! So, in order to determine the best shooting performance, we will need to restrict our data to players who took a minimum number of shots. For simplicity, let’s focus our attention on those players who attempted at least 400 field goals in a season (i.e., roughly 5 shots per game over an 82-game season).
season_box |>
  dplyr::filter(FGA >= 400) |>
  dplyr::arrange(dplyr::desc(FGP)) |>
  dplyr::select(Player, season, FGP, FGA)
# A tibble: 4,987 × 4
Player season FGP FGA
<chr> <int> <dbl> <dbl>
1 Daniel Gafford 2024 0.725 480
2 Walker Kessler 2023 0.720 414
3 DeAndre Jordan 2017 0.714 577
4 Rudy Gobert 2022 0.713 508
5 DeAndre Jordan 2015 0.710 534
6 Jarrett Allen 2025 0.706 640
7 Nic Claxton 2023 0.705 587
8 DeAndre Jordan 2016 0.703 508
9 Daniel Gafford 2025 0.702 403
10 Daniel Gafford 2022 0.693 411
# ℹ 4,977 more rows
Do we really believe that these performances, all of which were made by centers who mostly shoot at or near the rim, represent some of the best shooting performances of all time?
Effective Field Goal Percentage
A major limitation of FGP is that it treats 2-point shots the same as 3-point shots. As a result, the league leader in FGP every season is usually a center whose shots mostly come from near the rim. Effective Field Goal Percentage (eFGP) adjusts FGP to account for the fact that a made 3-point shot is worth 50% more than a made 2-point shot. The formula for eFGP is \[ \textrm{eFGP} = \frac{\textrm{FGM} + 0.5 \times \textrm{TPM}}{\textrm{FGA}} \]
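As a worked example of the formula, the sketch below uses James Harden's totals from the row labeled season 2019 in the sorted season_box output shown earlier (843 makes on 1909 attempts, 378 of them three-pointers). It also checks an equivalent way of reading eFGP: points scored on field goals divided by twice the number of attempts.

```r
# James Harden, season 2019, copied from the season_box output earlier in the note
FGM <- 843
FGA <- 1909
TPM <- 378

FGP  <- FGM / FGA
eFGP <- (FGM + 0.5 * TPM) / FGA

# Equivalent form: field-goal points (2 per make, plus 1 extra per made three)
# divided by the points available if every attempt were a made two
eFGP_alt <- (2 * FGM + TPM) / (2 * FGA)

round(FGP, 3)               # 0.442
round(eFGP, 3)              # 0.541
all.equal(eFGP, eFGP_alt)   # TRUE
```

For a high-volume three-point shooter, the adjustment is large: roughly ten percentage points in this case.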
season_box <-
  season_box |>
  dplyr::mutate(
    TPP = ifelse(TPA > 0, TPM/TPA, NA_real_),
    eFGP = (FGM + 0.5 * TPM)/FGA)
season_box |>
  dplyr::filter(FGA >= 400) |>
  dplyr::arrange(dplyr::desc(eFGP), dplyr::desc(FGP)) |>
  dplyr::select(Player, season, eFGP, FGP, TPP, TPA, n_games)
# A tibble: 4,987 × 7
Player season eFGP FGP TPP TPA n_games
<chr> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 Daniel Gafford 2024 0.725 0.725 NA 0 74
2 Walker Kessler 2023 0.721 0.720 0.333 3 74
3 DeAndre Jordan 2017 0.714 0.714 0 2 81
4 Rudy Gobert 2022 0.713 0.713 0 4 66
5 DeAndre Jordan 2015 0.711 0.710 0.25 4 82
6 Jarrett Allen 2025 0.706 0.706 0 5 82
7 Nic Claxton 2023 0.705 0.705 0 2 76
8 DeAndre Jordan 2016 0.703 0.703 0 1 77
9 Daniel Gafford 2025 0.702 0.702 NA 0 57
10 Daniel Gafford 2022 0.693 0.693 0 1 72
# ℹ 4,977 more rows
We see again that some of the best seasons, according to eFGP, were from centers, many of whom attempt very few three-point shots. When we keep only players who took at least 100 three-point shots, we start to see other positions in the top 10.
season_box |>
  dplyr::filter(FGA >= 400 & TPA >= 100) |>
  dplyr::arrange(dplyr::desc(eFGP), dplyr::desc(FGP)) |>
  dplyr::select(Player, season, eFGP, FGP, TPP, TPA, n_games)
# A tibble: 3,658 × 7
Player season eFGP FGP TPP TPA n_games
<chr> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 Kyle Korver 2015 0.671 0.487 0.492 449 75
2 Duncan Robinson 2020 0.667 0.470 0.446 606 73
3 Obi Toppin 2024 0.660 0.571 0.404 260 83
4 Nikola Jokic 2023 0.660 0.632 0.383 149 69
5 Joe Harris 2021 0.655 0.502 0.478 427 69
6 Joe Ingles 2021 0.652 0.489 0.451 406 67
7 Grayson Allen 2024 0.649 0.499 0.461 445 75
8 Michael Porter Jr. 2021 0.646 0.541 0.447 374 61
9 Mikal Bridges 2021 0.643 0.543 0.425 315 72
10 Al Horford 2024 0.640 0.511 0.419 258 65
# ℹ 3,648 more rows
True Shooting Percentage
Both FGP and eFGP totally ignore free throws. Intuitively, we should expect the best shooter to be proficient at making two- and three-point shots as well as their free throws. One metric that accounts for two-point shots, three-point shots, and free throws is true shooting percentage (\(\textrm{TSP}\)), whose formula is given by \[ \textrm{TSP} = \frac{\textrm{PTS}}{2 \times \left(\textrm{FGA} + (0.44 \times \textrm{FTA})\right)}, \] where \(\textrm{PTS} = \textrm{FTM} + 2 \times \textrm{FGM} + \textrm{TPM}\) is the total number of points scored.
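As a worked example, the sketch below evaluates the formula for Kevin Durant's totals from the row labeled season 2014 in the sorted season_box output shown earlier. The 0.44 weight on free throw attempts is commonly described as a rough estimate of the share of free throws that use up a scoring chance; treat that reading as background rather than something derived in this note.

```r
# Kevin Durant, season 2014, copied from the season_box output earlier in the note
FGM <- 849; FGA <- 1688; TPM <- 192
FTM <- 703; FTA <- 805

# Each made field goal counts 2 points, each made three adds 1 more,
# and each made free throw adds 1
PTS <- FTM + 2 * FGM + TPM

# True shooting percentage
TSP <- PTS / (2 * (FGA + 0.44 * FTA))

PTS             # 2593
round(TSP, 3)   # 0.635
```

Note that PTS is rebuilt from the shooting totals rather than read from the points column, which is a quick consistency check on the made-shot counts.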
season_box <-
  season_box |>
  dplyr::mutate(
    PTS = FTM + 2 * FGM + TPM,
    TSP = PTS/(2 * (FGA + 0.44 * FTA)))
season_box |>
  dplyr::filter(FGA >= 400 & TPA >= 100) |>
  dplyr::arrange(dplyr::desc(TSP), dplyr::desc(eFGP), dplyr::desc(FGP)) |>
  dplyr::select(Player, season, TSP, eFGP, FGP, TPP, n_games)
# A tibble: 3,658 × 7
Player season TSP eFGP FGP TPP n_games
<chr> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 Nikola Jokic 2023 0.701 0.660 0.632 0.383 69
2 Kyle Korver 2015 0.699 0.671 0.487 0.492 75
3 Austin Reaves 2023 0.687 0.616 0.529 0.398 64
4 Duncan Robinson 2020 0.684 0.667 0.470 0.446 73
5 Dwight Powell 2019 0.682 0.637 0.597 0.307 77
6 Grayson Allen 2024 0.679 0.649 0.499 0.461 75
7 Kevin Durant 2023 0.677 0.614 0.560 0.404 47
8 Moritz Wagner 2024 0.676 0.636 0.601 0.330 80
9 Stephen Curry 2018 0.675 0.618 0.495 0.423 51
10 Obi Toppin 2024 0.675 0.660 0.571 0.404 83
# ℹ 3,648 more rows