Please spend a few minutes reading through the notes from Lecture 2. As in Problem Set 1, you should go through each code block with someone in your group and check that you can explain to each other what all of the code does.
In lecture, Professor Wyner discussed the relationship between a team’s payroll and its winning percentage. In particular, for each season, he computed the “relative payroll” of each team by taking its payroll and dividing it by the median of the payrolls of all teams in that season. We will replicate his analysis in the following problems using the dataset “mlb_relative_payrolls.csv”, which you can find in the “data/” folder of your working directory. You should save all of the code for this analysis in an R script called “ps2_mlb_payroll.R”.
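The relative payroll calculation described above can be sketched on a small made-up table. The tibble toy_payrolls and its values are hypothetical; only the column names follow the dataset. The sketch assumes you are comfortable with group_by(), which computes the median separately within each season.

```r
library(dplyr)

# Hypothetical payrolls for three teams over two seasons (illustration only)
toy_payrolls <- tibble(
  Team = c("A", "B", "C", "A", "B", "C"),
  Year = c(2001, 2001, 2001, 2002, 2002, 2002),
  Team_Payroll = c(50, 100, 150, 60, 80, 200)
)

# Divide each payroll by the median payroll of the same season
toy_payrolls <- toy_payrolls %>%
  group_by(Year) %>%
  mutate(Relative_Payroll = Team_Payroll / median(Team_Payroll)) %>%
  ungroup()

toy_payrolls
```

A team with a relative payroll of 1 spent exactly the median amount for its season, so values are comparable across years even as overall salaries grow.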
## Parsed with column specification:
## cols(
## Team = col_character(),
## GM = col_character(),
## Team_Payroll = col_double(),
## Winning_Percentage = col_double(),
## Year = col_double(),
## Relative_Payroll = col_double()
## )
Make a histogram of team winning percentages. Play around with different binwidths.
Make a histogram of the relative payrolls.
Make a scatterplot with relative payroll on the horizontal axis and winning percentage on the vertical axis.
Without executing the code below, discuss with your group and see if you can figure out what it is doing.
In this problem set, we will gain more experience using the dplyr verbs we learned in Module 3 to analyze batting statistics of MLB players with at least 502.2 plate appearances. All of the data is contained in the file “data/hitting_qualified.csv”. You should save all of the code for this analysis in an R script called “ps2_mlb_batting.R”.
Load the data into a tibble called hitting_qualified using read_csv().
## Parsed with column specification:
## cols(
## .default = col_double(),
## playerID = col_character(),
## teamID = col_character(),
## lgID = col_character(),
## CS = col_logical(),
## IBB = col_logical(),
## SF = col_logical(),
## GIDP = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 28057 parsing failures.
## row col expected actual file
## 1870 CS 1/0/T/F/TRUE/FALSE 23 'data/hitting_qualified.csv'
## 1871 CS 1/0/T/F/TRUE/FALSE 20 'data/hitting_qualified.csv'
## 1872 CS 1/0/T/F/TRUE/FALSE 13 'data/hitting_qualified.csv'
## 1877 CS 1/0/T/F/TRUE/FALSE 15 'data/hitting_qualified.csv'
## 1880 CS 1/0/T/F/TRUE/FALSE 13 'data/hitting_qualified.csv'
## .... ... .................. ...... ............................
## See problems(...) for more details.
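The parsing failures reported above are consistent with how read_csv() guesses column types: it inspects only the first rows of the file (1000 by default, set by the guess_max argument), so a column that is empty in the early seasons can be guessed as logical, and later numeric entries then fail to parse. One way to avoid this is to declare the types yourself with col_types. The sketch below shows the idea on a small made-up file; the file contents and the names toy_path and toy are hypothetical.

```r
library(readr)

# Write a tiny made-up CSV in which CS is missing in the early rows
toy_path <- tempfile(fileext = ".csv")
writeLines(c("playerID,yearID,CS",
             "aaa01,1884,NA",
             "bbb01,1885,NA",
             "ccc01,1920,23"),
           toy_path)

# Declare the column types explicitly instead of letting read_csv() guess
toy <- read_csv(toy_path,
                col_types = cols(playerID = col_character(),
                                 yearID = col_integer(),
                                 CS = col_integer()))

toy
```

Raising guess_max in the call to read_csv() is an alternative when you would rather keep automatic guessing but let it look at more rows.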
The columns of this dataset include:
playerID: the player’s ID code
yearID: Year
stint: the player’s stint (order of appearances within a season)
teamID: the player’s team
lgID: the player’s league
G: the number of Games the player played in that year
AB: number of At Bats of that player in that year
PA: number of plate appearances by the player that year
R: number of Runs the player made in that year
H: number of Hits the player had in that year
X2B: number of Doubles (hits on which the batter reached second base safely)
X3B: number of Triples (hits on which the batter reached third base safely)
HR: number of Homeruns the player made that year
RBI: number of Runs Batted In the player made that year
SB: number of Bases Stolen by the player in that year
CS: number of times a player was Caught Stealing that year
BB: Base on Balls
SO: number of Strikeouts the player had that year
IBB: Intentional walks
HBP: Hit by pitch
SH: Sacrifice hits
SF: Sacrifice flies
GIDP: Grounded into double plays
Use arrange() to find out the first and last season for which we have data. Hint: you may need to use desc() as well.
Use summarize() to find out the first and last season for which we have data. Hint: you only need one line of code to do this.
When you print out hitting_qualified, you’ll notice that some columns were read in as characters and not integers or numerics. This can happen when the original csv file has missing values. In this case, the columns IBB, HBP, SH, SF, and GIDP were read in as characters. We want to convert these to integers. We can do this using mutate() and the function as.integer().
hitting_qualified <- mutate(hitting_qualified,
IBB = as.integer(IBB),
HBP = as.integer(HBP),
SH = as.integer(SH),
SF = as.integer(SF),
GIDP = as.integer(GIDP))
## # A tibble: 12,043 x 8
## playerID yearID AB IBB HBP SH SF GIDP
## <chr> <dbl> <dbl> <int> <int> <int> <int> <int>
## 1 ansonca01 1884 475 NA NA NA NA NA
## 2 bradyst01 1884 485 NA 0 NA NA NA
## 3 connoro01 1884 477 NA NA NA NA NA
## 4 dalryab01 1884 521 NA NA NA NA NA
## 5 farreja02 1884 469 NA NA NA NA NA
## 6 gleasbi01 1884 472 NA 12 NA NA NA
## 7 hinespa01 1884 490 NA NA NA NA NA
## 8 hornujo01 1884 518 NA NA NA NA NA
## 9 jonesch01 1884 472 NA 10 NA NA NA
## 10 nelsoca01 1884 432 NA 9 NA NA NA
## # … with 12,033 more rows
You’ll notice that a lot of these columns contain NA values, which indicates that some of these values are missing. This makes sense, since a lot of these statistics were not recorded in the early years of baseball. A popular convention for dealing with these missing statistics is to impute the missing values with 0. That is, every place we see an NA, we replace it with a 0. We can do that with the replace_na() function as follows.
hitting_qualified <- replace_na(hitting_qualified,
list(IBB = 0, HBP = 0, SH = 0, SF = 0, GIDP = 0))
We will discuss the syntax for replace_na() later in lecture.
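In the meantime, here is a minimal sketch of how replace_na() behaves on a small made-up tibble. The tibble toy_stats and its values are hypothetical. The second argument is a named list giving, for each column, the value that should stand in for NA; columns not named in the list are left alone.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with missing values in two integer columns
toy_stats <- tibble(
  playerID = c("a01", "b01", "c01"),
  IBB = c(NA, 3L, NA),
  HBP = c(2L, NA, 5L)
)

# Replace NA in IBB only; HBP is not named in the list, so its NA stays
toy_filled <- replace_na(toy_stats, list(IBB = 0L))

toy_filled
```

The replacement here is written as 0L so that it matches the integer type of the column.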
Use mutate() to add a column for the number of singles, which can be computed as \(\text{X1B} = \text{H} - \text{X2B} - \text{X3B} - \text{HR}\).
The variable BB includes as a subset all intentional walks (IBB). Use mutate() to add a column to hitting_qualified that counts the number of un-intentional walks (uBB). Be sure to save the resulting tibble as hitting_qualified.
Use mutate() to add columns for the following offensive statistics, whose formulae are given below. We have also included links to pages on Fangraphs that define and discuss each of these statistics.
Walk Percentage (BBP): \[\text{BBP} = \frac{\text{BB}}{\text{PA}}\]
Strike-out Percentage (KP): \[\text{KP} = \frac{\text{SO}}{\text{PA}}\]
On-Base Percentage (OBP): \[\text{OBP} = \frac{\text{H} + \text{BB} + \text{HBP}}{\text{AB} + \text{BB} + \text{HBP} + \text{SF}}\]
Slugging (SLG): \[ \text{SLG} = \frac{\text{X1B} + 2 \times \text{X2B} + 3\times \text{X3B} + 4\times \text{HR}}{\text{AB}} \]
On-Base Plus Slugging (OPS): \[\text{OPS} = \text{OBP} + \text{SLG}\]
weighted On-Base Average (wOBA): We will use the 2013 weights: \[ \text{wOBA} = \frac{0.687 \times \text{uBB} + 0.718 \times \text{HBP} + 0.881 \times \text{X1B} + 1.256 \times \text{X2B} + 1.594 \times \text{X3B} + 2.065 \times \text{HR}}{\text{AB} + \text{uBB} + \text{SF} + \text{HBP}} \]
# Requires the X1B and uBB columns created in the previous two problems
hitting_qualified <- mutate(hitting_qualified,
  BBP = BB/PA,
  KP = SO/PA,
  OBP = (H + BB + HBP)/(AB + BB + HBP + SF),
  SLG = (X1B + 2*X2B + 3*X3B + 4*HR)/AB,
  OPS = OBP + SLG,
  # The weight on singles is 0.881, matching the wOBA formula above
  wOBA = (0.687 * uBB + 0.718 * HBP + 0.881 * X1B + 1.256 * X2B +
    1.594 * X3B + 2.065 * HR)/(AB + uBB + SF + HBP))
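As a quick arithmetic check of the singles and slugging formulas, the sketch below applies them to a single made-up stat line. The tibble toy_line and its numbers are hypothetical and are not taken from hitting_qualified.

```r
library(dplyr)

# Hypothetical stat line for illustration only (not a real player)
toy_line <- tibble(AB = 500, H = 150, X2B = 30, X3B = 5, HR = 20)

toy_line <- mutate(toy_line,
  # Singles: 150 - 30 - 5 - 20 = 95
  X1B = H - X2B - X3B - HR,
  # Slugging: (95 + 60 + 15 + 80) / 500 = 0.5
  SLG = (X1B + 2 * X2B + 3 * X3B + 4 * HR) / AB)

toy_line
```

Working one row by hand like this is a cheap way to catch a mistyped weight or a misplaced parenthesis before applying a formula to the full table.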
For most of the statistics in the previous question, Fangraphs has defined rating scales (to see these ratings, click on the linked page for each statistic in Question 6 and scroll down to the “Context” section of the page). Use mutate() and case_when() to add the ratings for walk percentage (BBP), strike-out percentage (KP), on-base percentage (OBP), on-base plus slugging (OPS), and wOBA. Call the columns “BBP_rating”, “KP_rating”, “OBP_rating”, “OPS_rating”, and “wOBA_rating”.
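The pattern of combining mutate() with case_when() can be sketched as follows. The cutoffs and labels below are made up for illustration and are not the Fangraphs scales; the tibble toy_obp is also hypothetical. case_when() checks the conditions from top to bottom and uses the first one that is true, so the thresholds are listed from highest to lowest.

```r
library(dplyr)

# Hypothetical on-base percentages for three players
toy_obp <- tibble(
  playerID = c("a01", "b01", "c01"),
  OBP = c(0.290, 0.320, 0.400)
)

# Hypothetical cutoffs (NOT the Fangraphs scale); the first matching condition wins
toy_obp <- mutate(toy_obp,
  OBP_rating = case_when(
    OBP >= 0.390 ~ "Excellent",
    OBP >= 0.310 ~ "Average",
    TRUE ~ "Poor"))

toy_obp
```

The final TRUE condition acts as a catch-all for every row not matched by an earlier condition.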
Use filter() to subset the players who played between 2000 and 2015. Call the new tbl tmp_batting.
Use select() to create a tibble called batting_recent containing all players who played between 2000 and 2015 with the following columns: playerID, yearID, teamID, lgID, and all of the statistics and ratings created in Problems 6 and 7.
Explore the distribution of some of the batting statistics introduced in Problem 6 by making histograms from the tbl batting_recent. Then explore the relationship between some of these statistics with scatterplots.