Fitting More Complex XG Models
# A tibble: 2 × 4
  shot.body_part.name shot.technique.name   XG1   XG2
  <chr>               <chr>               <dbl> <dbl>
1 Right Foot          Half Volley         0.111 0.089
2 Right Foot          Backheel            0.111 0.103
\[ \textrm{MISS} = n^{-1}\sum_{i = 1}^{n}{\mathbb{I}(y_{i} \neq \mathbb{I}(\hat{p}_{i} \geq 0.5))}, \]
Model 1 misclassification: 0.112
Model 2 misclassification: 0.112
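As a minimal sketch (the function and argument names are illustrative, not from the original code), the misclassification rate defined above can be computed as:

miss <- function(y, p_hat) {
  # proportion of shots where the 0.5-threshold prediction
  # disagrees with the observed outcome
  mean(y != as.integer(p_hat >= 0.5))
}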
# A tibble: 3 × 2
  shot.body_part.name   XG1
  <chr>               <dbl>
1 Right Foot          0.111
2 Left Foot           0.114
3 Head                0.112
# A tibble: 14 × 3
   shot.body_part.name shot.technique.name    XG2
   <chr>               <chr>                <dbl>
 1 Right Foot          Volley              0.0637
 2 Right Foot          Normal              0.121
 3 Right Foot          Half Volley         0.0892
 4 Left Foot           Overhead Kick       0
 5 Left Foot           Normal              0.121
 6 Head                Normal              0.113
 7 Left Foot           Half Volley         0.0676
 8 Left Foot           Volley              0.163
 9 Head                Diving Header       0
10 Right Foot          Lob                 0.208
11 Right Foot          Backheel            0.103
12 Right Foot          Overhead Kick       0.0714
13 Left Foot           Lob                 0
14 Left Foot           Backheel            0
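For reference, the Brier score reported below is the average squared difference between the observed outcome and the predicted probability:
\[ \textrm{Brier} = n^{-1}\sum_{i = 1}^{n}{(y_{i} - \hat{p}_{i})^{2}}. \]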
Model 1 Brier Score: 0.1
Model 2 Brier Score: 0.099
\[ \textrm{LogLoss} = -n^{-1} \times \sum_{i = 1}^{n}{\left[ y_{i} \times \log(\hat{p}_{i}) + (1 - y_{i})\times\log(1-\hat{p}_{i})\right]}. \]
Model 1 Log-Loss: 0.351
Model 2 Log-Loss: 0.348
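The logloss() function called in the code below is not shown in this section; a minimal version consistent with the formula above might look like this (the eps guard against taking log(0) is an added assumption):

logloss <- function(y, p_hat, eps = 1e-15) {
  # keep probabilities strictly between 0 and 1 so the logs are finite
  p_hat <- pmin(pmax(p_hat, eps), 1 - eps)
  -mean(y * log(p_hat) + (1 - y) * log(1 - p_hat))
}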
# Model 1: empirical goal rate for each body part
model1 <-
  train_data |>
  dplyr::group_by(shot.body_part.name) |>
  dplyr::summarise(XG1 = mean(Y))

# Model 2: empirical goal rate for each body part / technique combination
model2 <-
  train_data |>
  dplyr::group_by(shot.body_part.name, shot.technique.name) |>
  dplyr::summarise(XG2 = mean(Y), .groups = "drop")
# Attach each model's predictions to the training and test shots
train_preds <-
  train_data |>
  dplyr::inner_join(y = model1, by = c("shot.body_part.name")) |>
  dplyr::inner_join(y = model2, by = c("shot.body_part.name", "shot.technique.name"))

test_preds <-
  test_data |>
  dplyr::inner_join(y = model1, by = c("shot.body_part.name")) |>
  dplyr::inner_join(y = model2, by = c("shot.body_part.name", "shot.technique.name"))
logloss(train_preds$Y, train_preds$XG1)
logloss(test_preds$Y, test_preds$XG1)
logloss(train_preds$Y, train_preds$XG2)
logloss(test_preds$Y, test_preds$XG2)

BodyPart train log-loss: 0.357 test log-loss: 0.334
BodyPart+Technique train log-loss: 0.355 test log-loss: 0.351
Computing the training and test log-loss for both models with a for() loop (a sketch follows the results below):
Model 1 training logloss: 0.351
Model 2 training logloss: 0.348
Model 1 test logloss: 0.352
Model 2 test logloss: 0.356
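A minimal sketch of what that loop might look like, reusing the train_preds and test_preds tables built above (XG1 and XG2 are the columns created by the two models):

for (m in c("XG1", "XG2")) {
  # training and test log-loss for each empirical xG model
  cat(m, "training logloss:", round(logloss(train_preds$Y, train_preds[[m]]), 3), "\n")
  cat(m, "test logloss:", round(logloss(test_preds$Y, test_preds[[m]]), 3), "\n")
}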
What about a continuous predictor like DistToGoal?
For a binary outcome \(Y\) and numerical predictors \(X_{1}, \ldots, X_{p}\), logistic regression models the log-odds as \[ \log\left(\frac{\mathbb{P}(Y= 1 \vert \boldsymbol{\mathbf{X}})}{\mathbb{P}(Y = 0 \vert \boldsymbol{\mathbf{X}})}\right) = \beta_{0} + \beta_{1}X_{1} + \cdots + \beta_{p}X_{p}. \]
Keeping all other predictors constant, a one-unit change in \(X_{j}\) is associated with a \(\beta_{j}\) change in the log-odds.
Say \(\beta_{j} = 1\): increasing \(X_{j}\) by one unit increases the log-odds by 1, which multiplies the odds of \(Y = 1\) by \(e \approx 2.72\); the resulting change in \(\mathbb{P}(Y = 1)\) depends on where the log-odds started.
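As an illustration (numbers chosen only for this example): if the log-odds moves from \(0\) to \(1\), the probability moves from \(0.5\) to \(e/(1+e) \approx 0.73\); if it moves from \(2\) to \(3\), the probability only moves from about \(0.88\) to \(0.95\). The same one-unit change in \(X_{j}\) can therefore shift \(\mathbb{P}(Y = 1)\) by very different amounts.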
fit <- glm(formula = Y ~ DistToGoal, family = binomial("logit"), data = train_data)
summary(fit)
Call:
glm(formula = Y ~ DistToGoal, family = binomial("logit"), data = train_data)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.134174   0.128410  -1.045    0.296    
DistToGoal  -0.127023   0.009115 -13.935   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2549.7 on 3569 degrees of freedom
Residual deviance: 2290.7 on 3568 degrees of freedom
AIC: 2294.7
Number of Fisher Scoring iterations: 6
Use predict() to make test set predictions (a sketch follows the results below).
Dist training logloss: 0.321
Dist testing logloss: 0.305
Dist*BodyPart training logloss: 0.3167
Dist*BodyPart testing logloss: 0.3173
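A minimal sketch of how the Dist log-losses above might be computed, assuming the DistToGoal-only fit is stored as fit and using the logloss() helper from earlier (the xg_dist column name is illustrative):

# predicted goal probabilities from the DistToGoal-only model
train_data$xg_dist <- predict(fit, newdata = train_data, type = "response")
test_data$xg_dist <- predict(fit, newdata = test_data, type = "response")
logloss(train_data$Y, train_data$xg_dist)
logloss(test_data$Y, test_data$xg_dist)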
\[ \beta_{0} + \beta_{1}\times \textrm{DistToGoal} + \beta_{\textrm{LeftFoot}}\times \mathbb{I}(\textrm{LeftFoot}) + \beta_{\textrm{RightFoot}} \times \mathbb{I}(\textrm{RightFoot}) \]
Different predictions based on the body part used to attempt the shot
For a shot taken at distance \(d\), the predicted log-odds are \(\beta_{0} + \beta_{1}d\) for a header (the reference level), \(\beta_{0} + \beta_{1}d + \beta_{\textrm{LeftFoot}}\) for a left-footed shot, and \(\beta_{0} + \beta_{1}d + \beta_{\textrm{RightFoot}}\) for a right-footed shot.
fit <- glm(formula = Y ~ DistToGoal + shot.body_part.name,
           data = train_data, family = binomial("logit"))
summary(fit)
Call:
glm(formula = Y ~ DistToGoal + shot.body_part.name, family = binomial("logit"),
data = train_data)
Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -0.49847    0.15504  -3.215   0.0013 ** 
DistToGoal                     -0.17501    0.01081 -16.187  < 2e-16 ***
shot.body_part.nameLeft Foot    1.28091    0.17286   7.410 1.26e-13 ***
shot.body_part.nameRight Foot   1.30351    0.15711   8.297  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2533.3 on 3569 degrees of freedom
Residual deviance: 2167.4 on 3566 degrees of freedom
AIC: 2175.4
Number of Fisher Scoring iterations: 6
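As a worked example using the coefficients above, a right-footed shot from a distance of 10 (in the units of DistToGoal) has estimated log-odds of roughly \(-0.498 - 0.175 \times 10 + 1.304 \approx -0.94\), i.e. a predicted goal probability of \(1/(1 + e^{0.94}) \approx 0.28\).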
fit <-
  glm(formula = Y ~ DistToGoal * shot.body_part.name,
      data = train_data, family = binomial("logit"))
summary(fit)
Call:
glm(formula = Y ~ DistToGoal * shot.body_part.name, family = binomial("logit"),
data = train_data)
Coefficients:
                                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                                1.04701    0.41889   2.499  0.01244 *  
DistToGoal                                -0.34422    0.05092  -6.761 1.37e-11 ***
shot.body_part.nameLeft Foot              -0.21195    0.50881  -0.417  0.67700    
shot.body_part.nameRight Foot             -0.82193    0.46059  -1.785  0.07434 .  
DistToGoal:shot.body_part.nameLeft Foot    0.16418    0.05457   3.009  0.00263 ** 
DistToGoal:shot.body_part.nameRight Foot   0.20871    0.05233   3.988 6.66e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2549.7 on 3569 degrees of freedom
Residual deviance: 2214.5 on 3564 degrees of freedom
AIC: 2226.5
Number of Fisher Scoring iterations: 6
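One way to read the interaction terms above: the slope on DistToGoal is roughly \(-0.344\) for headers (the reference level), \(-0.344 + 0.164 \approx -0.180\) for left-footed shots, and \(-0.344 + 0.209 \approx -0.136\) for right-footed shots, so the probability of scoring decays fastest with distance for headers.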
BodyPart training logloss: 0.351
BodyPart+Technique training logloss: 0.348
BodyPart test logloss: 0.352
BodyPart+Technique test logloss: 0.356
Dist*BodyPart training logloss: 0.3167
Dist*BodyPart testing logloss: 0.3173
Dist*BodyPart training logloss: 0.3069
Dist*BodyPart testing logloss: 0.308
Dist*BodyPart training logloss: 0.3047
Dist*BodyPart testing logloss: 0.3066
[1] "shot.type.name" "shot.technique.name" "shot.body_part.name"
[4] "DistToGoal" "DistToKeeper" "AngleToGoal"
[7] "AngleToKeeper" "AngleDeviation" "avevelocity"
[10] "density" "density.incone" "distance.ToD1"
[13] "distance.ToD2" "AttackersBehindBall" "DefendersBehindBall"
[16] "DefendersInCone" "InCone.GK" "DefArea"
In principle, all of these predictors could be added to glm().

# shot-level variables used to build the modeling data
shot_vars <-
  c("Y",
    "shot.type.name",
    "shot.technique.name", "shot.body_part.name",
    "DistToGoal", "DistToKeeper", # dist. to keeper is distance from GK to goal
    "AngleToGoal", "AngleToKeeper",
    "AngleDeviation",
    "avevelocity", "density", "density.incone",
    "distance.ToD1", "distance.ToD2",
    "AttackersBehindBall", "DefendersBehindBall",
    "DefendersInCone", "InCone.GK", "DefArea")
# convert the categorical shot descriptors to factors
wi_shots <-
  wi_shots |>
  dplyr::mutate(
    shot.type.name = factor(shot.type.name),
    shot.body_part.name = factor(shot.body_part.name),
    shot.technique.name = factor(shot.technique.name))
set.seed(479)

# randomly sample n_train shots for training; the rest form the test set
train_data <-
  wi_shots |>
  dplyr::slice_sample(n = n_train) |>
  dplyr::select(dplyr::all_of(c("id", shot_vars)))

test_data <-
  wi_shots |>
  dplyr::anti_join(y = train_data, by = "id") |>
  dplyr::select(dplyr::all_of(c("id", shot_vars)))

# keep numeric 0/1 outcomes for computing log-loss
y_train <- train_data$Y
y_test <- test_data$Y

# recode Y as a factor for classification and drop the id column
train_data <-
  train_data |>
  dplyr::mutate(Y = factor(Y, levels = c(0, 1))) |>
  dplyr::select(-id)

test_data <-
  test_data |>
  dplyr::mutate(Y = factor(Y, levels = c(0, 1))) |>
  dplyr::select(-id)

RandomForest training logloss: 0.1113
RandomForest testing logloss: 0.2705
RandomForest training logloss: 0.154
StatsBomb training logloss: 0.2642
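The random forest fit itself is not shown in this section; a minimal sketch of how results like those above might be produced, assuming the ranger package, 500 trees (an arbitrary choice), and the logloss() helper from earlier:

library(ranger)

set.seed(479)
rf_fit <- ranger(Y ~ ., data = train_data,
                 probability = TRUE, num.trees = 500)

# predicted goal probabilities; column "1" holds the probability of a goal
rf_train_prob <- predict(rf_fit, data = train_data)$predictions[, "1"]
rf_test_prob <- predict(rf_fit, data = test_data)$predictions[, "1"]

logloss(y_train, rf_train_prob)
logloss(y_test, rf_test_prob)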