Author

Arthur Andrews

Goals

  • Study the aging of MLB player performance
  • Reveal the bias of estimating aging curves with simple aggregation
  • Demonstrate a statistics model for aging using splines and mixed effects modeling
  • Show some players whose aging curves are unusual or exceptional

Purpose

This is a hobby project.

Tools

  • R - programming language
  • baseballr - an R interface to MLB data
  • quarto - publishing data-rich documents from code
  • GitHub Pages - hosting code and web pages

Code

By default, the R code is displayed. If you wish, you may hide it from the “Code” menu in the top-right.

Dependencies

Code
library(tidyverse)
library(broom.mixed)
library(janitor)
library(splines)
library(factoextra)
library(reactable)
library(ggsci)
library(magrittr, include.only = "set_rownames")
library(lme4, exclude = c("expand", "pack", "unpack"))
library(htmltools, include.only = "css")
library(pracma, include.only = c("fzero", "fderiv"))

theme_set(theme_bw(base_size = 12))
Code
load("data/hitting_stats.RData")

Declare functions

For adding model predictions to a data frame and tabulating results in html tables.

Code
add_prediction <- function(hit, model, ...) {
  hit$pred_ops <- predict(model, hit, ...)
  hit
}


tabulate <- partial(
  reactable,
  fullWidth = FALSE, highlight = TRUE, style = css(fontSize = "80%")
)

Data set

I accessed MLB hitter data from 2005 - 2023 using the excellent baseballr package, which offers convenient interfaces to the Statcast APIs to Baseball Reference data.

I include - seasons with at least 200 plate appearances - and hitters with at least 5 seasons

The data loaded by this notebook was first prepared in a separate script that you may find in the code repository: query_mlb.R.

Code
hit <- hit |> 
  filter(n_seasons >= 5) 
Code
hit |> 
  group_by(season) |> 
  summarize(rows = n(), hitters = n_distinct(id)) |> 
  tabulate(defaultPageSize = 15)

Print a glimpse of the data to see the schema and number of rows and columns.

Code
glimpse(hit)
Rows: 4,285
Columns: 13
$ id           <int> 110029, 110029, 110029, 110029, 110029, 110029, 110029, 1…
$ name         <chr> "Bobby Abreu", "Bobby Abreu", "Bobby Abreu", "Bobby Abreu…
$ season       <chr> "2005", "2006", "2007", "2008", "2009", "2010", "2011", "…
$ team_name    <chr> "Philadelphia Phillies", "New York Yankees", "New York Ya…
$ ab           <int> 588, 548, 605, 609, 563, 573, 502, 219, 575, 543, 417, 55…
$ pa           <int> 719, 686, 699, 684, 667, 667, 585, 257, 603, 588, 450, 59…
$ age          <dbl> 31.30732, 32.30664, 33.30595, 34.30801, 35.30732, 36.3066…
$ avg          <dbl> 0.286, 0.297, 0.283, 0.296, 0.293, 0.255, 0.253, 0.242, 0…
$ obp          <dbl> 0.405, 0.424, 0.369, 0.371, 0.390, 0.352, 0.353, 0.350, 0…
$ slg          <dbl> 0.474, 0.462, 0.445, 0.471, 0.435, 0.435, 0.365, 0.342, 0…
$ ops          <dbl> 0.879, 0.886, 0.814, 0.842, 0.825, 0.787, 0.718, 0.692, 0…
$ n_seasons    <int> 8, 8, 8, 8, 8, 8, 8, 8, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, …
$ centered_age <dbl> 2.158141, 3.157457, 4.156772, 5.158826, 6.158141, 7.15745…

Splines

I will use three B-splines to provide a flexible basis for describing the nonlinear aging curve.

Each of these splines is a smooth function of player age, and by including the split fits as predictor variables in a model, the model will be able to describe a nonlinear relationship between age and OPS. The coefficients will define the shape of the relationship.

With the splines package, the spline definition can very conveniently go inside the model formula.

Visualize the spline functions

Code
bind_cols(
  hit |> 
    select(age),
  hit$centered_age |> 
    bs(3)
) |> 
  pivot_longer(-age, names_to = "spline") |> 
  ggplot() + aes(age, value, color = spline) + geom_line()

Model

This model describes the player OPS as a function of age (via the splines) and player. The player name is a random effect in the model.

Random effects in statistics are a good way to include categorical variables with many levels as predictors.

Code
model1 <- lmer(
  ops ~ (1 | name) + bs(centered_age, df = 3),
  data = hit
)

model1
Linear mixed model fit by REML ['lmerMod']
Formula: ops ~ (1 | name) + bs(centered_age, df = 3)
   Data: hit
REML criterion at convergence: -8955.141
Random effects:
 Groups   Name        Std.Dev.
 name     (Intercept) 0.06695 
 Residual             0.07500 
Number of obs: 4285, groups:  name, 543
Fixed Effects:
              (Intercept)  bs(centered_age, df = 3)1  
                 0.720701                   0.147092  
bs(centered_age, df = 3)2  bs(centered_age, df = 3)3  
                -0.002604                  -0.153229  

We can predict the aging curve for an average player by including only the model’s fixed effects. We can see an increase in performance until the player’s late 20’s and then a steeper decrease in their mid-30’s.

Code
pal <- pal_d3()(10)

mean_player <- hit |> 
  add_prediction(model1, re.form = ~0)

mean_player |> 
  ggplot() + aes(age, pred_ops) + geom_line(color = pal[2]) + 
  labs(title = "modeled aging curve", y = "ops") +
  scale_y_continuous(labels = scales::label_number(accuracy = 0.001))

Model coefficients

Fixed effects

These coefficients are what form the modeled aging curve.

Code
model1 |> 
  tidy("fixed") |> 
  mutate(across(where(is.numeric), \(x) round(x, 3))) |> 
  tabulate()

Random effects

The random effect intercept is the +/- of each player’s aging curve relative to the mean player. Unsurprisingly, tabulating these players by decreasing coefficients lists the greatest hitters of the era.

You may wish to sort this table by estimate or search for a player name in the level search box.

Code
model1 |> 
  tidy("ran_vals") |> 
  mutate(across(where(is.numeric), \(x) round(x, 3))) |> 
  arrange(desc(estimate)) |> 
  tabulate(columns = list(level = colDef(minWidth = 150, filterable = TRUE)))

Calculating the age of peack performance

To calculate the age of peak OPS, we can numerically solving for the zero of the 1st derivative with respect to age.

Code
predict_ops <- function(age, model) {
  tibble(centered_age = age - mean(hit$age)) |> 
    predict(object = model, re.form = ~0)
}

calc_ops_deriv <- function(age, model) {
  partial(predict_ops, model = model) |> # now a function of only age
    fderiv(age)
}

calc_zero_deriv <- function(model) {
  partial(calc_ops_deriv, model = model) |> # now a function of only age
    fzero(x = 30) |> 
    pluck("x")
}

cat("age of peak performance: ")
age of peak performance: 
Code
model1 |> 
  calc_zero_deriv() |> 
  cat()
26.81941

Aggregation

Instead of performing the more complex steps of engineering the splines and fitting the statistical model, it might be tempting to simply average OPS by player age to see the aging curve.

This requires no statistics, but we’ll quickly find a problem.

Code
grouped_hit <- hit |> 
  group_by(age = round(age)) |> 
  mutate(player_contribution = pa / sum(pa)) 

aggregate_player <- grouped_hit |> 
  summarize(
    across(c(avg, obp, slg, ops), \(x) weighted.mean(x, pa)),
    .groups = "drop"
  ) 
Code
aggregate_player |> 
  ggplot() + aes(age, ops) + geom_line(color = pal[1]) +
  labs(title = "ops aggregated by player age")

This result looks substantially different from the fit aging curve.

That’s because it’s subject to an omitted variable bias. The players at the low age ranges (19, 20) and high (37, 38, …) are all-star and hall-of-fame caliber players. They distort the averages in these age ranges.

Code
tabulate_one_age <- function(ages) {
  grouped_hit |> 
    filter(age %in% ages) |> 
    slice_max(player_contribution, n = 5) |> 
    select(age, name, player_contribution) |> 
    tabulate(
      columns = list(
        name = colDef(minWidth = 150),
        player_contribution = colDef(
          "player contribution", 
          format = colFormat(percent = TRUE, digits = 1)
        )
      )
    )  
}
tabulate_one_age(20)
tabulate_one_age(21)
Code
Code
tabulate_one_age(38)
tabulate_one_age(39)
Code
Code

Model/aggregate comparison

Code
aging <- aggregate_player |> 
  mutate(
    centered_age = age - mean(hit$age), 
    pred_ops = tibble(centered_age) |> 
      predict(object = model1, re.form = ~0)
  ) 

aging |> 
  select(age, modeled = pred_ops, aggregate = ops) |> 
  pivot_longer(-age) |> 
  mutate(name = factor(name, c("aggregate", "modeled"))) |> 
  ggplot() + aes(age, value, color = name) + geom_line() + scale_color_d3() +
  labs(y = "ops", title = "modeled and aggregate ops aging")

A more complex model

By including a random slope for centered_age, we can allow different players to have different aging curves.

The partial pooling of mixed effects models is really nice for this, because it will penalize really extreme values of this slope. To me, this is a desirable feature - to keep the slopes realistic, even for players with a smaller amount of data.

Typically, including a random slope requires also adding a corresponding fixed effect. In this case, however, I will rely on the spline fits to fill the role of the fixed effect. This will maintain the population mean centered_age slope at zero.

Each player will be given a centered_age coefficient which modifies their aging curve.

Code
model2 <- lmer(
  ops ~ (centered_age | name) + bs(centered_age, df = 3),
  data = hit
) 

model2
Linear mixed model fit by REML ['lmerMod']
Formula: ops ~ (centered_age | name) + bs(centered_age, df = 3)
   Data: hit
REML criterion at convergence: -9016.766
Random effects:
 Groups   Name         Std.Dev. Corr
 name     (Intercept)  0.066054     
          centered_age 0.005782 0.07
 Residual              0.072703     
Number of obs: 4285, groups:  name, 543
Fixed Effects:
              (Intercept)  bs(centered_age, df = 3)1  
                  0.71586                    0.15025  
bs(centered_age, df = 3)2  bs(centered_age, df = 3)3  
                  0.01849                   -0.19515  

Looking at the players with the most extreme values of the centered_age coefficient is interest: it reveals the players whose performance “aged” the fastest (negative values) and slowest (positive values).

Code
cf <- model2 |> 
  tidy("ran_vals") |> 
  filter(term == "centered_age") |> 
  mutate(across(where(is.numeric), \(x) round(x, 4))) |> 
  arrange(desc(estimate)) 

cf |> 
  tabulate(columns = list(level = colDef(minWidth = 150)))

In some cases, players aging faster than average is due to historically high peaks, like Miguel Cabrera.

Code
predict_player <- function(players, model) {
  
  mean_player <- hit |> 
    distinct(age, centered_age) |> 
    mutate(
      mean_player = tibble(centered_age) |> 
        predict(object = model, re.form = ~0)
    )
  
  hit |> 
    filter(name %in% players) |> 
    left_join(cf, by = join_by(name == level)) |> 
    mutate(
      player = tibble(name, age, centered_age) |> 
        predict(object = model),
      name = name |> 
        reorder(-abs(estimate))
    ) |> 
    ggplot() + 
    aes(x = age, color = "player") + geom_point(aes(y = ops)) + geom_line(aes(y = player)) +
    geom_line(aes(x = age, y = mean_player, color = "mean_player"), data = mean_player) +
    scale_color_manual(values = c("player" = "grey30", "mean_player" = pal[2])) +
    facet_wrap(~name + estimate)
}

predict_player("Miguel Cabrera", model2)

Players aging gracefully

Here are the aging curves for the 16 players with the slowing aging. Nelson Cruz and Adrian Beltre have the largest positive slopes, and this fits their reputation as late-bloomers. Jim Thome and David Ortiz remained star players into their late 30’s and early 40’s. Jose Altuve and Freddie Freeman are still active players, but are posting peak stats in their mid-30’s.

Because I only included data from 2005 - 2023 in my analysis, the early years of some player’s careers are missing.

Code
cf |> 
  slice_max(estimate, n = 16, with_ties = FALSE) |> 
  pull(level) |> 
  predict_player(model2)

Players aging rapidly

Here are the aging curves for the 16 players with the most rapid aging. Perhaps speculation on my part, but many of these players were pure power hitters with stout and muscular builds.

In many cases, these were players with All-Star 20’s, but average or below-average 30’s.

Code
cf |> 
  slice_min(estimate, n = 16, with_ties = FALSE) |> 
  pull(level) |> 
  predict_player(model2)

Conclusions

Splines and mixed effects models are a great choice for studying the aging of MLB player performance.

The splines can describe a nonlinear aging curve, and mixed effects models are ideal for working with categorical variables (in this case, player) with many levels.

Including an additional random effect slope for player age elegantly describes a range or spectrum of aging curves, and the results from the model seem consistent with baseball wisdom.