MLB player aging

Author

Arthur Andrews

Goals

Study the aging of MLB player performance
Reveal the bias of estimating aging curves with simple aggregation
Demonstrate a statistics model for aging using splines and mixed effects modeling
Show some players whose aging curves are unusual or exceptional

Purpose

This is a hobby project.

Tools

R - programming language
baseballr - an R interface to MLB data
quarto - publishing data-rich documents from code
GitHub Pages - hosting code and web pages

Code

By default, the R code is displayed. If you wish, you may hide it from the “Code” menu in the top-right.

Dependencies

Code

library(tidyverse)
library(broom.mixed)
library(janitor)
library(splines)
library(factoextra)
library(reactable)
library(ggsci)
library(magrittr, include.only = "set_rownames")
library(lme4, exclude = c("expand", "pack", "unpack"))
library(htmltools, include.only = "css")
library(pracma, include.only = c("fzero", "fderiv"))

theme_set(theme_bw(base_size = 12))

Code

load("data/hitting_stats.RData")

Declare functions

For adding model predictions to a data frame and tabulating results in html tables.

Code

add_prediction <- function(hit, model, ...) {
  hit$pred_ops <- predict(model, hit, ...)
  hit
}


tabulate <- partial(
  reactable,
  fullWidth = FALSE, highlight = TRUE, style = css(fontSize = "80%")
)

Data set

I accessed MLB hitter data from 2005 - 2023 using the excellent baseballr package, which offers convenient interfaces to the Statcast APIs to Baseball Reference data.

I include - seasons with at least 200 plate appearances - and hitters with at least 5 seasons

The data loaded by this notebook was first prepared in a separate script that you may find in the code repository: query_mlb.R.

Code

hit <- hit |> 
  filter(n_seasons >= 5)

Code

hit |> 
  group_by(season) |> 
  summarize(rows = n(), hitters = n_distinct(id)) |> 
  tabulate(defaultPageSize = 15)

Print a glimpse of the data to see the schema and number of rows and columns.

Code

glimpse(hit)

Rows: 4,285
Columns: 13
$ id           <int> 110029, 110029, 110029, 110029, 110029, 110029, 110029, 1…
$ name         <chr> "Bobby Abreu", "Bobby Abreu", "Bobby Abreu", "Bobby Abreu…
$ season       <chr> "2005", "2006", "2007", "2008", "2009", "2010", "2011", "…
$ team_name    <chr> "Philadelphia Phillies", "New York Yankees", "New York Ya…
$ ab           <int> 588, 548, 605, 609, 563, 573, 502, 219, 575, 543, 417, 55…
$ pa           <int> 719, 686, 699, 684, 667, 667, 585, 257, 603, 588, 450, 59…
$ age          <dbl> 31.30732, 32.30664, 33.30595, 34.30801, 35.30732, 36.3066…
$ avg          <dbl> 0.286, 0.297, 0.283, 0.296, 0.293, 0.255, 0.253, 0.242, 0…
$ obp          <dbl> 0.405, 0.424, 0.369, 0.371, 0.390, 0.352, 0.353, 0.350, 0…
$ slg          <dbl> 0.474, 0.462, 0.445, 0.471, 0.435, 0.435, 0.365, 0.342, 0…
$ ops          <dbl> 0.879, 0.886, 0.814, 0.842, 0.825, 0.787, 0.718, 0.692, 0…
$ n_seasons    <int> 8, 8, 8, 8, 8, 8, 8, 8, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, …
$ centered_age <dbl> 2.158141, 3.157457, 4.156772, 5.158826, 6.158141, 7.15745…

Splines

I will use three B-splines to provide a flexible basis for describing the nonlinear aging curve.

Each of these splines is a smooth function of player age, and by including the split fits as predictor variables in a model, the model will be able to describe a nonlinear relationship between age and OPS. The coefficients will define the shape of the relationship.

With the splines package, the spline definition can very conveniently go inside the model formula.

Visualize the spline functions

Code

bind_cols(
  hit |> 
    select(age),
  hit$centered_age |> 
    bs(3)
) |> 
  pivot_longer(-age, names_to = "spline") |> 
  ggplot() + aes(age, value, color = spline) + geom_line()

Model

This model describes the player OPS as a function of age (via the splines) and player. The player name is a random effect in the model.

Random effects in statistics are a good way to include categorical variables with many levels as predictors.

Code

model1 <- lmer(
  ops ~ (1 | name) + bs(centered_age, df = 3),
  data = hit
)

model1

Linear mixed model fit by REML ['lmerMod']
Formula: ops ~ (1 | name) + bs(centered_age, df = 3)
   Data: hit
REML criterion at convergence: -8955.141
Random effects:
 Groups   Name        Std.Dev.
 name     (Intercept) 0.06695 
 Residual             0.07500 
Number of obs: 4285, groups:  name, 543
Fixed Effects:
              (Intercept)  bs(centered_age, df = 3)1  
                 0.720701                   0.147092  
bs(centered_age, df = 3)2  bs(centered_age, df = 3)3  
                -0.002604                  -0.153229

We can predict the aging curve for an average player by including only the model’s fixed effects. We can see an increase in performance until the player’s late 20’s and then a steeper decrease in their mid-30’s.

Code

pal <- pal_d3()(10)

mean_player <- hit |> 
  add_prediction(model1, re.form = ~0)

mean_player |> 
  ggplot() + aes(age, pred_ops) + geom_line(color = pal[2]) + 
  labs(title = "modeled aging curve", y = "ops") +
  scale_y_continuous(labels = scales::label_number(accuracy = 0.001))

Model coefficients

Fixed effects

These coefficients are what form the modeled aging curve.

Code

model1 |> 
  tidy("fixed") |> 
  mutate(across(where(is.numeric), \(x) round(x, 3))) |> 
  tabulate()

Random effects

The random effect intercept is the +/- of each player’s aging curve relative to the mean player. Unsurprisingly, tabulating these players by decreasing coefficients lists the greatest hitters of the era.

You may wish to sort this table by estimate or search for a player name in the level search box.

Code

model1 |> 
  tidy("ran_vals") |> 
  mutate(across(where(is.numeric), \(x) round(x, 3))) |> 
  arrange(desc(estimate)) |> 
  tabulate(columns = list(level = colDef(minWidth = 150, filterable = TRUE)))

Calculating the age of peack performance

To calculate the age of peak OPS, we can numerically solving for the zero of the 1st derivative with respect to age.

Code

predict_ops <- function(age, model) {
  tibble(centered_age = age - mean(hit$age)) |> 
    predict(object = model, re.form = ~0)
}

calc_ops_deriv <- function(age, model) {
  partial(predict_ops, model = model) |> # now a function of only age
    fderiv(age)
}

calc_zero_deriv <- function(model) {
  partial(calc_ops_deriv, model = model) |> # now a function of only age
    fzero(x = 30) |> 
    pluck("x")
}

cat("age of peak performance: ")

age of peak performance:

Code

model1 |> 
  calc_zero_deriv() |> 
  cat()

26.81941

Aggregation

Instead of performing the more complex steps of engineering the splines and fitting the statistical model, it might be tempting to simply average OPS by player age to see the aging curve.

This requires no statistics, but we’ll quickly find a problem.

Code

grouped_hit <- hit |> 
  group_by(age = round(age)) |> 
  mutate(player_contribution = pa / sum(pa)) 

aggregate_player <- grouped_hit |> 
  summarize(
    across(c(avg, obp, slg, ops), \(x) weighted.mean(x, pa)),
    .groups = "drop"
  )

Code

aggregate_player |> 
  ggplot() + aes(age, ops) + geom_line(color = pal[1]) +
  labs(title = "ops aggregated by player age")

This result looks substantially different from the fit aging curve.

That’s because it’s subject to an omitted variable bias. The players at the low age ranges (19, 20) and high (37, 38, …) are all-star and hall-of-fame caliber players. They distort the averages in these age ranges.

Code

tabulate_one_age <- function(ages) {
  grouped_hit |> 
    filter(age %in% ages) |> 
    slice_max(player_contribution, n = 5) |> 
    select(age, name, player_contribution) |> 
    tabulate(
      columns = list(
        name = colDef(minWidth = 150),
        player_contribution = colDef(
          "player contribution", 
          format = colFormat(percent = TRUE, digits = 1)
        )
      )
    )  
}

tabulate_one_age(20)
tabulate_one_age(21)

Code

tabulate_one_age(38)
tabulate_one_age(39)

Code

Model/aggregate comparison

Code

aging <- aggregate_player |> 
  mutate(
    centered_age = age - mean(hit$age), 
    pred_ops = tibble(centered_age) |> 
      predict(object = model1, re.form = ~0)
  ) 

aging |> 
  select(age, modeled = pred_ops, aggregate = ops) |> 
  pivot_longer(-age) |> 
  mutate(name = factor(name, c("aggregate", "modeled"))) |> 
  ggplot() + aes(age, value, color = name) + geom_line() + scale_color_d3() +
  labs(y = "ops", title = "modeled and aggregate ops aging")

A more complex model

By including a random slope for centered_age, we can allow different players to have different aging curves.

The partial pooling of mixed effects models is really nice for this, because it will penalize really extreme values of this slope. To me, this is a desirable feature - to keep the slopes realistic, even for players with a smaller amount of data.

Typically, including a random slope requires also adding a corresponding fixed effect. In this case, however, I will rely on the spline fits to fill the role of the fixed effect. This will maintain the population mean centered_age slope at zero.

Each player will be given a centered_age coefficient which modifies their aging curve.

Code

model2 <- lmer(
  ops ~ (centered_age | name) + bs(centered_age, df = 3),
  data = hit
) 

model2

Linear mixed model fit by REML ['lmerMod']
Formula: ops ~ (centered_age | name) + bs(centered_age, df = 3)
   Data: hit
REML criterion at convergence: -9016.766
Random effects:
 Groups   Name         Std.Dev. Corr
 name     (Intercept)  0.066054     
          centered_age 0.005782 0.07
 Residual              0.072703     
Number of obs: 4285, groups:  name, 543
Fixed Effects:
              (Intercept)  bs(centered_age, df = 3)1  
                  0.71586                    0.15025  
bs(centered_age, df = 3)2  bs(centered_age, df = 3)3  
                  0.01849                   -0.19515

Looking at the players with the most extreme values of the centered_age coefficient is interest: it reveals the players whose performance “aged” the fastest (negative values) and slowest (positive values).

Code

cf <- model2 |> 
  tidy("ran_vals") |> 
  filter(term == "centered_age") |> 
  mutate(across(where(is.numeric), \(x) round(x, 4))) |> 
  arrange(desc(estimate)) 

cf |> 
  tabulate(columns = list(level = colDef(minWidth = 150)))

In some cases, players aging faster than average is due to historically high peaks, like Miguel Cabrera.

Code

predict_player <- function(players, model) {
  
  mean_player <- hit |> 
    distinct(age, centered_age) |> 
    mutate(
      mean_player = tibble(centered_age) |> 
        predict(object = model, re.form = ~0)
    )
  
  hit |> 
    filter(name %in% players) |> 
    left_join(cf, by = join_by(name == level)) |> 
    mutate(
      player = tibble(name, age, centered_age) |> 
        predict(object = model),
      name = name |> 
        reorder(-abs(estimate))
    ) |> 
    ggplot() + 
    aes(x = age, color = "player") + geom_point(aes(y = ops)) + geom_line(aes(y = player)) +
    geom_line(aes(x = age, y = mean_player, color = "mean_player"), data = mean_player) +
    scale_color_manual(values = c("player" = "grey30", "mean_player" = pal[2])) +
    facet_wrap(~name + estimate)
}

predict_player("Miguel Cabrera", model2)

Players aging gracefully

Here are the aging curves for the 16 players with the slowing aging. Nelson Cruz and Adrian Beltre have the largest positive slopes, and this fits their reputation as late-bloomers. Jim Thome and David Ortiz remained star players into their late 30’s and early 40’s. Jose Altuve and Freddie Freeman are still active players, but are posting peak stats in their mid-30’s.

Because I only included data from 2005 - 2023 in my analysis, the early years of some player’s careers are missing.

Code

cf |> 
  slice_max(estimate, n = 16, with_ties = FALSE) |> 
  pull(level) |> 
  predict_player(model2)

Players aging rapidly

Here are the aging curves for the 16 players with the most rapid aging. Perhaps speculation on my part, but many of these players were pure power hitters with stout and muscular builds.

In many cases, these were players with All-Star 20’s, but average or below-average 30’s.

Code

cf |> 
  slice_min(estimate, n = 16, with_ties = FALSE) |> 
  pull(level) |> 
  predict_player(model2)

Conclusions

Splines and mixed effects models are a great choice for studying the aging of MLB player performance.

The splines can describe a nonlinear aging curve, and mixed effects models are ideal for working with categorical variables (in this case, player) with many levels.

Including an additional random effect slope for player age elegantly describes a range or spectrum of aging curves, and the results from the model seem consistent with baseball wisdom.