About Data Dunk

About the Author

Built and maintained by Chase Allbright, a Data Analyst at Booz Allen Hamilton with an MS in Computer Science from Troy University.

What Is This?

Data Dunk is my attempt at predicting NBA season awards and outcomes using machine learning. More specifically, it predicts MVP voting and how teams will fare in the playoffs. All models are trained on historical stats going back to 1980, acquired from basketball-reference.com. The data is refreshed regularly, and the "Last updated" date on each page shows when that happened.

The Models

MVP Model

The MVP model is a Random Forest binary classifier. It uses 100 trees, balanced class weights, and is trained on 46 seasons of player data (24,000+ player seasons). Its goal is to estimate the probability that a player wins the MVP award based on their regular season stats.
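The setup described above can be sketched in a few lines, assuming scikit-learn. The synthetic data here is purely illustrative; the real model trains on the 24,000+ player seasons with 39 real features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-in data: one row per player-season, 39 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 39))
y = np.zeros(200, dtype=int)  # 1 = won MVP that season
y[:5] = 1                     # winners are rare, hence balanced class weights

mvp_model = RandomForestClassifier(
    n_estimators=100,          # 100 trees
    class_weight="balanced",   # offsets the extreme class imbalance
    random_state=42,
)
mvp_model.fit(X, y)
win_prob = mvp_model.predict_proba(X)[:, 1]  # P(player wins MVP)
```

The balanced class weights matter because only one player per season is labeled positive; without them the model could score well by predicting "no MVP" for everyone.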

It uses 39 input features: the per-game stats you'd expect as well as advanced metrics like PER, BPM, VORP, Win Shares, TS%, and usage rate. In addition to the player stats themselves, team stats are included, like net rating, margin of victory, SRS, win percentage, and strength of schedule, because voters place a heavy emphasis on winning.

Backtesting uses only data from years before the season being predicted, so no prediction benefits from hindsight as more MVPs are named. The model can predict as far back as 1981, but recent seasons are the most relevant, which is why I've only included results for the last five seasons plus the current one. The model got the MVP right in five of those six seasons, and in all six the actual winner finished top three in voting.
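The walk-forward backtest can be sketched like this, assuming scikit-learn and pandas. The tiny DataFrame, the single `per` feature, and the logistic stand-in model are all illustrative; the real pipeline retrains the Random Forest on the full feature set for each holdout season.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy season table; the real backtest uses the full 1980+ dataset.
df = pd.DataFrame({
    "season":  [2021, 2021, 2022, 2022, 2023, 2023],
    "player":  ["A", "B", "A", "B", "A", "B"],
    "per":     [25.0, 30.1, 28.5, 27.0, 31.2, 26.4],
    "won_mvp": [0, 1, 1, 0, 1, 0],
})

def backtest(df, first_season):
    picks = {}
    for season in sorted(df["season"].unique()):
        if season < first_season:
            continue
        train = df[df["season"] < season]   # only earlier seasons: no hindsight
        test = df[df["season"] == season]
        # Stand-in model: the real pipeline fits a Random Forest here.
        model = LogisticRegression().fit(train[["per"]], train["won_mvp"])
        probs = model.predict_proba(test[["per"]])[:, 1]
        picks[season] = test["player"].iloc[probs.argmax()]
    return picks

predictions = backtest(df, first_season=2022)
```

Each season's pick comes from a model that has never seen that season's data, which is what makes the reported accuracy an honest estimate.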

Championship Model (Two-Stage Pipeline)

This pipeline is more involved: it has two stages that run in sequence.

Stage 1 handles playoff qualification. An XGBoost classifier, trained on all 30 teams each season, outputs a playoff probability for every team, and the top 16 move on to the next stage. Accuracy is generally good when predicting the 16 playoff teams. The play-in tournament is not factored in.
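The top-16 cut can be sketched as follows. The probabilities below are random placeholders standing in for the XGBoost classifier's output, and the team labels are made up.

```python
import numpy as np
import pandas as pd

# Placeholder Stage 1 output: a playoff probability for all 30 teams.
rng = np.random.default_rng(1)
stage1_probs = pd.Series(
    rng.random(30),
    index=[f"TEAM{i:02d}" for i in range(30)],  # hypothetical team codes
)

# The 16 highest-probability teams advance to Stage 2.
playoff_field = stage1_probs.nlargest(16).index.tolist()
```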

Stage 2 is where the Finals and championship predictions happen. It's a soft-voting ensemble, weighted 60/40 between XGBoost and Random Forest, trained only on teams that actually made the playoffs. It outputs two probabilities for each team: one for reaching the Finals and one for winning it all. Each team is evaluated independently, so while the predicted probabilities are low, the ordering matters more than the raw values. Imagine predicting each team's championship odds at the start of the season without looking at any other team: to give everyone a fair shake, you'd give even your favorite maybe a 25% chance to win it all, accounting for injuries, team collapse, trades, etc. That's how this model predicts.
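Soft voting just means averaging the two models' predicted probabilities, here with a 60/40 weighting. The per-team probabilities below are made-up stand-ins for the two models' outputs.

```python
import numpy as np

# Hypothetical per-team Finals probabilities from each Stage 2 model.
p_xgb = np.array([0.22, 0.15, 0.08])  # XGBoost
p_rf  = np.array([0.18, 0.20, 0.05])  # Random Forest

# Soft voting: weighted average of predicted probabilities, 60/40.
p_blend = 0.6 * p_xgb + 0.4 * p_rf

# Since each team is scored independently, the ranking is what matters.
ranking = np.argsort(-p_blend)
```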

This model takes 49 features: offensive and defensive counting stats, advanced team stats, and opponents' stats against that team.

One thing to mention: all of the counting and rate stats get z-score normalized within each season. This removes era drift (e.g. the 3-point revolution of the mid-2010s, or the dead-ball era) and ensures that teams are only compared against their contemporaries, not history as a whole. When I tried skipping this step, the models ended up favoring older stat lines (think Kareem or even Kobe) instead of reflecting how players play now. Training samples also get weighted two ways: by recency and by inverse class frequency, so recent championship teams count for more than the rest.

I'll be honest: accuracy for the Finals predictor isn't the best. I'm working to improve it, but I take solace in the fact that it at least makes reasonable guesses. For example, in 2024–2025 the model predicted Cleveland to make the conference finals, and after their hot start I believed that too. They ended up dealing with injuries and just general choking, so the prediction was wrong, but at least it was defensible.

A Note on Probabilities

These models output probabilities. That's it. A team at 70% to win the title still loses three out of every ten times. The Accuracy page has the actual track record if you want to see how the models have done on completed seasons. Note also that each team's probability is computed in a vacuum, without accounting for other teams: no matchup information is included.

For MVP voting, the probability is not the percentage of votes a player will get (vote share). I run non-public models on vote share, but the probability approach is easier to understand, so that's what I publish.