Are Upsets More Likely Between Division Rivals in the NFL?

Ben Namovicz

Introduction

In week 14 of the 2018 NFL season, the 6-6 Miami Dolphins beat the 9-3 New England Patriots in miraculous fashion. The Patriots would go on to win the Super Bowl that season, while the Dolphins didn't even make the playoffs. The very next year, the 4-11 Dolphins beat the 12-3 Patriots in another big upset. Taken alone these games are a remarkable story, but they also fit into a common narrative in sports.

It is often claimed that division rivals have an easier time pulling off upsets than out-of-division opponents. The two in-division upsets discussed above would seem to support this claim, but does it hold up to rigorous data analysis? In this tutorial we will use Python's data analysis tools to put this claim to the test.


Background Information

There are 32 teams in the NFL. These teams are divided into 2 conferences of 16 teams each. The conferences are further subdivided into divisions of 4 teams each, leaving the NFL with 8 total divisions. These divisions can be seen below:

AFC:

| East | West | North | South |
|------|------|-------|-------|
| Buffalo Bills | Denver Broncos | Baltimore Ravens | Houston Texans |
| Miami Dolphins | Kansas City Chiefs | Cincinnati Bengals | Indianapolis Colts |
| New England Patriots | Las Vegas Raiders | Cleveland Browns | Jacksonville Jaguars |
| New York Jets | Los Angeles Chargers | Pittsburgh Steelers | Tennessee Titans |

NFC:

| East | West | North | South |
|------|------|-------|-------|
| Dallas Cowboys | Arizona Cardinals | Chicago Bears | Atlanta Falcons |
| New York Giants | Los Angeles Rams | Detroit Lions | Carolina Panthers |
| Philadelphia Eagles | San Francisco 49ers | Green Bay Packers | New Orleans Saints |
| Washington Football Team | Seattle Seahawks | Minnesota Vikings | Tampa Bay Buccaneers |

An NFL team plays 16 games in a season. Six of these are in-division games, so a team plays each of the other three teams in its division twice every year. The other 10 games are played out of division, and those matchups change every year; a team never plays an out-of-division opponent more than once in a season. As a result, a team is much more familiar with the teams in its division, which is where the logic behind the claim we are testing comes from: if you face the same team twice every year, you should know how they operate, and that familiarity should level the playing field between good teams and bad teams.

Data Collection

In order to test our hypothesis we will definitely need records of who wins and loses each game. We will also need some measure of what counts as an 'upset'. The most obvious way to quantify upsets would be to use win-loss records, but this is flawed for two big reasons:

  1. Teams start at 0-0 each season, so early season games will have to be thrown out
  2. Records can be deceiving. A 5-4 record against good teams should be better than a 6-3 record against terrible teams.

Luckily, there is an approach that solves both of these problems. The Elo rating system is a method of ranking teams based on their performance over time. For this project we will use 538's Historical NFL Elo Ratings. The complete dataset, which includes Elo scores and game results, can be found here. An in-depth explanation of how the Elo ratings are calculated can be found here. For our purposes we just need to know that a team's Elo score is a representation of how good they are, with better teams having higher scores. As we will see later, these scores can be used to calculate a predicted probability of winning or a predicted point spread.

Now let's get started by importing some modules. Here are the modules we will be using:
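Something like the following covers the tools used in the rest of the tutorial; the exact modules in the original notebook may differ slightly:

```python
# Data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Hypothesis testing and machine learning (used in later sections)
from scipy import stats
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import cross_val_score
```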

Elo Data

Next we read in our 538 NFL Elo data. This dataset records the date, teams, Elo ratings, and results of every NFL game.
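A minimal sketch of loading the data, assuming the CSV has been downloaded as `nfl_elo.csv` from 538's data repository:

```python
# Load 538's historical NFL Elo data into a dataframe.
elo = pd.read_csv("nfl_elo.csv")
elo.head()
```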

Divisions

This dataframe only uses 3-letter team ids to identify teams. To analyze division relationships with these same ids we will have to create our own dataframe. This sounds like a lot of work, but there are only 32 teams to worry about. You can also see the NFL divisions here.
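A sketch of building that dataframe is below. The 3-letter ids shown are assumptions; relocated franchises in particular may appear under older ids (e.g. 'OAK', 'SD', 'STL'), so the mapping should be checked against the ids that actually appear in the Elo data:

```python
# Map each 3-letter team id to its division.
division_map = {
    'BUF': 'AFC East', 'MIA': 'AFC East', 'NE': 'AFC East', 'NYJ': 'AFC East',
    'DEN': 'AFC West', 'KC': 'AFC West', 'LV': 'AFC West', 'LAC': 'AFC West',
    # ...the remaining six divisions follow the tables shown above
}
divisions = pd.DataFrame(list(division_map.items()), columns=['team', 'division'])
```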

Data Processing

Before we can analyze the data, we need to clean it up. We only want to look at regular season games, which are denoted in the data by leaving the playoff column blank. Here we mark those games with 'n/a' and then select only games with 'n/a' in the playoff column.
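Roughly, assuming the column is named `playoff` as in 538's file:

```python
# Regular-season games have a blank playoff column; fill it and filter on it.
elo['playoff'] = elo['playoff'].fillna('n/a')
regular = elo[elo['playoff'] == 'n/a']
```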

Next we remove a number of columns that are not useful for our analysis. An important question here is which Elo values to use. The dataset provides both a standard Elo and a QB-adjusted Elo. Quarterbacks are usually considered the most impactful players on a football team, so 538 creates a quarterback rating system and adjusts each team's Elo score to account for its starting quarterback. This becomes very relevant when a quarterback is injured, as the team's Elo will immediately drop to account for the weaker backup instead of having to wait for the team to lose before adjusting the rating. The QB-adjusted Elo is considered a more accurate assessment of a team's overall strength, so it is what we will use for this analysis.
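A sketch of the column selection, assuming 538's column names (`qbelo1_pre`, `qbelo2_pre`, `qbelo_prob1`, `score1`, `score2`, and so on):

```python
# Keep only the columns we need, using the QB-adjusted Elo ratings.
cols = ['date', 'season', 'team1', 'team2',
        'qbelo1_pre', 'qbelo2_pre', 'qbelo_prob1', 'score1', 'score2']
games = regular[cols].copy()
```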

The last thing we need to do to clean our data is to restrict it to years where the divisions are the same. According to Wikipedia, the divisions of the NFL were realigned in 2002, so any data before that would not match our divisions dataset. We will also exclude the 2020 season as it is still ongoing.
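Using the `games` dataframe from the sketch above:

```python
# Keep only seasons played under the post-2002 division alignment,
# and drop the still-ongoing 2020 season.
games = games[(games['season'] >= 2002) & (games['season'] <= 2019)]
```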

Next we need to identify which games are in division and which are not. To do this we define a helper function to check whether two team ids belong to the same division. We apply the function to our elo dataset.
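A possible implementation, using the `divisions` dataframe sketched earlier:

```python
# Series mapping team id -> division name for fast lookups.
division_lookup = divisions.set_index('team')['division']

def same_division(team1, team2):
    # Teams missing from the mapping are treated as out of division.
    d1 = division_lookup.get(team1)
    d2 = division_lookup.get(team2)
    return d1 is not None and d1 == d2

games['in_division'] = games.apply(
    lambda row: same_division(row['team1'], row['team2']), axis=1)
```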

Finally, we would like to have a predicted probability and point spread for future analysis. Luckily our dataset already comes with a probability, but we will have to calculate the predicted score margin ourselves. 538 gives a formula for predicted margin based on Elo scores.

$\hat{margin} = \frac{elo_1 - elo_2}{25}$

We also calculate the actual margin by simply subtracting the final scores, and we record the result of the game as 1, 0, or 0.5: a result of 1 means team 1 won, 0 means team 2 won, and 0.5 means the teams tied.
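In code, this might look like the following (column names as assumed above):

```python
# Predicted margin from 538's formula: (elo1 - elo2) / 25, positive favoring team 1.
games['pred_margin'] = (games['qbelo1_pre'] - games['qbelo2_pre']) / 25

# Actual margin and result from team 1's perspective (1 = win, 0 = loss, 0.5 = tie).
games['margin'] = games['score1'] - games['score2']
games['result'] = np.where(games['margin'] > 0, 1.0,
                           np.where(games['margin'] < 0, 0.0, 0.5))
```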

An important fact about our predictions is that they are blind to division. That means the prediction is the same for any two games with the same Elo difference, regardless of whether they are in division. This is what lets us test our question: if upsets are more common for in-division games, we would expect to see less accurate predictions in division, because for an upset to occur, the prediction has to be wrong.

Data Analysis

Now that we have clean data we can perform a variety of visualizations and hypothesis tests to better understand it.

Data Visualization: Win Margins

The first visualization will test how good our predicted margin is. We can plot a simple scatterplot of our prediction against the actual result. It is important to verify that our predictor is effective because we are using it to define 'underdog' and 'upset'. The validity of our results depends on knowing that our evaluations of team strength are meaningful.
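A minimal version of that scatterplot, using the columns defined in the processing sketches above:

```python
# Predicted margin vs. actual margin for every game.
plt.figure(figsize=(8, 6))
plt.scatter(games['pred_margin'], games['margin'], s=5, alpha=0.3)
plt.xlabel('Predicted margin (team 1 - team 2)')
plt.ylabel('Actual margin (team 1 - team 2)')
plt.title('Predicted vs. actual win margin')
plt.show()
```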

We can see that teams tend to do better with higher expected margins, but there is a lot of variation. This should be expected, as football game scores are pretty hard to predict. Now we will make the same plot, but color-coded based on division. This is achieved by plotting two subplots on top of each other. This plot only shows two seasons, as including more makes it crowded and hard to read.
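One way to layer the two groups, restricting to the last two seasons in the cleaned data:

```python
# Same scatterplot, color-coded by whether the game was in division.
recent = games[games['season'] >= 2018]
out_div = recent[~recent['in_division']]
in_div = recent[recent['in_division']]

plt.figure(figsize=(8, 6))
plt.scatter(out_div['pred_margin'], out_div['margin'], s=15, label='Out of division')
plt.scatter(in_div['pred_margin'], in_div['margin'], s=15, label='In division')
plt.xlabel('Predicted margin')
plt.ylabel('Actual margin')
plt.legend()
plt.show()
```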

There isn't any obvious difference based on division in this plot. If in-division games were more likely to end in upsets, we would expect to see a weaker correlation between predicted margin and margin for in division games, but that does not appear to be the case.

Data Visualization: Win percentage

Win margins are a great way to test prediction accuracy, but at the end of the day all that matters in football is who wins the game. A good prediction should accurately describe a team's chances of winning: ideally, teams given a 60% chance of winning should win 60% of their games. We can test this directly by putting the data into buckets and calculating the win percentage for each bucket. For example, what percent of teams given a 35%-40% chance to win end up winning? Hopefully about 35 to 40 percent. Below we plot this using 20 equally sized intervals of 5%. To assign a bucket to each data point we use pandas.cut, and then use pandas.groupby to calculate the win percentage of each bucket. We will plot these win percentages in a bar chart.
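A sketch of the bucketing, assuming the dataset's win probability column is `qbelo_prob1`:

```python
# Bucket games into twenty 5% intervals of predicted win probability,
# then compute the observed win rate within each bucket.
bins = np.linspace(0, 1, 21)
games['prob_bucket'] = pd.cut(games['qbelo_prob1'], bins=bins)
win_pct = games.groupby('prob_bucket')['result'].mean()

win_pct.plot(kind='bar', figsize=(10, 5))
plt.xlabel('Predicted win probability for team 1')
plt.ylabel('Actual win rate')
plt.show()
```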

This plot looks pretty good for our predictions: teams win about as often as they are expected to. Now we again check for a difference between in-division and out-of-division games. This time we get both datasets on one plot by separately calculating the win percentages for in-division and out-of-division data, then joining the two datasets and plotting them in one chart.
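One way to build that combined chart from the bucketed data:

```python
# Win rate per bucket, split by in-division vs. out-of-division games.
by_div = (games.groupby(['prob_bucket', 'in_division'])['result']
               .mean()
               .unstack('in_division')
               .rename(columns={True: 'In division', False: 'Out of division'}))

by_div.plot(kind='bar', figsize=(10, 5))
plt.xlabel('Predicted win probability for team 1')
plt.ylabel('Actual win rate')
plt.show()
```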

Once again there don't seem to be any obvious differences between our datasets. The data looks a bit noisy, but there are no clear trends. If teams in the same division are more likely to pull off upsets we would expect to see in division teams with low probabilities win more than out of division teams with low probabilities, and vice versa with high win probabilities. Once again we don't really see this in the data.

One piece of information our plots leave out is how many games are included in each bucket. As we will see below, our predictor is relatively cautious and doesn't give many games extreme probabilities like >90% or <10%. To show the distribution of predictions we will make a plot where the height of bars represents how many games are in each bucket. The bars are colored based on how many of those games are wins or losses. To create this plot we will plot the number of wins in each bucket on top of the number of games.
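A rough version of that plot (note that ties contribute 0.5 to the win count in this sketch):

```python
# Total games per bucket, with the number of wins drawn on top of the total
# so the remaining height of each bar represents losses.
counts = games.groupby('prob_bucket')['result'].agg(games='count', wins='sum')

plt.figure(figsize=(10, 5))
plt.bar(range(len(counts)), counts['games'], label='Losses')
plt.bar(range(len(counts)), counts['wins'], label='Wins')
plt.xticks(range(len(counts)), [str(i) for i in counts.index], rotation=90)
plt.xlabel('Predicted win probability for team 1')
plt.ylabel('Number of games')
plt.legend()
plt.show()
```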

Prediction accuracy

A good way to evaluate our predictions is to test accuracy. This is a very simple measurement that asks whether the team our prediction gave the higher chance of winning ended up winning. It is useful because it has the clearest definition of an upset: the team that was expected to lose won. If we could only do one test to answer our original question, this would be it.
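A sketch of the accuracy calculation; ties are dropped here, and how to treat games predicted at exactly 50% is a judgment call:

```python
# An upset is a game where the team given the lower win probability wins.
decided = games[games['result'] != 0.5]
correct = (decided['qbelo_prob1'] > 0.5) == (decided['result'] == 1.0)

print('Favorite win rate, in division:    ', correct[decided['in_division']].mean())
print('Favorite win rate, out of division:', correct[~decided['in_division']].mean())
```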

It turns out in-division games are slightly more likely to end with the underdog losing than out-of-division games. The difference is very small, so it might be noise (we will test this more rigorously below). Whether it is noise or a meaningful relationship, it seems pretty safe to say that upsets are not more common for in-division games: the favored team wins 63% of the time in division and 62% of the time out of division.

Hypothesis Testing

Our results above left an important question unanswered: is the difference we noticed statistically significant? Put another way, could the difference in underdogs winning be plausibly explained by random chance or should we conclude that there is a meaningful relationship between division and upsets? In order to test this it will be helpful to have a different definition of accuracy.

Below we define two functions to quantify error, where accuracy is understood as low error. Logistic error is used to quantify the accuracy of binary predictions; in this case we are checking how accurate we are in predicting which team wins. Square error is used to quantify the accuracy of continuous predictions; in this case we will be checking how accurate our point margin predictions are.
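One common way to define these two error measures (the original notebook's exact definitions may differ):

```python
# Logistic (cross-entropy) error of a predicted win probability vs. the result.
def logistic_error(prob, result):
    return -(result * np.log(prob) + (1 - result) * np.log(1 - prob))

# Squared error of a predicted point margin vs. the actual margin.
def square_error(pred_margin, margin):
    return (pred_margin - margin) ** 2

games['log_error'] = logistic_error(games['qbelo_prob1'], games['result'])
games['sq_error'] = square_error(games['pred_margin'], games['margin'])
```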

Now it is time to conduct hypothesis testing. The two questions we will be testing are:

  1. Can differences in accuracy between in division and out of division game predictions be explained by random chance?
  2. Can differences in accuracy between in division and out of division point spreads be explained by random chance?

These questions might be a bit confusing, so let's step back and see what precisely we will be doing:

The T-Test

When we calculated accuracy we saw that in division predictions were slightly more accurate than out of division predictions. We want to see whether that difference is statistically significant. If a difference is statistically significant, that means it would be very unlikely to see the difference we see without an underlying cause. If the difference is not statistically significant, that means the difference could easily be explained by random noise in the data.

To test this we are using a t-test. The t-test calculates the p-value, which represents the probability of seeing a difference at least as big as the one we observe given that there is no actual relationship. The possibility that there is no relationship between division and accuracy is called the null hypothesis, and the possibility that there is a relationship is called the alternative hypothesis. If we find that there is a high chance of seeing data similar to ours given no relationship, we say that we have failed to reject the null hypothesis and cannot conclude that there is any relationship between accuracy and division at all. If there is a very small chance of seeing data like ours given no relationship, we say that we have rejected the null hypothesis and conclude that there is a relationship. The most commonly used threshold for statistical significance is 5%, so if p > 5% we fail to reject the null hypothesis, but if p < 5% we reject the null hypothesis.

We will be conducting two t-tests: one for predicted game results and one for predicted margins. In both cases we will be comparing the errors of in-division games to the errors of out-of-division games. Higher errors mean predictions are less accurate, which means upsets are more common.
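Using scipy's two-sample t-test on the error columns defined above:

```python
# Compare prediction errors for in-division vs. out-of-division games.
in_div = games[games['in_division']]
out_div = games[~games['in_division']]

t_win, p_win = stats.ttest_ind(in_div['log_error'], out_div['log_error'])
t_margin, p_margin = stats.ttest_ind(in_div['sq_error'], out_div['sq_error'])

print('Win prediction errors:    p =', p_win)
print('Margin prediction errors: p =', p_margin)
```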

In both cases, our p-value is above 5%, so we fail to reject the null hypothesis. That means we cannot conclude that there is any relationship between division and accuracy of predictions.

Prediction

The last thing to do with our data is to use it for prediction. We will use logistic regression and linear regression on Elo scores, first ignoring division and then including it. To evaluate our regressions we will use ten-fold cross-validation. This randomly divides our dataset into 10 parts. The regression model is trained on 9 of the parts and then tested on the 10th. This is repeated 10 times so that the model is tested on all 10 parts of the data, and we average the test scores across all 10 parts.

So what is the score? For logistic regression we use accuracy as the score. This is exactly the same accuracy calculation we did earlier: how often did the favored team win? For linear regression we use R-squared as the score. R-squared measures how much of the variance in y is explained by X. For example, if R-squared = 0.40 then we can explain 40% of the variance in the outcome from the teams' Elo scores.

Before we start regression, we have to prepare the data we will be regressing on:
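A sketch of that preparation, reusing the columns defined earlier; ties are dropped so the win/loss target is binary:

```python
# Features: the two teams' Elo ratings, with and without the in-division flag.
decided = games[games['result'] != 0.5]
X_win = decided[['qbelo1_pre', 'qbelo2_pre']]
X_win_div = decided[['qbelo1_pre', 'qbelo2_pre', 'in_division']].astype(float)
y_win = decided['result']

# For margin prediction (linear regression) every game can be kept.
X_margin = games[['qbelo1_pre', 'qbelo2_pre']]
X_margin_div = games[['qbelo1_pre', 'qbelo2_pre', 'in_division']].astype(float)
y_margin = games['margin']
```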

Now we can run our first regression. Here we run logistic regression using only the two teams' Elo ratings.
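Roughly:

```python
# Ten-fold cross-validated accuracy using only the two Elo ratings.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_win, y_win,
                         cv=10, scoring='accuracy')
print('Accuracy (Elo only):', scores.mean())
```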

Next we run the same regression, but in addition to Elo ratings we also train on whether the teams are in the same division.
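The same call, now with the in-division flag included as a feature:

```python
# Ten-fold cross-validated accuracy using Elo ratings plus the division flag.
scores_div = cross_val_score(LogisticRegression(max_iter=1000), X_win_div, y_win,
                             cv=10, scoring='accuracy')
print('Accuracy (Elo + division):', scores_div.mean())
```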

The accuracy of our predictor is almost exactly the same with or without knowledge of division. This means that knowing whether two teams are in the same division is completely useless when predicting who will win with our logistic regressor.

Something else to note is that our accuracy of 65% is only slightly better than the 63% accuracy we got from just using the built in Elo formula. This might suggest that the built in formula is as good as you can do with Elo ratings. An interesting follow-up project would be to check whether the difference in accuracy between the built-in Elo prediction formula and linear regression is statistically significant.

Now we move on to linear regression on margin of victory. We are using the same features to predict the margin, but because margin is a continuous value we use linear regression instead of logistic regression.
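For example:

```python
# Ten-fold cross-validated R-squared for predicting point margin from Elo ratings.
lin_scores = cross_val_score(LinearRegression(), X_margin, y_margin,
                             cv=10, scoring='r2')
print('R-squared (Elo only):', lin_scores.mean())
```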

Once again we will repeat the linear regression, this time including division information.
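As before, the only change is the extra feature column:

```python
# Ten-fold cross-validated R-squared with the in-division flag included.
lin_scores_div = cross_val_score(LinearRegression(), X_margin_div, y_margin,
                                 cv=10, scoring='r2')
print('R-squared (Elo + division):', lin_scores_div.mean())
```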

Just like with logistic regression, there is basically no difference between our regressor with or without division information. It is interesting that we were able to achieve 65% accuracy in predicting game results, but we can only explain 14% of variation in score margin. This suggests that score margin is highly varied and hard to predict, something we saw earlier in the scatter plot of predicted margin vs margin. We never calculated R-squared for Elo's built-in margin predictor, but it would be an interesting follow up project to do that and test whether our linear regression is significantly better than that.

Conclusion

The question we set out to answer at the beginning was whether teams are more likely to pull off upsets against in-division opponents than out-of-division opponents. We used data visualizations, accuracy calculations, hypothesis testing, and machine learning to find an answer. In every case we got the same result: there is essentially no relationship between whether a game is played inside a division and whether the underdog wins. This may seem like a disappointing result, but it is a useful one. Sometimes our intuition and common knowledge are wrong, and when conventional wisdom is wrong about something like this we need a way to rigorously determine the truth. This is what data science is for!

During our exploration of this question we found some other questions that could be expanded upon in future analysis. Our Elo ratings come with a built-in method for calculating win probability and point margin. Using machine learning we got what appeared to be slightly better results. Are these results actually better, or is this just random noise? We would need to do hypothesis testing to find out. What about other methods of machine learning? Scikit-learn has lots of options to choose from.

There are a lot of other, more general questions we can also ask: if division doesn't influence the results of games, is there anything else that does? Is it possible to explain more than 14% of the variance in score margin? Can we beat 538's predictions over the long run?

Hopefully the knowledge you gained from this tutorial will be useful should you try to answer these questions, or any other questions you need data science to answer.