2017 NBA MVP race Vol. 1: Linear Regression says James Harden!

Two weeks into the playoffs, there is more and more talk about who will win the 2016/2017 NBA regular season MVP award. Different calculations and estimations are used to determine which player deserves it the most, but in the majority of cases the same four or five names come up, though not necessarily in the same order: James Harden, Russell Westbrook, Stephen Curry, LeBron James and Kawhi Leonard.

In this blog post we present the results of our first MVP data analysis, which was conducted to determine who has the best chance of winning the award this season. We combined different basketball-reference.com data into a dataset with the most relevant regular and advanced statistics for all players who received at least one vote in MVP voting between 1979/1980 (the introduction of the 3-point shot) and 2015/2016. We also wanted to establish how effective and accurate linear regression can be for predicting outcomes in NBA basketball.

First, we wanted to examine which variables show the biggest differences between MVPs and the other candidates who finished behind them in the voting. The most significant differences were in Win Shares (both Offensive and Defensive) and Box Plus/Minus (also both Offensive and Defensive). Other notable discriminating variables were 3-point Attempt Rate, Winning Percentage (team) and Assist Percentage. However, there were no interpretable differences in Free Throw Attempt Rate, Field-Goal Percentage or Block Percentage.

[Figure: differences between MVPs and other candidates]

Second, we carried out a linear regression analysis. Regression analysis is a statistical process for estimating relationships among variables in a model, with a focus on the relationship between a dependent (response) variable and one or more independent (explanatory) variables. MVP voting systems in the NBA have changed over time, so we decided to use Share of maximum possible number of MVP points (a ratio variable, range 0.0-1.0) as the response variable. However, because there were many more players with a low number of votes than with a high number, the distribution was skewed, and we had to take log10(x) of that variable to create a new dependent variable with a distribution closer to normal. We also had to pay attention to the other assumptions of multiple linear regression, namely linear relationships, homoscedasticity and no multicollinearity. While we didn’t notice any significant issues when testing the first two assumptions, we were not able to use all regular and advanced player statistics in the model due to multicollinearity. The following variables had very high correlation coefficients with the other listed variables:

  • Win Shares Per 48 was not used in the linear regression analysis because it was highly correlated with Win Shares and PER (Hollinger);
  • Total Rebounds Percentage and Total Rebounds were not used because they were highly correlated with Offensive and Defensive Rebounds Percentages;
  • Because of multicollinearity, the advanced statistics AST%, STL% and BLK% were used instead of the more common Assists, Steals and Blocks statistics;
  • Offensive and Defensive Win Shares were not used because they were highly correlated with Win Shares and Defensive Box Plus-Minus, respectively;
  • Value over Replacement Player was not used; we kept Box Plus-Minus instead.
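A screen for this kind of multicollinearity can be sketched by computing pairwise Pearson correlations between candidate predictors and flagging the pairs above a cutoff. This is only an illustration: the 0.9 threshold, the function names and the four-player data are assumptions made for the example, not the values or the procedure used in the analysis.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    The 0.9 default is an assumed cutoff chosen for illustration.
    """
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Made-up numbers for four hypothetical players, for illustration only.
columns = {
    "WS": [15.0, 12.5, 10.1, 8.3],
    "WS/48": [0.250, 0.210, 0.170, 0.140],
    "3PAr": [0.10, 0.45, 0.05, 0.30],
}

flagged = highly_correlated_pairs(columns)
for pair in flagged:
    print(pair)
```

With these invented numbers only the Win Shares and Win Shares Per 48 pair is flagged, which mirrors the reasoning in the first bullet: when two predictors carry nearly the same information, one of them is dropped.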

As we can see, there are many different constructs and indexes for measuring the effectiveness and contribution of a player, but many of them are very similar measurements, based on the same or similar combinations of basic basketball statistics. In the end, 21 explanatory variables were used in the regression analysis. In addition to the variables kept in the list above (after checking for multicollinearity), we also used Games Played, Minutes, Field-Goal Percentage, 3-point Field-Goal Percentage, Free-Throw Percentage, Winning Percentage (team), True Shooting Percentage, Three-Point Attempt Rate, Free-Throw Rate, Usage Rate, and Offensive, Defensive and “Total” Box Plus-Minus. Also, out of all players we selected only those who collected at least 5% of the maximum number of MVP voting points, in order to avoid including players who received votes from only a few (possibly biased) journalists.
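The selection rule and the log transformation of the response variable can be sketched as follows. The player names and vote shares are hypothetical, and `prepare_response` is an illustrative helper, not part of the original analysis.

```python
from math import log10

MIN_SHARE = 0.05  # cutoff from the post: at least 5% of the maximum points


def prepare_response(candidates, min_share=MIN_SHARE):
    """Keep candidates at or above the cutoff and add log10 of their share."""
    prepared = []
    for name, share in candidates:
        if share >= min_share:
            prepared.append((name, share, log10(share)))
    return prepared


# Hypothetical vote shares, not real voting results.
candidates = [
    ("Player A", 0.90),
    ("Player B", 0.40),
    ("Player C", 0.10),
    ("Player D", 0.01),
]

prepared = prepare_response(candidates)
for name, share, log_share in prepared:
    print(f"{name}: share={share:.2f}, log10(share)={log_share:.3f}")
```

Taking the logarithm stretches the crowded low end of the share scale and compresses the high end, which is why it reduces the skew described above.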

Our multiple linear regression analysis (stepwise method) revealed statistically significant relationships between log10(x) of Share of maximum possible number of MVP points (the outcome variable) and the following seven explanatory variables:

  • Winning Percentage (team) (standardized beta coefficient: 0.530)
  • Minutes Played per game (standardized beta coefficient: 0.323)
  • PER (standardized beta coefficient: 0.587)
  • Turnover Percentage (standardized beta coefficient: 0.194)
  • Three-point Percentage (standardized beta coefficient: -0.220)
  • Free Throw Percentage (standardized beta coefficient: 0.232)
  • Total Shooting Percentage (standardized beta coefficient: -0.142)
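Standardized beta coefficients make predictors measured on different scales comparable: an unstandardized slope b is multiplied by the ratio of the predictor's standard deviation to the response's standard deviation. The sketch below shows this for a single predictor only; it is not the stepwise multiple regression used in the analysis, and the data points are invented.

```python
from statistics import mean, stdev


def ols_slope(xs, ys):
    """Unstandardized slope of a one-predictor least-squares fit."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def standardized_beta(slope, xs, ys):
    """Convert an unstandardized slope to a standardized beta."""
    return slope * stdev(xs) / stdev(ys)


# Invented values: a PER-like predictor and a log10(share)-like response.
per = [22.0, 25.0, 27.0, 30.0, 31.0]
log_share = [-1.0, -0.7, -0.5, -0.2, -0.1]

b = ols_slope(per, log_share)
beta = standardized_beta(b, per, log_share)

# Rescaling the predictor changes the raw slope but not the standardized beta.
per_scaled = [x * 100 for x in per]
b_scaled = ols_slope(per_scaled, log_share)
beta_scaled = standardized_beta(b_scaled, per_scaled, log_share)

print(b, b_scaled)
print(beta, beta_scaled)
```

This is why the list above can compare, say, Winning Percentage (a fraction) with PER (a rating in the tens) even though their raw coefficients differ by orders of magnitude.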

It comes as no surprise that players on good teams with higher PERs have a better chance of receiving more MVP votes. The fact that players who play more minutes have a better chance of winning the award is quite amusing and, in combination with the high standardized beta coefficient for PER, shows that being valuable is not only about efficiency, but also about the ability to stay on the floor longer and the total contribution in those extra minutes. On the other hand, the beta coefficient values of the other variables seem more challenging to explain.

In order to predict the number of MVP points in the 2017 NBA voting for our MVP candidates, we have to use unstandardized beta coefficients. The following formula can be applied for estimation:

log10(1000 * Share of maximum possible number of MVP points) = -3.172 (Constant) + 2.344 * Winning Percentage + 0.053 * Minutes Played per Game + 0.077 * PER + 0.026 * Turnover Percentage - 0.638 * Three-point Percentage + 1.106 * Free-Throw Percentage - 1.569 * Total Shooting Percentage
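The formula above can be sketched as a small function. The coefficients are the ones quoted in the formula; the units are an assumption made for this example (Winning, Three-point, Free-Throw and Total Shooting Percentage as fractions between 0 and 1, Turnover Percentage in percentage points), and the stat line is hypothetical rather than any real player's.

```python
# Unstandardized coefficients as quoted in the formula above.
INTERCEPT = -3.172
COEFFICIENTS = {
    "win_pct": 2.344,           # team winning percentage (assumed 0-1)
    "minutes_per_game": 0.053,
    "per": 0.077,
    "tov_pct": 0.026,           # assumed to be in percentage points
    "three_pt_pct": -0.638,     # assumed 0-1
    "ft_pct": 1.106,            # assumed 0-1
    "shooting_pct": -1.569,     # "Total Shooting Percentage", assumed 0-1
}


def predicted_log_points(stats):
    """Left-hand side of the formula: log10(1000 * share of max points)."""
    return INTERCEPT + sum(COEFFICIENTS[key] * stats[key] for key in COEFFICIENTS)


def predicted_points(stats):
    """Undo the log10 transform to get estimated points out of 1000."""
    return 10 ** predicted_log_points(stats)


# Hypothetical stat line, for illustration only.
example = {
    "win_pct": 0.650,
    "minutes_per_game": 36.0,
    "per": 27.0,
    "tov_pct": 15.0,
    "three_pt_pct": 0.350,
    "ft_pct": 0.850,
    "shooting_pct": 0.600,
}

log_points = predicted_log_points(example)
points = predicted_points(example)
print(round(log_points, 3), round(points))
```

Because the model is fitted on a log scale, each coefficient acts multiplicatively on the estimated points: adding one unit of a predictor multiplies the estimate by 10 raised to that predictor's coefficient.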

Since there are 100 voting journalists this year, the maximum possible number of points is 1000 (100 * 10 first-place points), hence the multiplier 1000 in the log10 equation. Using this equation, we can estimate the number of points that our MVP contenders would collect: James Harden 484 points, Russell Westbrook 415, Stephen Curry 312, Kawhi Leonard 290 and LeBron James 206. However, since a total of 2600 points is available (100 journalists * (10+7+5+3+1)), the sum of 1394 points for the top 5 MVP contenders seems a bit too low. In the past, the top 5 MVP contenders collected between 90 and 95% of all points, so it seems safe to assume that our top 5 2017 MVP contenders will collect approximately 2400 points. In that case, just for the sake of simplicity, we multiply those players’ scores by the same coefficient, which gives the following estimated points:

[Figure: estimated MVP voting points]
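The proportional rescaling described above can be sketched as follows. One caveat: the five per-player estimates quoted in the text add up to 1707, not the 1394 mentioned there, so this sketch simply scales by the sum of the listed values. That choice is an assumption, and the resulting numbers may differ from the authors' own estimates.

```python
# Points available with 100 voters and a 10-7-5-3-1 ballot, as in the post.
MAX_TOTAL_POINTS = 100 * (10 + 7 + 5 + 3 + 1)

# The post's assumption for the combined total of the top 5 contenders.
ASSUMED_TOP5_TOTAL = 2400

# Raw model estimates quoted in the post.
raw_estimates = {
    "James Harden": 484,
    "Russell Westbrook": 415,
    "Stephen Curry": 312,
    "Kawhi Leonard": 290,
    "LeBron James": 206,
}


def rescale(estimates, target_total):
    """Multiply every estimate by one factor so the total hits target_total."""
    factor = target_total / sum(estimates.values())
    return {name: points * factor for name, points in estimates.items()}


scaled = rescale(raw_estimates, ASSUMED_TOP5_TOTAL)
for name, points in scaled.items():
    print(f"{name}: {points:.0f}")
```

Using one common factor keeps the ranking and the relative gaps between players unchanged; only the overall level moves.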

To sum up, the final estimation actually seems quite plausible. In this record-breaking season (see our previous blog post) quite a few players really stood out, so it is safe to assume that the differences will be relatively small and that nobody will repeat Curry’s 2016 record of the maximum number of first-place votes (131). However, our regression model is far from perfect: the R value equals 0.745 and the Adjusted R Square value equals 0.541, which means that only 54.1% of the variation in the response variable is explained by our linear model. That could be one of the reasons why the points calculated with the equation were lower than expected. We also noticed that the unexplained variation led to smaller estimated point differences between players than the point differences seen in practice. Still, the model proved to be moderately accurate when used for estimations in past seasons. From the 1994/1995 season on, the model correctly picked 16 out of 22 MVPs and, on average, 3 of the top 4 players in MVP voting. We believe that the best use for this model would be to produce an initial list of the top 5 MVP contenders; additional analysis, e.g. logistic regression, social media analysis (similar to Twitter and crowd-sourcing analysis) or even search term analysis (Google Trends), could then be carried out to make more accurate estimates. And that is something that we actually intend to do in the upcoming weeks. So stay tuned!
