Project Title

Cities’ Effects on an NBA Player’s Performance



Introduction

Many factors go into the performance of an NBA player on a given night. Because NBA players on the road have a lot of free time, the sights and attractions a city offers can be very distracting before an upcoming game. For my project this semester, I analyzed the correlation between a player’s performance and a city’s activity level. To classify a player’s performance, I went through the box score data and categorized each game in each city as average, above average, subpar, or an anomaly. From there, cities were ranked and given an activity level value based on several characteristics of the city. Finally, I performed linear regressions of player performance on city activity level for each player position in the NBA.

Data Collection

To collect the data for this project, I split the task into two parts:
1) Player Data Collection
2) City Activity Data Collection
For the player data, I used two datasets: box score data, which contains the box score for every player in every game from 2003 to 2019, and player average data, which contains every player’s averages for each season from 2003 to 2019. Both datasets were CSV files provided by the NBA. For the city activity levels, the data was collected through the Yelp API and by web scraping tripadvisor.com. With the Yelp API, I took the average rating of all restaurants, bars, and clubs in each NBA city. From tripadvisor.com, I scraped the attractions section of the website to find all the attractions listed for each city. Both of these factors were used when calculating a city’s activity level.

Data Cleaning and Processing

1) Player Performance
To measure a player’s performance in a given game, the metrics used were determined by the position the player plays. There are five positions in basketball: Point Guard, Shooting Guard, Small Forward, Power Forward, and Center. The statistics that indicate a good game differ from one position to another. For example, assists are a very important statistic for judging a point guard’s performance, while they matter far less for a center. Therefore, I assigned each player a set of relevant statistics based on their position.

  • Point Guard - PTS, AST, TOV, FG%, STL, +/-
  • Shooting Guard - PTS, FG%, 3PT%, TOV, +/-
  • Small Forward - PTS, 3PT%, FG%, TOV, +/-
  • Power Forward - PTS, REB, FG%, BLK, FOULS, +/-
  • Center - PTS, REB, FG%, BLK, FOULS, +/-
Using these statistics, I determined a player’s performance in a given game by calculating the z-score of each relevant statistic against the player’s season average and then averaging the z-scores across all n relevant statistics. The formula for a player’s performance score P is shown below, where \(\overline{RS_i}\) is the player’s season average for the statistic and \(\sigma_{RS_i}\) is its standard deviation.

\[P=\frac{1}{n}\sum_{i=1}^{n}\frac{RS_i-\overline{RS_i}}{\sigma_{RS_i}},\quad\text{where }RS_i\text{ is the }i^{\text{th}}\text{ relevant stat}\]

Then, I categorized each player performance as average, above average, subpar, or anomaly (positive or negative).
Average: -1 < P < 1
Above Average: 1 < P < 2
Subpar: -2 < P < -1
Anomaly: |P| > 2
Finally, for each city and each position, the data is presented as the number of games that fall into each of the four performance categories.
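
The scoring and categorization steps above can be sketched in a few lines of Python. This is only an illustration, not the code used in the project: the function names, the dictionary layout, and the sample numbers are hypothetical. The formula leaves scores of exactly ±1 or ±2 unassigned, so the sketch assumes those boundary values fall into the category closer to zero. It also follows the formula as written and does not flip the sign of statistics where lower is better, such as TOV or FOULS.

```python
from collections import Counter

def performance_score(game, season_mean, season_std):
    """Average the z-scores of a game's relevant stats (the score P)."""
    z_scores = []
    for stat, value in game.items():
        std = season_std[stat]
        if std == 0:
            continue  # a stat with no variation has no defined z-score
        z_scores.append((value - season_mean[stat]) / std)
    if not z_scores:
        return 0.0  # assumption: treat a game with no usable stats as average
    return sum(z_scores) / len(z_scores)

def categorize(p):
    """Map a score P to a category (boundary handling is an assumption)."""
    if abs(p) > 2:
        return "anomaly"
    if p > 1:
        return "above average"
    if p < -1:
        return "subpar"
    return "average"

# Hypothetical point guard game, using only two stats to keep it short.
game = {"PTS": 30, "AST": 10}
season_mean = {"PTS": 20, "AST": 8}
season_std = {"PTS": 5, "AST": 2}

p = performance_score(game, season_mean, season_std)
print(p, categorize(p))  # 1.5 above average

# Tally categories for one position in one city (hypothetical scores).
scores_in_city = [0.3, 1.5, -1.2, 2.4, -0.1]
counts = Counter(categorize(s) for s in scores_in_city)
print(counts["average"], counts["anomaly"])  # 2 1
```

Dividing each count by the number of games in that city gives the per-category percentages used later in the regressions.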

2) City Activity Level
I split the city activity level into two parts.

  • Restaurant, Bar, and Club ratings
  • Number of attractions
For the Restaurant, Bar, and Club ratings, I used the Yelp API to find the average rating of all of these places in each city and multiplied it by 0.2. Because Yelp ratings are on a five-star scale, the resulting metric is between 0 and 1.
For the number of attractions, I divided each city’s number of attractions on Tripadvisor by the number of attractions in the city with the most attractions. Therefore, each city has a number of attractions metric between 0 and 1.
Finally, I added these two metrics to get a city activity level metric between 0 and 2.
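
As a minimal sketch of this calculation, assuming the Yelp ratings and attraction counts have already been collected, the two components can be combined as follows. The function names and the sample inputs are hypothetical and are not taken from the project data.

```python
def rating_component(ratings):
    """Mean Yelp star rating multiplied by 0.2 (scales a 5-star rating to at most 1)."""
    return 0.2 * sum(ratings) / len(ratings)

def attraction_component(count, max_count):
    """City's attraction count divided by the largest count among all cities."""
    return count / max_count

def activity_level(ratings, count, max_count):
    """Sum of the two components, giving a value between 0 and 2."""
    return rating_component(ratings) + attraction_component(count, max_count)

# Hypothetical city: three rated venues and 500 attractions,
# where the city with the most attractions has 1000.
level = activity_level([4.0, 4.5, 3.5], count=500, max_count=1000)
print(round(level, 2))  # about 1.3
```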

Results

Once I had gathered each city’s bar, club, and restaurant ratings as well as its number of attractions, I calculated each city’s rating and ranked the cities by it. Los Angeles was the highest with a rating of 1.89 and Oklahoma City was the lowest with a rating of 0.69.

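| City | Rating |
|---|---|
| Los Angeles | 1.89 |
| New York | 1.86 |
| Miami | 1.82 |
| Washington D.C. | 1.77 |
| Chicago | 1.73 |
| Brooklyn | 1.72 |
| Toronto | 1.64 |
| San Francisco | 1.61 |
| Houston | 1.52 |
| Boston | 1.48 |
| Philadelphia | 1.43 |
| Dallas | 1.41 |
| Orlando | 1.38 |
| Atlanta | 1.32 |
| Detroit | 1.21 |
| Phoenix | 1.16 |
| New Orleans | 1.13 |
| Charlotte | 1.05 |
| Indianapolis | 1.03 |
| Denver | 0.98 |
| Minneapolis | 0.91 |
| Cleveland | 0.88 |
| Salt Lake City | 0.85 |
| Sacramento | 0.83 |
| Portland | 0.82 |
| Milwaukee | 0.75 |
| San Antonio | 0.73 |
| Memphis | 0.71 |
| Oklahoma City | 0.69 |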

Once the cities were ranked by activity rating and each position’s performance in each city was calculated, I ran 20 linear regressions, one for each combination of the five positions and four performance categories. In each regression, the dependent variable was the percentage of games in a given performance category for a given position in each city, and the independent variable was the city rating.
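
Each of these regressions is a simple least-squares fit with one predictor. The sketch below shows one way to compute the slope, the intercept, and the standard error of the slope in plain Python. It is not the project’s code, the function name is hypothetical, and the three data points are toy values chosen only to make the arithmetic easy to follow.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept.

    Returns (slope, intercept, s_b), where s_b is the standard error
    of the slope with n - 2 degrees of freedom.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s_b = math.sqrt(sse / (n - 2) / sxx)
    return slope, intercept, s_b

# Toy data (hypothetical): x plays the role of city rating,
# y the percentage of games in one category for one position.
slope, intercept, s_b = fit_line([0.0, 1.0, 2.0], [0.0, 1.0, 4.0])
print(round(slope, 3), round(intercept, 3), round(s_b, 3))
```

With the 29 cities in the table, n - 2 gives the 27 degrees of freedom used in the hypothesis tests.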

Of the 20 linear regressions that were run, the two with the strongest correlation were the percentage of above average games for a point guard and the percentage of anomaly games for a center.

To see whether there is a true linear correlation in these two sets of data, I performed a hypothesis test on each of them.

Percentage of Above Average Games for a Point Guard

x = City Rating
y = Percentage of Above Average Games for a Point Guard
Let \(y=\beta x+\alpha\) be the linear regression equation
\(\text{Null Hypothesis }(H_0):\beta=0\)
\(\text{Alternate Hypothesis }(H_a):\beta\not=0\)
\(\text{Significance Level}:0.05\)

\(\text{Degrees of Freedom}:27\)
\(\text{Regression Equation}:y=-0.13x+0.566\)
\(\text{Standard Error of the Slope}:s_b=0.038\)
\(\text{Test Statistic}:t=\frac{b-\beta}{s_b}=-3.39\)

Since the critical t-value for a two-sided t-test at the 0.05 significance level with 27 degrees of freedom is \(\pm 2.05\), and \(t<-2.05\), we can reject the null hypothesis, meaning that there is a significant linear correlation between point guard above average games and city rating.

Percentage of Anomaly Games for a Center

x = City Rating
y = Percentage of Anomaly Games for a Center
Let \(y=\beta x+\alpha\) be the linear regression equation
\(\text{Null Hypothesis }(H_0):\beta=0\)
\(\text{Alternate Hypothesis }(H_a):\beta\not=0\)
\(\text{Significance Level}:0.05\)

\(\text{Degrees of Freedom}:27\)
\(\text{Regression Equation}:y=0.043x-0.0203\)
\(\text{Standard Error of the Slope}:s_b=0.0131\)
\(\text{Test Statistic}:t=\frac{b-\beta}{s_b}=3.359\)

Since the critical t-value for a two-sided t-test at the 0.05 significance level with 27 degrees of freedom is \(\pm 2.05\), and \(t>2.05\), we can reject the null hypothesis, meaning that there is a significant linear correlation between center anomaly games and city rating.

From these hypothesis tests, we can conclude that there is a significant positive correlation between center anomaly games and city rating, as well as a significant negative correlation between point guard above average games and city rating.
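
The decision rule used in both tests can be sketched as a small Python function. The function name is hypothetical, and the default critical value of 2.05 is the one quoted above for 27 degrees of freedom. The inputs below are the rounded slopes and standard errors reported in this section, so the recomputed t-values differ slightly from the reported ones, presumably because of rounding.

```python
def slope_t_test(b, s_b, t_critical=2.05, beta0=0.0):
    """Two-sided t-test of H0: beta = beta0 for a regression slope.

    Returns the test statistic and whether H0 is rejected.
    """
    t = (b - beta0) / s_b
    return t, abs(t) > t_critical

# Point guard, above average games: reported b = -0.13, s_b = 0.038.
t_pg, reject_pg = slope_t_test(-0.13, 0.038)

# Center, anomaly games: reported b = 0.043, s_b = 0.0131.
t_c, reject_c = slope_t_test(0.043, 0.0131)

print(round(t_pg, 2), reject_pg)
print(round(t_c, 2), reject_c)
```

Both statistics exceed the critical value in absolute terms, which matches the conclusions drawn above.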

Conclusion

Although certain trends were seen between city activity levels and player performance, parts of the methodology can be changed to provide a more accurate result. For example, the strength of the home team in each city was not accounted for when measuring a player’s performance rating. This has a large impact on the results, because visiting players will naturally have lower performance ratings in cities with strong teams. In the future, I plan to account for this by including the home team’s record as an additional independent variable. Another future application of this project would be to use the data already collected to train a model that predicts a player’s performance based on the city they are playing in and the team they are playing against.

Semester

Spring 2020

Researcher

Jai Sankar