Basketball Analytics. Data Mining for Acquiring Performances

. Choices of decision makers in a basketball team are not limited to the strategies to be adopted during games. The most important ones are outside the ﬁeld and concern team composition and talented and productive players to acquire on which the team can rely to raise its game level. In this paper, we propose to use data mining tasks to help decision makers to make appropriate decisions that will lead to the improvement of the performance of their players and their team. Tasks such as clustering, classiﬁcation and regression are used to detect weaknesses of a team; best players that can help overcome these weaknesses; predict performance and salaries of players. These will be done on the NBA dataset.


Introduction
Data mining (DM) is used in many fields and the sports industry is no exception.In fact, sport is an ideal area for the implementation of data mining tasks and techniques.This is due to the amount of statistics collected on players, teams, games and seasons.Team managers wish to understand these data in order to extract information that will help them improve performances of their players and their teams by predicting future performances and affecting players in appropriate training group to make better profits.The Basket ball is one of those sports.A basketball team is composed of five players.A player can have a great impact on the efficiency of the team due to his talent.Indeed, in many cases, a basketball team with new acquisitions has seen its game significantly improved.Sometimes, the reverse phenomena can take place and observers might say that the added talent could not adapt his game to the rest of the team.And in other cases, one has seen teams with modest talents exceeding expectations.The difficulty of finding talents that are coherent with the rest of the team and the uncertainty of the stability of the acquired players performances amplify the risks taken by the team managers in their choices; especially if the league to which they belong is as competitive and exigent as NBA.
Please note that the LNCS Editorial assumes the western naming convention, with given names the structure of the names in the running heads and the author index.
NBA is one of the three major leagues in the United States.The results and statistics that it provides make it an ideal subject of study.Indeed, the amount of existing data is accessible and concerns over 1200 games played in a single season (Sports Reference, 2000).Moreover, the measured performances of the players, relative to their different positions, are available.To extract significant knowledge from this amount of data, DM tasks such as clustering, classification, regression can be used.In this work, we have two objectives: 1.Our first goal is to study acquisition of new players.Before acquiring new players, it is necessary to determine team's needs.For this purpose, a clustering of the players is done according to their performances.It will give managers a description of players performances levels and hence allow detecting weaknesses that will help in deciding which talent to acquire.2. The second goal is to insure the stability or performance improvement of the chosen player.This can be done by predicting his future performances and cost according to his efficiency.All the proposed tasks have been tested on the data of 10 NBA seasons (from 2005-2006 to 2014-15).
The rest of the paper is organized as follows: section 2 presents related work in the literature.In section 3, the data used to test different tasks is described.Then in section 4, we present some DM tasks and techniques to be used for Basket ball and run them on NBA dataset in section 5.

Related works
The NBA is one of the major leagues of the US.It represents an extremely competitive environment for players and teams.Moreover, it is the subject of intensive research and perfect for the application of DM due to the amount of data generated by multiple meetings and sport events.In Basketball, it was Dean Oliver who introduced statistical basketball analysis in his book "Basketball on Paper" [4].Nowadays, it is regarded as a revolution for the Basketball statistical analysis.The literature concerning Basketball analysis is quite abundant.Among these works, one can find those of [5].They were interested in identifying the variables which had a potential effect on players performance from a game to another, using linear regression.They, first, used a linear mixed model (LMM), to model Points (P ) according to variables with both fix and random effects.Then, they used a generalized linear mixed model (GLMM) to model Win Score (WS ).They concluded that minutes played, percent utilization, and team quality difference variables were among the ones that had an impact on P and WS.
[1] studied the impact of a group of two or three players (called Big 2's or Big 3's ) on successive wins of a team.To this end, [1] grouped players, according to their level, in appropriate groups of two or three, using k-means algorithm.This clustering is done according to some features as points per game, offensive rebounds per game, defensive rebounds per game, assists per game, ect.A regression model was then used to measure the impact of the composition of "Big 2s" and "Big 3s " on winning teams.He showed that the composition of a teams top 2 and top 3 players is a factor with a high statistical significant in the success of a team, and showed which combinations yielded over-performance, and which combinations yielded underperformance, relative to the teams talent and coaching quality.
The successive victories of teams have also been studied in [8], where their income and victories were evaluated according to the performance of their players.A regression method was used to study the relationship that may exist between PER (Player Efficiency Rating) and variables that determine teams wins and earnings.He concluded that the defensive abilities contributed to win games and thus, ensured greater income, knowing that every win brings 3% more revenue.Among the existing works, some focused on prediction.Among them, [6] studied which teams would make the NBA playoffs.They collected and analysed team data using Principal Components Analysis to reduce the dimensionality of the data set and then a Discriminant Analysis to predict the classifications of teams into playoffs or non-playoffs.In their paper, NBA Oracle [2], supervised and unsupervised learning methods are applied for predicting game outcomes and providing guidance and advice for common decisions in the field of professional basketball.For game outcome prediction, four different binary classification techniques are compared according to their accuracy prediction: Linear Regression, SVM (Support Vector Machines), Logistic Regression, and ANN (Artificial Neural Networks).In the same paper, k-Means is applied to infer optimal player positions and Outlier Detection to identify outstanding players.Neural networks have also been used for predictive end in [9].Some others works focused on the prediction of points scored and therefore the results of games using regression models [11].
The clustering was mainly used in the construction of teams [1].However, in our work, studying similarities between performances of players is only a first step in the acquisition process.Indeed, the resulting groups aim to determine the weaknesses of a team and help choose the right player(s) that will reinforce the team.This will be done using clustering and classification.Thereafter, our goal is to predict the performance of a player (assume selected) and his salary to determine whether he is a good option for the team.Regression and time series are used for this purpose.

Data
We have chosen to use the data source " Basketball-references.com", a web site with a rich data base.One can find there, statistics on over 60 NBA seasons.They are reliable and contain few missing values [3].The elements that point out the performance of a player during the game are divided into four categories: -Defensive performance: these are indicators that ensure the defensive play of a player.They are represented by: blocks, defensive rebounds and steals.-Offensive performance: indicators that show the offensive play.They are represented by: Offensive personal fouls and rebounds.-Scoring: all indicators that have a relationship with the scoring and points: the attempts of field goals, attempts goals successful on the field, attempts goals from three points on the ground, the attempts of goals successful threepoint field, attempts goals from two points on the ground, successful goals attempts goals from two points on the ground, free throws, the successful free throws.-Play-making : It gives information on the participation of a player in building the game on the field.It is represented by: the assists, turnovers, number of games played and minutes spent on the field.The performance of a given team is represented by the sum of performance indicator values of all its players.
We worked on two types of indicators to best meet our needs in terms of players performance description.The first type consists of basic statistics (rebound, interception points...) as they give a summary of the performance during a particular season.The second type allows a more detailed overview of players performances (a zoom) according to a particular event (match or possession).

Data Mining tasks for basket ball analysis
In this section we will present the different DM tasks that we used for basket ball analysis.

Clustering
Clustering is a descriptive task of DM.It consists on partitioning the data into homogenous clusters using similarity measures.The objective of applying clustering on a set of NBA players is to form groups of players statistically homogenous as they differ in their playing style.In this study, the data used for clustering represents the 10 NBA seasons from 2005-06 until 2013-14.
The clustering allows to have an overview on Player's Skills and to compare their efficiency according to the partition they belong to.As each player plays in a particular position, clustering is also used to compare players of the same position according to their performance.It also provides decision makers with elements for future analysis that enable them to identify potential undervalued players or even to propose appropriate salaries to the players.
Moreover, the use of clustering could be extended for comparative purposes.This is done by applying clustering after the prediction in order to compare the salaries of the players based on their future statistics.This is useful when multiple players meet the need of the team.In this case, the one whose value increases will have an advantage over others, whether for performance or salary.We used K-means algorithm for its simplicity and efficiency on large dataset

Classification
Classification is a predictive task of DM.It consists on affecting new objects to existing groups.In our case, we apply classification on the results of clustering.It will determine to which cluster, a new player can belong according to his features.Hence, this classification allows to have an overview on the players competences depending on the cluster to which he belongs and the players therein.This classification is interesting when acquiring a new player specifically when one wants to substitute a player against a new one, having the same characteristics or to fill a gap in the team.Indeed, once the clusters have been defined using a sample of players according to their skills, the classification will determine to which cluster a new player will belong according to its descriptive features .
We have chosen to use Nave Bayes algorithm (see [10] for more details).It is a supervised classification algorithm based on the Bayes theorem with strong features independence hypothesis.Its principle is as follows: suppose, we have K clusters, and we went to know cluster of a new entries X. X is affected to a cluster C k if: Where, P (C j ) is the apriori probability of the cluster C j and P (X/C j ) the probability of X in cluster C j .The advantage of this algorithm is that it requires relatively little training data.

Prediction
Basketball is a very competitive sport and a great business.As a result, prediction becomes important.Indeed, NBA statistics have particular relevance for team managers and coaches.When acquiring a new player or signing and renewing a contract, it is possible to project the player history in order to predict future trends of his performance and salary, and make the right decisions.This is done by studying and analysing the existing relationships in past occurrences of that players performances.In this perspective, the multiple linear regression and exponential smoothing on a set of historical data retrieved from the database according to specific needs are used.We can use the results of predictive analysis to evaluate a given player and predict future values of his performance and salary.Moreover, we can predict teams costs.
Performance Prediction Several metrics and measures are used to evaluate the performance of a player in the NBA.Some represents performance detailed on several axes (defensive, offensive, shooting rate), some reflect the player's contribution to the victory of his team and others are more general and global.We are interested, in this study by predicting the PER (Player Efficiency Rating).
It is a performance measure that summarizes the performance of a player during a season in a single number.This metric reflects individual performance of a player in several sports.It was proposed by [7].In Basketball, Hollinger's formula takes into account many variables that represent the positive achievements (the points scored on the ground (FG), defensive rebounds (DRB ) ...), the negative effects (personal faults (FP ), team faults (FT ) ...) and adjust them according to games' time and games rhythm.The value of the average PER of the league is set to 15.00; it allows to compare the performances of players over the seasons.This task allows to predicts the future value of PER based on a set of performance indicators selected.To do so, we use linear regression model to predict this performance to evaluate players performance indicators and selecting those that have a significant effect on the PER and predict their values.
Salaries prediction Players salaries through the seasons are considered as time series.Each observation is associated to a particular season.The income of player at time t is a function of previous incomes.The future value of the players salary can be predicted in short term, using exponential smoothing.This method is used when the series (T observations) are not seasonal.

Tests and Results
In this section we will show through different scenarios, how the different tasks of data mining, cited above, can help managers make decisions on acquiring efficient players for basket ball teams.Indeed, we have as objective to group players according to their performance, evaluate and predict player performance and predicting the salaries of players and franchises costs.

Clustering
We present two scenarios of clustering.In the first one, the players are grouped according to their performances and in the second one, we focus on the move of players from cluster to another across the seasons.
Scenario 1 We will present in what follows the result of a clustering which purpose is to have a comprehensive view of the talents distribution, either in the league or in a team.It also provides an overview of the skills of the players according to the cluster they belong to compared to their opponent.We applied K-means, with k = 6, on features of 337 players of season 2012-2013.The used feature were: Minutes played (MP ), Field Goals (FG), Field Goals Attempt (FGA,), basket of threes Points (3P ), basket of threes Points Attempt (3PA), basket of two Points (2P ), basket of two points attempt (2PA), Free Throw (FT ), Free Throw Attempt (FTA), Offensif ReBound (ORB ), Defensive ReBound (DRB ), ASsisTs (AST ), Interception (STL), Block (BLK ), Turnover (TOV ), Points (PTS ), Player Efficiency Rating (PER).Table 1 represents this clustering by giving the average of each feature for all clusters We distinguish some clusters from others.Indeed cluster1 contains imposing statistics.It is a superstar cluster as the best NBA players belongs there.It contains 25 players from a total of 337.It includes players who have proven their talent by playing exceptional season and winning awards such as: Most Valuable Player (best player of the regular season) and NBA All-Defensive Second Team (best defensive player of the regular season).They participated in the NBA, All Star and won the NBA championship with their team.Among them, we have: Lebron James, Stephen Curry, Kobe Bryant, Kevin Durant ... According to the results of the Table 1, one can see that cluster 4 includes excellent rebounders who scored many two points baskets.However, cluster 2 is the one with lower statistics particularly in defence (STL, BLK...).Cluster 3 also displays statistics that are far from impressive, particularly in defence.It includes players like Louis Amundson and Joel Anthony who are considered by the site Bleacherreport.comas "50 Most Worthless Players in the NBA for the 2012-13 Season".We can also find Rodrigue Beaubois, a player from whom one expected a lot but he did not have great performances.The distribution of the players is given by the following Figure 1.Almost half of the sample players are in clusters 2 and 3 with low to medium skills.Scenario 2 For this clustering, the goal was to observe a variation in players performances from a season to another according to their productivity and substitution of players and talented players in the team.For this purpose, we have chosen to work on Phoenix Sun players for the two seasons 2011-12 and 2012-2013.We tested the clustering only on players that were part of the team during the two seasons.K-means with K = 3 was used.The results are given in tables the 2013-14 season, his excellent statistics affected him to cluster 1 The site  perfectly to the cluster to which he belongs, as the average salary of cluster 2 is 1, 774, 548.

Performance prediction
We have selected players who have played several seasons and apply predictions on their statistics.1-To predict Performance of the players, we used the following multiple regression model: P ER = 21.436− 0.0426G − 0.0068M P + 0.0419a 3 F G − 0.0118F GA + 0.0215F T −0.00467F T A + 0.0183ORB + 0.00433DRB + 0.0135AST + 0.0222ST L +0.0171BLK − 0.0196T OV − 0.0113P F.
Multiple linear regression tests conducted on R software showed that all variables are statistically significant with p-value under = 5%.4-The prediction of players salaries was also tested using multiple regression model and significant explanatory variables are selected.These variables are also been predicted using exponential smoothing since they occurred each seasons.Hence the salaries of the players of Phoenix Suns team are predicted using the following estimated regression model: According of the predicted salaries of the players of Phoenix Suns team, we obtained the predicted team franchise costs as shown in Figure 5.

Conclusions
In this work, we aim to assist NBA sports decision makers in the process of acquisition of players by exploiting and applying the data mining techniques as k means, Naive Bayes algorithm, linear regression and time series on large amounts of data.These data are the statistics of ten seasons in the NBA (from 2005-06 to 2014-15).The results we have reached, allow decision makers to identify their needs, determine the players who respond to this need and to select the one or ones that work best for them depending on their current and future performance, taking into account their cost.

Fig. 3 .
Fig. 3. Predicted and true values comparison of Michael Jordan interceptions across the seasons.

Table 4 .
Thompson and Wall statistics in cluster1.

Table 5 .
Isaiah Thomas statistics in cluster 5.Bleacherreport.comalso quoted some players as being the worst in the NBA, among them, Brian Cook.He averaged 5.7 points and 2.7 rebounds in less than 14 minutes per game and spent a lot of time on the bench during his career.The last season for Cook is 2011-12; so we took the statistics of this last season to predict to which cluster he will be assigned.The algorithm classified him in cluster2.A very natural result because Cook spent only 276 minutes on the playground which generated very weak statistics in defence as in offense (

Table 6 )
The average salary of Cook is 1, 955, 569; an average salary that matches

Table 6 .
Isaiah Thomas statistics in cluster 2.