library("parameters")
library("factoextra")
library("NbClust")
library("cluster")
library("formatR")
library("tidyverse"); theme_set(theme_minimal())
library("ClusterR")
library("mclust")
library("easystats")
library("here")
library("knitr")
library("kableExtra")
library("condformat")
library("formattable")
library("reactablefmtr")
library("scales")
library("plotly")
library("flextable")
K-Means Clustering with NBA Data
Overview
Cluster analysis is a statistical tool that partitions the observations in a data set into sub-populations with similar characteristics. This is useful because similar observations often behave and respond to stimuli in similar ways, so identifying clusters allows researchers to predict and draw conclusions about the behavior of particular groups. Cluster analysis is widely used in risk analysis, marketing, real estate, insurance, medical research, and the study of earthquakes.
In this module, we’ll use the clustering of NBA players as an example. Suppose you were an NBA General Manager interested in constructing a high-quality team. The best teams use lots of different kinds of players to achieve their goals. Golden State Warriors guard Stephen Curry is an incredible shooter and ball-handler, but the Warriors need other kinds of players, too. A team composed entirely of Stephen Curry and his clones would struggle to defend or rebound the ball. The team would also struggle to give each Stephen Curry the playing time and shots that he has come to expect. Instead, General Managers can separate potential players into groups, which helps them identify their team’s needs. This is where cluster analysis proves useful.
For this exercise, imagine that you are the General Manager of the Dallas Mavericks. You are tasked with creating a strong, balanced team. Later in the module, you will have an opportunity to create hypothetical trade scenarios that could benefit the team.
Getting Started
Required Packages
We will be using the packages loaded at the top of this module. Take the time now to make sure these packages are installed and loaded on your computer.
The Data
Our data for this exercise comes from the 2021-2022 NBA Season. This season, the Mavericks finished 4th in the Western Conference with 52 wins and 30 losses under coach Jason Kidd. They exceeded expectations and made the Western Conference Finals.
Our data includes 374 players. Each of these 374 players met our requirements of appearing in at least 25 games and averaging at least 12 minutes per game (a complete game is 48 minutes). Because of midseason trades or acquisitions, some players appear in our data twice, because they met our playing-time requirements for two different teams in the same season. The second entry for such a player is marked with a 1 following his name (e.g., Smith becomes Smith1). We’ve divided the variables into two data sets.
The first set of variables focuses on the influence a player has on the game. Some of these variables are the players’ minutes per game, total games played and started, points and rebounds per game, and field goal attempts per game. This will be helpful in clustering the players into groups of stars, average starters, and reserves. We’ve termed this data set “usage”. Below is a data dictionary for the first set of variables.
Variable | Explanation | Example |
|---|---|---|
Name | NBA player's first and last name | Trae Young or Trae Young1 |
POS | playing position | PG (point guard), SG (shooting guard), SF (small forward), PF (power forward), C (center) |
Team | abbreviation of city of player's team | atl (Atlanta), bos (Boston), etc. |
GP | total games played | 46, 70, etc. |
GS | total games started | 7, 56, etc. |
MIN | minutes per game | 18.2, 30.2, etc. |
PTS | points per game | 6.8, 14.9, etc. |
AST | assists per game | 1.1, 3.5, etc. |
TO | turnovers per game | 0.8, 1.7, etc. |
STL | steals per game | 0.5, 1.1, etc. |
OR | offensive rebounds per game | 0.5, 1.4, etc. |
DR | defensive rebounds per game | 2.3, 4.1, etc. |
BLK | blocks per game | 0.2, 0.6, etc. |
PF | personal fouls per game | 1.5, 2.4, etc. |
FGM | field goals made per game | 2.6, 5.5, etc. |
FGA | field goals attempted per game | 5.4, 12.2, etc. |
3PM | 3-point field goals made per game | 0.6, 1.9, etc. |
3PA | 3-point field goals attempted per game | 1.9, 5.2, etc. |
FTM | free throws made per game | 0.8, 2.2, etc. |
FTA | free throws attempted per game | 1.1, 2.8, etc. |
PER | player efficiency rating metric | 11.74, 17.27, etc. |
SC-EFF | scoring efficiency | 1.162, 1.332, etc. |
SH-EFF | shooting efficiency | 0.48, 0.56, etc. |
And here is a small slice of the usage data set.
Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF | SH-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Trae Young | PG | atl | 76 | 76 | 34.9 | 28.4 | 9.7 | 4.0 | 0.9 | 0.7 | 3.1 | 0.1 | 1.7 | 9.4 | 20.3 | 3.1 | 8.0 | 6.6 | 7.3 | 25.48 | 1.396 | 0.54 |
John Collins | PF | atl | 54 | 53 | 30.8 | 16.2 | 1.8 | 1.1 | 0.6 | 1.7 | 6.1 | 1.0 | 3.0 | 6.3 | 11.9 | 1.2 | 3.3 | 2.5 | 3.1 | 18.75 | 1.360 | 0.58 |
Bogdan Bogdanovic | SG | atl | 63 | 27 | 29.3 | 15.1 | 3.1 | 1.1 | 1.1 | 0.5 | 3.5 | 0.2 | 2.1 | 5.4 | 12.6 | 2.7 | 7.3 | 1.5 | 1.8 | 15.49 | 1.196 | 0.54 |
De'Andre Hunter | SF | atl | 53 | 52 | 29.8 | 13.4 | 1.3 | 1.3 | 0.7 | 0.5 | 2.8 | 0.4 | 2.9 | 4.8 | 10.8 | 1.4 | 3.7 | 2.4 | 3.1 | 10.66 | 1.233 | 0.51 |
Kevin Huerter | SG | atl | 74 | 60 | 29.6 | 12.1 | 2.7 | 1.2 | 0.7 | 0.4 | 3.0 | 0.4 | 2.5 | 4.7 | 10.3 | 2.2 | 5.6 | 0.6 | 0.7 | 11.91 | 1.174 | 0.56 |
The second set of variables helps determine a player’s role or function in the game. Some of these variables are Field Goal Percentage, Height, and Weight. Many of the counting statistics have been converted to per-minute values, so that they reflect how often a player does something rather than how long he is on the court. These variables will help divide players into sub-groups like scorers, big men, and wings. We’ve termed this data set “role”. Below is a data dictionary for the second set of variables.
Variable | Explanation | Example |
|---|---|---|
Name | NBA player's first and last name | Trae Young or Trae Young1 |
POS | playing position | PG (point guard), SG (shooting guard), SF (small forward), PF (power forward), C (center) |
Team | abbreviation of city of player's team | atl (Atlanta), bos (Boston), etc. |
Height | height in inches | 76, 81, etc. |
Weight | weight in pounds | 200, 234, etc. |
PTSPerMin | points per minute | 0.356, 0.515, etc. |
ASTPerMin | assists per minute | 0.055, 0.133, etc. |
TOPerMin | turnovers per minute | 0.036, 0.065, etc. |
STLPerMin | steals per minute | 0.023, 0.038, etc. |
ORPerMin | offensive rebounds per minute | 0.022, 0.066, etc. |
DRPerMin | defensive rebounds per minute | 0.101, 0.175, etc. |
BLKPerMin | blocks per minute | 0.009, 0.027, etc. |
PFPerMin | fouls per minute | 0.064, 0.099, etc. |
FGP | field goal percentage | 0.417, 0.496, etc. |
FGMPerMin | field goals made per minute | 0.131, 0.192, etc. |
FGAPerMin | field goals attempted per minute | 0.284, 0.419, etc. |
3PP | 3 point percentage | 0.306, 0.379, etc. |
3PMPerMin | 3 point field goals made per minute | 0.029, 0.072, etc. |
3PAPerMin | 3 point field goals attempted per minute | 0.094, 0.192, etc. |
FTP | free throw percentage | 0.709, 0.842, etc. |
FTMPerMin | free throws made per minute | 0.039, 0.087, etc. |
FTAPerMin | free throws attempted per minute | 0.053, 0.112, etc. |
And here is a small slice of the role data set.
Name | POS | Team | Height | Weight | PTSPerMin | ASTPerMin | TOPerMin | STLPerMin | ORPerMin | DRPerMin | BLKPerMin | PFPerMin | FGP | FGMPerMin | FGAPerMin | 3PP | 3PMPerMin | 3PAPerMin | FTP | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Trae Young | PG | atl | 73 | 180 | 0.814 | 0.278 | 0.115 | 0.026 | 0.020 | 0.089 | 0.003 | 0.049 | 0.460 | 0.269 | 0.582 | 0.382 | 0.089 | 0.229 | 0.904 | 0.189 | 0.209 |
John Collins | PF | atl | 81 | 235 | 0.526 | 0.058 | 0.036 | 0.019 | 0.055 | 0.198 | 0.032 | 0.097 | 0.526 | 0.205 | 0.386 | 0.364 | 0.039 | 0.107 | 0.793 | 0.081 | 0.101 |
Bogdan Bogdanovic | SG | atl | 78 | 220 | 0.515 | 0.106 | 0.038 | 0.038 | 0.017 | 0.119 | 0.007 | 0.072 | 0.431 | 0.184 | 0.430 | 0.368 | 0.092 | 0.249 | 0.843 | 0.051 | 0.061 |
De'Andre Hunter | SF | atl | 80 | 225 | 0.450 | 0.044 | 0.044 | 0.023 | 0.017 | 0.094 | 0.013 | 0.097 | 0.442 | 0.161 | 0.362 | 0.379 | 0.047 | 0.124 | 0.765 | 0.081 | 0.104 |
Kevin Huerter | SG | atl | 79 | 190 | 0.409 | 0.091 | 0.041 | 0.024 | 0.014 | 0.101 | 0.014 | 0.084 | 0.454 | 0.159 | 0.348 | 0.389 | 0.074 | 0.189 | 0.808 | 0.020 | 0.024 |
Part 1: Idea of similarity/distance - Interactive
Below is a set of ten Dallas Mavericks players from 2021-2022 who met our playing-time restrictions. Kristaps Porzingis was traded in the middle of the season, but he still met our playing-time qualifications for the Dallas Mavericks. For this example, we’ve combined a few of the variables from both the usage and role data sets. Consider the players Sterling Brown, Maxi Kleber, Dwight Powell, and Josh Green.
Name | Height | Weight | MIN | PTS | OR | DR | AST | STL | BLK | TO | 2PA | 2PP | 3PA | 3PP | 3PAPerMin | ORPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Luka Doncic | 79 | 230 | 35.4 | 28.4 | 0.9 | 8.3 | 8.7 | 1.2 | 0.6 | 4.5 | 12.8 | 0.528 | 8.8 | 0.353 | 0.249 | 0.025 |
Kristaps Porzingis | 87 | 240 | 29.5 | 19.2 | 1.9 | 5.8 | 2.0 | 0.7 | 1.7 | 1.6 | 9.9 | 0.537 | 5.1 | 0.283 | 0.173 | 0.064 |
Jalen Brunson | 73 | 190 | 31.9 | 16.3 | 0.5 | 3.4 | 4.8 | 0.8 | 0.0 | 1.6 | 9.6 | 0.545 | 3.2 | 0.373 | 0.100 | 0.016 |
Tim Hardaway Jr. | 77 | 205 | 29.6 | 14.2 | 0.3 | 3.4 | 2.2 | 0.9 | 0.1 | 0.8 | 5.4 | 0.473 | 7.2 | 0.336 | 0.243 | 0.010 |
Dorian Finney-Smith | 79 | 220 | 33.1 | 11.0 | 1.5 | 3.2 | 1.9 | 1.1 | 0.5 | 1.0 | 3.2 | 0.599 | 5.4 | 0.395 | 0.163 | 0.045 |
Dwight Powell | 82 | 240 | 21.9 | 8.7 | 2.1 | 2.8 | 1.2 | 0.5 | 0.5 | 0.8 | 4.4 | 0.703 | 0.5 | 0.351 | 0.023 | 0.096 |
Reggie Bullock | 78 | 205 | 28.0 | 8.6 | 0.5 | 3.1 | 1.2 | 0.6 | 0.2 | 0.6 | 1.6 | 0.550 | 5.8 | 0.360 | 0.207 | 0.018 |
Maxi Kleber | 82 | 240 | 24.6 | 7.0 | 1.2 | 4.7 | 1.2 | 0.5 | 1.0 | 0.8 | 1.7 | 0.586 | 4.3 | 0.325 | 0.175 | 0.049 |
Josh Green | 77 | 200 | 15.5 | 4.8 | 0.8 | 1.6 | 1.2 | 0.7 | 0.2 | 0.7 | 2.7 | 0.573 | 1.2 | 0.359 | 0.077 | 0.052 |
Sterling Brown | 77 | 219 | 12.8 | 3.3 | 0.5 | 2.5 | 0.7 | 0.3 | 0.1 | 0.5 | 1.3 | 0.492 | 1.9 | 0.304 | 0.148 | 0.039 |
Exercise 1
For these four players, compare their available statistics.
Which of the four players are most similar kinds of players? Which variables make them similar?
On which variables do they differ most? Which of the four players are the most “different”? Which variables differentiate them the most? Are they similar in any of the categories?
One common and effective way to compare the similarity of two points (or in this case, players) is the Euclidean distance. For two points measured on two variables, \(x\) and \(y\), the distance is given by the following formula:
\(d = \sqrt{(x_{2} - x_{1})^{2} + (y_{2} - y_{1})^{2}}\)
You can visualize this as drawing the shortest line possible between two points and then measuring it. With more than two variables, the formula simply adds one squared difference per variable under the square root. Right now, our variables are in different units (inches, pounds, points, percentage, etc.), so we’ll standardize each of the variables (more on this later) to put them on a common scale. This gives each variable equal importance in our distance formula.
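A minimal sketch of this calculation in base R, using only three of the variables listed for the four players in the table above. The choice of variables and the helper name `euclid` are assumptions made for illustration. Because the distances reported in this module were standardized on a different set of variables, the numbers printed here will not match them.

```r
# Four Mavericks from the table above, with three of the listed variables.
players <- data.frame(
  Height = c(82, 82, 77, 77),
  Weight = c(240, 240, 200, 219),
  PTS    = c(8.7, 7.0, 4.8, 3.3),
  row.names = c("Dwight Powell", "Maxi Kleber", "Josh Green", "Sterling Brown")
)

# Standardize each column to mean 0 and standard deviation 1 so that
# inches, pounds, and points contribute on a common scale.
players_std <- scale(players)

# Euclidean distance for any number of variables (hypothetical helper):
# square root of the sum of squared differences across all columns.
euclid <- function(a, b) sqrt(sum((a - b)^2))

manual <- euclid(players_std["Dwight Powell", ], players_std["Maxi Kleber", ])

# dist() computes the same quantity for every pair of rows at once.
pairwise <- dist(players_std, method = "euclidean")
print(round(pairwise, 3))
```

The lower-triangular layout printed by `dist()` is the same layout used for the distance table in this module: each entry is the distance between the player in its row and the player in its column.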
Below is a table of the distances between each of the players. Match up the player in the column with the player in the row and you’ll find the distance between them. The smaller the value, the more similar the players are.
Player | Dwight Powell | Maxi Kleber | Josh Green |
|---|---|---|---|
Maxi Kleber | 4.269475 | | |
Josh Green | 4.554980 | 4.270914 | |
Sterling Brown | 5.846063 | 3.940473 | 3.102775 |
Below is a visualization of the distances. As the distances increase, the color changes from red to blue. Players matched with themselves will be dark red, because their distance is 0.
fviz_dist(Distance, gradient = list(low = "indianred3", mid = "white", high = "dodgerblue3"))
Exercise 2
Do the tabulated results agree with your previous assessment?
Which is more accurate: your original assessment or the similarity metric?
Part 2: Performing a Cluster Analysis
Calculating the distance between points is the first step in a distance-based cluster analysis. The players with the smallest distance (or with the most similarity) between them are naturally placed in a cluster together.
How does the clustering actually work? As an illustration, we’ll use a basic plot of the Offensive Rebounds and 3-Point Shooting of our Dallas Mavericks players. We’ve standardized the results by adjusting them to per-minute values.
Exercise 3
What do you notice about the data? How would you group the players?
How would you describe these groupings?
In a cluster analysis, every point needs to belong to a cluster. Do any points not seem to have a cluster?
Cluster analysis is the process of partitioning the data into sub-populations, or clusters, so that observations in the same cluster are more similar to each other than to observations in a different cluster. These clusters can then be analyzed.
One common distance-based method for dividing the data into clusters is the K-Means Algorithm. It works in an unsupervised fashion, meaning that the clusters are found by the algorithm and not predetermined by the researcher. In the NBA example, we cannot determine our clusters beforehand. The algorithm may confirm our original intuition, but this is not guaranteed.
The K-Means Algorithm assigns the data to clusters so that the sum of squared distances between each observation and the center (or mean) of its cluster is minimized. At the end, the variance of the points within each cluster is as small as the algorithm can make it. One downside of the K-Means Algorithm is that users must choose the number of clusters in advance. This is entered as the parameter K. Let’s say we want to separate our data into K = 2 clusters. The K-Means algorithm will go through four basic steps:
1. Randomly select two initial cluster centers.
2. Assign each observation to the closest center.
3. Calculate the mean of all the observations within each cluster. These cluster means become the new center of each cluster.
4. Repeat steps 2-3 until no further changes are made.
As these steps are followed, the clusters will move closer and closer to their final positions. Since the first step is to randomly assign cluster centers, the K-Means approach can occasionally yield different results. It’s worth trying it a few different times with different starting points.
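The four steps can be sketched in a few lines of base R. This is an illustration only, not the implementation behind R’s `kmeans()` function. The function name `simple_kmeans` and the column name `ThreePAPerMin` are hypothetical, and the data are the 3PAPerMin and ORPerMin columns copied from the Mavericks table in Part 1.

```r
# 3-point attempts and offensive rebounds per minute for the ten Mavericks.
mavs <- data.frame(
  ThreePAPerMin = c(0.249, 0.173, 0.100, 0.243, 0.163,
                    0.023, 0.207, 0.175, 0.077, 0.148),
  ORPerMin      = c(0.025, 0.064, 0.016, 0.010, 0.045,
                    0.096, 0.018, 0.049, 0.052, 0.039),
  row.names = c("Luka Doncic", "Kristaps Porzingis", "Jalen Brunson",
                "Tim Hardaway Jr.", "Dorian Finney-Smith", "Dwight Powell",
                "Reggie Bullock", "Maxi Kleber", "Josh Green", "Sterling Brown")
)

simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k distinct observations as the initial centers.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Step 2: squared distance from every observation to every center,
    # then assign each observation to its closest center.
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_assignment <- max.col(-d2, ties.method = "first")

    # Step 4: stop once no observation changes cluster.
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # Step 3: move each center to the mean of its assigned observations.
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers)
}

set.seed(321)
fit <- simple_kmeans(mavs, k = 2)
print(split(rownames(mavs), fit$cluster))
```

Running the function with several different seeds is a simple way to see the sensitivity to starting points described above. The `nstart` argument of `kmeans()` automates this by keeping the best of several random starts.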
Before you look below, provide your estimation of the two clusters of our Dallas Mavericks players. Where would you anticipate the cluster centers to be located?
The code below runs the k-means algorithm. In the kmeans function, the first argument is the data, the second is the number of clusters to be fit (i.e. \(k\)) and nstart is the number of random starting points to use for the algorithm.
set.seed(321)
dallasKMeans_prep <- dallas %>%
  select(Name, `3PAPerMin`, ORPerMin) %>%
  column_to_rownames(var = "Name")
dallas2Means <- kmeans(dallasKMeans_prep, centers = 2, nstart = 50)
Exercise 4
Is this how you would have grouped the players?
Notice the large points in the middle of each cluster. These are the cluster centers. Are they where you expected?
How do you think the groupings will change with three clusters?
We can easily tell K-Means to randomly assign three centers, and the process of assigning points to the nearest cluster mean will continue exactly as before.
set.seed(3)
dallas3Means <- kmeans(dallasKMeans_prep, centers = 3, nstart = 50)
dallas3fviz <- fviz_cluster(dallas3Means, dallasKMeans_prep,
                            show.clust.cent = TRUE, stand = FALSE,
                            labelsize = 7, pointsize = 1,
                            main = "Mavericks K = 3 Clusters",
                            xlab = "3 Point Attempts Per Minute",
                            ylab = "Offensive Rebounds Per Minute")
dallas3fviz
Or four clusters?
set.seed(22329)
dallas4Means <- kmeans(dallasKMeans_prep, centers = 4, nstart = 50)
dallas4fviz <- fviz_cluster(dallas4Means, dallasKMeans_prep,
                            show.clust.cent = TRUE, stand = FALSE,
                            labelsize = 7, pointsize = 1,
                            main = "Mavericks K = 4 Clusters",
                            xlab = "3 Point Attempts Per Minute",
                            ylab = "Offensive Rebounds Per Minute")
dallas4fviz
Exercise 5
What happens to Dwight Powell when we increase \(k\) to 4?
Would Dwight be considered an outlier? Why? Is this helpful from a clustering perspective?
Now consider five clusters.
set.seed(102)
dallas5Means <- kmeans(dallasKMeans_prep, centers = 5, nstart = 50)
dallas5fviz <- fviz_cluster(dallas5Means, dallasKMeans_prep,
                            show.clust.cent = TRUE, stand = FALSE,
                            labelsize = 7, pointsize = 1,
                            main = "Mavericks K = 5 Clusters",
                            xlab = "3 Point Attempts Per Minute",
                            ylab = "Offensive Rebounds Per Minute")
dallas5fviz
At some point, the power of clustering the points begins to fade. Does Dwight Powell deserve to be in a cluster of his own? Possibly. Does Reggie Bullock? Definitely not.
Exercise 6
Which of the four values of K did you find most useful or accurate?
Were there ever too few or too many clusters?
Part 4: Choosing the Number of Clusters
So, how can we choose the optimal number of clusters?
It’s helpful to evaluate the effectiveness of the clusters for each value of K. There are plenty of ways to do this, but we’ll walk through a common one called the Elbow Method. The Elbow Method totals the squared distances between each observation and the center of its cluster. This is called the Total Within Summed Squares (TWSS). As K increases and more clusters are added to the model, the TWSS will decrease. Eventually, the benefit of each additional cluster diminishes. The Elbow Method plots the TWSS against K, and the user looks for the point where adding clusters no longer proves useful. Often, this point looks like an elbow.
fviz_nbclust(dallasKMeans_prep, kmeans, method = "wss", k.max = 9) +
  theme_minimal() +
  labs(title = "The Elbow Method")
The graph shows that the value of each additional cluster decreases as more clusters are added, and the bends in the graph indicate that clusters beyond four add little. Despite being common, the Elbow Method is often ambiguous and difficult to interpret. Here, K = 2, K = 3, and K = 4 all seem like reasonable conclusions.
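The quantity behind the Elbow plot can also be computed directly. The sketch below assumes the same two Mavericks variables (copied from the table in Part 1, with the hypothetical column name `ThreePAPerMin`) and reads the total within-cluster sum of squares from the `tot.withinss` element that `kmeans()` returns.

```r
# 3-point attempts and offensive rebounds per minute for the ten Mavericks.
mavs <- data.frame(
  ThreePAPerMin = c(0.249, 0.173, 0.100, 0.243, 0.163,
                    0.023, 0.207, 0.175, 0.077, 0.148),
  ORPerMin      = c(0.025, 0.064, 0.016, 0.010, 0.045,
                    0.096, 0.018, 0.049, 0.052, 0.039)
)

set.seed(321)
k_values <- 1:9

# Fit K-Means once per value of K and keep the total within-cluster
# sum of squares (the TWSS described in the text).
wss <- sapply(k_values, function(k) {
  kmeans(mavs, centers = k, nstart = 50)$tot.withinss
})

# With K = 1 the TWSS is simply the total sum of squares around the
# overall mean, so every later value is a reduction from that baseline.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

Looking at `wss` as a vector of numbers, and not only as a plot, can make it easier to judge how much each additional cluster actually buys.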
The Elbow plot is just one test to determine the optimal number of clusters. Two other popular methods are the Average Silhouette Method and the Gap Statistic Method. In all, there are dozens of methods to determine the ideal number of clusters and they often disagree. We’ll take a consensus of 27 methods and proceed from there.
dallasClust <- n_clusters(dallasKMeans_prep,
                          package = c("easystats", "NbClust"),
                          standardize = FALSE, n_max = 5)
plot(dallasClust)
The tests give varied estimates of the optimal number of clusters, so it is up to the user to decide how many clusters to use in the K-Means Algorithm. It’s common practice to choose several values and compare the results of each.
From there, we would conduct our analysis of each cluster and examine the results.
After the clustering is completed, how can we analyze our clustering solution?
We want to reduce the Total Within Summed Squares (TWSS) or distance from each observation to its cluster mean, but we also want to minimize the total number of clusters used.
Two helpful measurements to summarize these preferences for our clusters are intra-class similarity and inter-class similarity.
Intra-class similarity tests the relationship between observations of the same cluster. We want this similarity to be high. We want all the observations in a cluster to exhibit similar features.
Inter-class similarity tests the relationship between different clusters. We want this relationship to be low. Ideally, each cluster is distinct and the observations within can be clearly assigned to a cluster.
As we increase the number of clusters, K, the intra-class similarity will increase, because observations will be assigned to smaller clusters that are more representative. However, the inter-class similarity will also increase, because the cluster centers are now closer together. This is why it is impractical to choose a large value for K.
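One rough way to see this trade-off in numbers is to compare, for several values of K, how tight the clusters are and how close the nearest pair of centers is. The sketch below is an assumption-laden illustration: it uses the within-cluster sum of squares as a stand-in for intra-class similarity (smaller means more similar) and the smallest distance between two centers as a stand-in for inter-class similarity (smaller means more similar). The data are the two Mavericks columns from Part 1, with the hypothetical column name `ThreePAPerMin`.

```r
# 3-point attempts and offensive rebounds per minute for the ten Mavericks.
mavs <- data.frame(
  ThreePAPerMin = c(0.249, 0.173, 0.100, 0.243, 0.163,
                    0.023, 0.207, 0.175, 0.077, 0.148),
  ORPerMin      = c(0.025, 0.064, 0.016, 0.010, 0.045,
                    0.096, 0.018, 0.049, 0.052, 0.039)
)

set.seed(321)
summary_by_k <- t(sapply(2:4, function(k) {
  fit <- kmeans(mavs, centers = k, nstart = 50)
  c(K = k,
    within = fit$tot.withinss,                # tightness of the clusters
    between = fit$betweenss,                  # spread of the centers
    min_center_gap = min(dist(fit$centers)))  # closest pair of centers
}))

print(signif(summary_by_k, 3))
```

For every K, `within` and `between` add up to the same total sum of squares, so any gain in tightness is paid for entirely by how the centers are spread out.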
Recall our clustering for the Dallas Mavericks players.
dallas2fviz
dallas3fviz +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank())
dallas4fviz +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank())
Exercise 7
Which value of K has the highest intra-class similarity?
Which cluster specifically?
Which value of K has the highest inter-class similarity?
Part 5: A Larger Dataset
Let’s focus now on our larger data set with many more variables and observations. It seems like it’d be more complicated, but the process is almost exactly the same. One important distinction to remember is that the large number of dimensions makes the data difficult to visualize. There are different methods that aid in this visualization. We’ll walk you through the usage data set and demonstrate appropriate analysis, and then allow you to work through the role data set.
Remember the usage data set? It contains variables aimed at categorizing the workload and skill of the players. We hope to divide players into sub-groups like stars and bench players.
It is very important that we standardize the data first. Many of our variables have different units, and variables such as Games Played and Blocks per game are hard to compare without scaling. Without standardizing, variables with large values, like Games Started or Games Played, will exert too much influence on the distances. After standardizing, each value is described in relation to the other observations. For example, Trae Young’s standardized assist value is 3.656, so we know that he averages far more assists than the typical player in our data set (about 3.7 standard deviations above the mean). Standardized data is often difficult to contextualize, so we’ll want to convert the data back to its original units for analysis. Below is a small glimpse into what our standardized data looks like.
Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF | SH-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Trae Young | PG | atl | 1.212 | 1.690 | 1.478 | 2.819 | 3.656 | 3.186 | 0.346 | -0.422 | -0.191 | -0.961 | -0.433 | 2.418 | 2.446 | 2.049 | 1.883 | 3.457 | 2.948 | 2.481 | 0.910 | 0.085 |
John Collins | PF | atl | -0.221 | 0.820 | 0.903 | 0.802 | -0.386 | -0.291 | -0.479 | 0.857 | 1.514 | 1.296 | 1.704 | 0.984 | 0.623 | -0.071 | -0.116 | 0.540 | 0.502 | 0.924 | 0.676 | 0.797 |
Bogdan Bogdanovic | SG | atl | 0.365 | -0.163 | 0.692 | 0.620 | 0.279 | -0.291 | 0.896 | -0.678 | 0.036 | -0.710 | 0.224 | 0.568 | 0.775 | 1.603 | 1.585 | -0.172 | -0.255 | 0.170 | -0.389 | 0.085 |
De'Andre Hunter | SF | atl | -0.286 | 0.782 | 0.762 | 0.339 | -0.642 | -0.052 | -0.204 | -0.678 | -0.362 | -0.209 | 1.539 | 0.290 | 0.385 | 0.152 | 0.054 | 0.469 | 0.502 | -0.948 | -0.149 | -0.449 |
Kevin Huerter | SG | atl | 1.082 | 1.085 | 0.734 | 0.124 | 0.074 | -0.172 | -0.204 | -0.806 | -0.248 | -0.209 | 0.882 | 0.244 | 0.276 | 1.045 | 0.862 | -0.812 | -0.895 | -0.658 | -0.532 | 0.441 |
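As a small illustration of what standardizing does, the sketch below assumes the usual z-score (subtract the mean, divide by the standard deviation) and applies it to the assists column of the five-player usage slice shown earlier. Because it uses only those five players and not all 374, the resulting values differ from the ones in the table above.

```r
# Assists per game for the five Hawks players in the usage slice.
ast <- c("Trae Young" = 9.7, "John Collins" = 1.8, "Bogdan Bogdanovic" = 3.1,
         "De'Andre Hunter" = 1.3, "Kevin Huerter" = 2.7)

# z-score by hand: subtract the mean, divide by the standard deviation.
ast_z <- (ast - mean(ast)) / sd(ast)

# Base R's scale() performs the same calculation column by column.
ast_scaled <- scale(ast)[, 1]

# Converting back to the original units reverses the two steps.
ast_back <- ast_z * sd(ast) + mean(ast)

print(round(ast_z, 3))
```

After standardizing, the column has mean 0 and standard deviation 1, so a positive value means "above the average of this group" regardless of the original units.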
Let’s begin by taking a look at the Elbow plot of the usage dataset.
usage_rm <- usage %>%
  select(-Name, -POS, -Team) %>%
  mutate(across(where(is.numeric), standardize))
fviz_nbclust(usage_rm, kmeans, method = "wss", k.max = 24) +
  theme_minimal() +
  labs(title = "The Elbow Method")
The Elbow plot shows that the algorithm experiences diminishing returns after K = 2 and K = 3. From the Elbow Plot, we would expect that the consensus lies somewhere between 2 and 5 clusters. Now consider the multiple methods for the selection of \(k\).
The tests favor three clusters. Some tests also prefer two and four clusters, so those models are worth a look.
set.seed(121)
usage2Means <- kmeans(usageKMeans_prep, centers = 2, nstart = 50)
set.seed(4)
usage3Means <- kmeans(usageKMeans_prep, centers = 3, nstart = 50)
set.seed(1210)
usage4Means <- kmeans(usageKMeans_prep, centers = 4, nstart = 50)
K = 2 Clusters
Let’s start simple and begin with K = 2 clusters.
But before we begin, let’s first look through the variables in our analysis and see which ones have the most influence on the clustering. If some have little or no influence, we can simplify our analysis by removing them.
The visualization below demonstrates the differences between our two clusters. The variables that have large differences are important in the clustering assignment. They greatly influence the assignment of an observation.
as_tibble(usage2Means$centers, rownames = "cluster") %>%
  pivot_longer(cols = c(GP:`SH-EFF`), names_to = "variable") %>%
  group_by(variable) %>%
  summarise(Influence = abs(mean(value))) %>%
  mutate(
    variable = factor(variable, levels = usage_levels)
  ) %>%
  ggplot(aes(x = variable, y = Influence)) +
  geom_bar(stat = "identity", fill = "cadetblue3") +
  labs(title = "Influence on Cluster Assignment", x = "", y = "") +
  theme(axis.text.y = element_blank(),
        legend.position = "none",
        axis.text.x = element_text(angle = -45, size = 9))
This type of exercise is essential for clustering analysis, because it allows one to see which variables are important to consider when classifying an observation.
This visualization scales the centers of the variables for each cluster and contrasts them. Variables with large positive or negative values have a large influence on the clustering. These variables help differentiate the cluster. Variables with an influence close to 0 have less importance.
We see a great diversity in the variables that possess significant influence on the clustering.
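A closely related check, sketched here in base R, is to measure the gap between the two standardized cluster centers for each variable. This is an alternative measure chosen for illustration, not the calculation used for the chart above. The data are four columns (MIN, PTS, AST, BLK) copied from the ten-player Mavericks table in Part 1.

```r
# Minutes, points, assists, and blocks per game for the ten Mavericks.
mavs <- data.frame(
  MIN = c(35.4, 29.5, 31.9, 29.6, 33.1, 21.9, 28.0, 24.6, 15.5, 12.8),
  PTS = c(28.4, 19.2, 16.3, 14.2, 11.0, 8.7, 8.6, 7.0, 4.8, 3.3),
  AST = c(8.7, 2.0, 4.8, 2.2, 1.9, 1.2, 1.2, 1.2, 1.2, 0.7),
  BLK = c(0.6, 1.7, 0.0, 0.1, 0.5, 0.5, 0.2, 1.0, 0.2, 0.1)
)

# Standardize first so that the gaps are comparable across variables.
mavs_std <- scale(mavs)

set.seed(121)
fit <- kmeans(mavs_std, centers = 2, nstart = 50)

# Absolute gap between the two centers, one value per variable.
# A large gap means the variable separates the clusters strongly.
center_gap <- abs(fit$centers[1, ] - fit$centers[2, ])

print(round(sort(center_gap, decreasing = TRUE), 3))
```

Variables near the bottom of the sorted output are the ones that would be candidates for removal under the reasoning in this section.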
Exercise 8
Which variables seem to contribute the most to the clustering result?
Which variables contribute the least to the clustering result?
Scoring Efficiency and Shooting Efficiency both lack influence. Games Played, Offensive Rebounds, and Blocks also contribute little to our clustering. We chose to remove Shooting Efficiency and keep the other four, though we could just as easily have removed those from our analysis as well.
Note for Reviewer. Removing the five variables causes a slight shift in the cluster assignment. This changes some of the analysis and points I was making on the outliers, and it makes comparison between K = 2 and K = 3 more difficult. We don’t remove any of the variables when K = 3. Still, it could make things confusing to not remove variables with very little influence. I’m open to suggestions on what to do here.
set.seed(121)
usage2Means <- usageKMeans_prep %>%
  select(-`SH-EFF`) %>%
  kmeans(centers = 2, nstart = 50)
usage2 <- usage %>% select(-`SH-EFF`)
usage_rm2 <- usage_rm %>% select(-`SH-EFF`)
Now that we’ve removed Shooting Efficiency, let’s see how many observations are within each cluster.
Cluster | Size |
|---|---|
1 | 119 |
2 | 255 |
The clusters differ in size, and by enough that we should keep an eye on it. It’s important to verify that each cluster contains a meaningful number of observations. That said, as we saw with Dwight Powell earlier, small clusters can sometimes tell us valuable information about the observations they contain.
The K-Means Algorithm assigns each observation to a cluster, and descriptive statistics for each cluster can give us a good idea of what it contains. To keep the results interpretable, the cluster means below are calculated on the original, unstandardized data.
usage2centers <- as_tibble(usage2Means$cluster) %>%
  mutate(Name = usage$Name) %>%
  rename(Clusters = value) %>%
  left_join(usage2, by = "Name") %>%
  group_by(Clusters) %>%
  summarise(
    across(where(is.numeric), mean)
  ) %>%
  # Anonymous function replaces the `...` form deprecated in dplyr 1.1.0
  mutate(across(where(is.numeric), \(x) round(x, digits = 3)))
usage2centers %>% flextable() %>% align(align = "center", part = "all") %>%
  width(j = c(2:15), width = .5)
Clusters | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 60.689 | 56.706 | 32.572 | 18.656 | 4.417 | 2.266 | 1.035 | 1.141 | 4.845 | 0.582 | 2.337 | 6.773 | 14.592 | 1.926 | 5.386 | 3.178 | 3.969 | 17.824 | 1.281 |
2 | 55.847 | 19.455 | 20.540 | 7.937 | 1.685 | 0.913 | 0.652 | 0.978 | 2.779 | 0.437 | 1.789 | 2.958 | 6.429 | 0.954 | 2.728 | 1.071 | 1.429 | 13.324 | 1.244 |
Generally, it looks like cluster 1 contains starter caliber players and cluster 2 includes the bench players. This helps to explain why cluster 1 is a bit smaller than cluster 2.
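The idea of clustering on standardized values but summarizing on the original scale can be sketched in base R with `aggregate()`. The example assumes four columns (MIN, PTS, AST, BLK) copied from the ten-player Mavericks table in Part 1, so the cluster sizes and means it prints are unrelated to the 374-player results above.

```r
# Minutes, points, assists, and blocks per game for the ten Mavericks.
mavs <- data.frame(
  MIN = c(35.4, 29.5, 31.9, 29.6, 33.1, 21.9, 28.0, 24.6, 15.5, 12.8),
  PTS = c(28.4, 19.2, 16.3, 14.2, 11.0, 8.7, 8.6, 7.0, 4.8, 3.3),
  AST = c(8.7, 2.0, 4.8, 2.2, 1.9, 1.2, 1.2, 1.2, 1.2, 0.7),
  BLK = c(0.6, 1.7, 0.0, 0.1, 0.5, 0.5, 0.2, 1.0, 0.2, 0.1)
)

# Cluster on the standardized values...
set.seed(121)
fit <- kmeans(scale(mavs), centers = 2, nstart = 50)

# ...but summarize each cluster in the original units.
cluster_sizes <- table(fit$cluster)
cluster_means <- aggregate(mavs, by = list(Cluster = fit$cluster), FUN = mean)

print(cluster_sizes)
print(round(cluster_means, 2))
```

Reading the means in minutes and points per game is usually easier than reading standardized centers, which is the reason for converting back before interpreting the clusters.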
Now, let’s look at the clusters graphically. This can help us to see how different the clusters really are from each other. The graph combines the values of all the variables into two dimensions that can be plotted. This is done through a process called Principal Component Analysis (PCA). Link to more defined explanation of PCA.
usage2fviz <- fviz_cluster(usage2Means, usageKMeans_prep,
                           geom = "point",
                           show.clust.cent = TRUE, stand = FALSE,
                           pointsize = 1,
                           main = "Usage K = 2 Clusters")
usage2fviz
Many of the observations in both clusters lie close to the border. This indicates that the division between the clusters was close and there may be some observations that could have been placed in either cluster. The centers are fairly close and located at about (-3,0) and (2,0).
There are several large outliers in both clusters, especially in the lower portion of the visualization and in the left portion of cluster 1.
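To make the PCA step less of a black box, here is a minimal base-R sketch of projecting standardized data onto its first two principal components with `prcomp()` and coloring the points by K-Means cluster. It assumes four columns (MIN, PTS, AST, BLK) copied from the ten-player Mavericks table in Part 1, so the picture will not resemble the 374-player plot above.

```r
# Minutes, points, assists, and blocks per game for the ten Mavericks.
mavs <- data.frame(
  MIN = c(35.4, 29.5, 31.9, 29.6, 33.1, 21.9, 28.0, 24.6, 15.5, 12.8),
  PTS = c(28.4, 19.2, 16.3, 14.2, 11.0, 8.7, 8.6, 7.0, 4.8, 3.3),
  AST = c(8.7, 2.0, 4.8, 2.2, 1.9, 1.2, 1.2, 1.2, 1.2, 0.7),
  BLK = c(0.6, 1.7, 0.0, 0.1, 0.5, 0.5, 0.2, 1.0, 0.2, 0.1),
  row.names = c("Luka Doncic", "Kristaps Porzingis", "Jalen Brunson",
                "Tim Hardaway Jr.", "Dorian Finney-Smith", "Dwight Powell",
                "Reggie Bullock", "Maxi Kleber", "Josh Green", "Sterling Brown")
)
mavs_std <- scale(mavs)

# PCA finds new axes (components) ordered by how much variance they capture.
pca <- prcomp(mavs_std)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

set.seed(121)
fit <- kmeans(mavs_std, centers = 2, nstart = 50)

# Plot each player at his scores on the first two components.
plot(pca$x[, 1], pca$x[, 2], col = fit$cluster, pch = 19,
     xlab = sprintf("PC1 (%.0f%% of variance)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.0f%% of variance)", 100 * var_explained[2]))
text(pca$x[, 1], pca$x[, 2], labels = rownames(mavs), pos = 3, cex = 0.7)
```

The share of variance captured by the first two components indicates how faithful the two-dimensional picture is. When that share is low, points that look close on the plot may still be far apart in the full set of variables.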
Prototypes
To help us understand the clusters better, let’s look at some players that fall very close to the cluster center. We’ll call the players that represent the cluster well prototype players.
usage2Means_scale <- as_tibble(usage2Means$centers) %>%
  mutate(cluster = 1:2)
# Look up the center of the cluster each player was assigned to
usage_fitted2Means <- usage2Means$cluster %>%
  as_tibble() %>%
  rename(cluster = value) %>%
  left_join(usage2Means_scale, by = "cluster") %>%
  select(-cluster)
distances <- sqrt(rowSums((usage_rm2 - usage_fitted2Means)^2)) %>%
  as_tibble() %>%
  rename(distance = value) %>%
  mutate(
    Name = usage$Name,
    Cluster = usage2Means$cluster
  )
dist_slice1 <- distances %>%
  arrange(distance) %>%
  select(Name, Cluster, distance) %>%
  filter(Cluster == 1) %>%
  slice(1:3)
dist_slice1 %>%
  mutate(distance = round(distance, digits = 4)) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3)
Name | Cluster | distance |
|---|---|---|
Khris Middleton | 1 | 1.8408 |
Miles Bridges | 1 | 1.9950 |
Gordon Hayward | 1 | 2.2011 |
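The same distance-to-center calculation can be written compactly in base R. The sketch below assumes four columns (MIN, PTS, AST, BLK) copied from the ten-player Mavericks table in Part 1. The players it ranks are therefore prototypes of that small example only, not of the usage clusters.

```r
# Minutes, points, assists, and blocks per game for the ten Mavericks.
mavs <- data.frame(
  MIN = c(35.4, 29.5, 31.9, 29.6, 33.1, 21.9, 28.0, 24.6, 15.5, 12.8),
  PTS = c(28.4, 19.2, 16.3, 14.2, 11.0, 8.7, 8.6, 7.0, 4.8, 3.3),
  AST = c(8.7, 2.0, 4.8, 2.2, 1.9, 1.2, 1.2, 1.2, 1.2, 0.7),
  BLK = c(0.6, 1.7, 0.0, 0.1, 0.5, 0.5, 0.2, 1.0, 0.2, 0.1),
  row.names = c("Luka Doncic", "Kristaps Porzingis", "Jalen Brunson",
                "Tim Hardaway Jr.", "Dorian Finney-Smith", "Dwight Powell",
                "Reggie Bullock", "Maxi Kleber", "Josh Green", "Sterling Brown")
)
mavs_std <- scale(mavs)

set.seed(121)
fit <- kmeans(mavs_std, centers = 2, nstart = 50)

# Row i of assigned_center is the center of the cluster player i belongs to.
assigned_center <- fit$centers[fit$cluster, , drop = FALSE]

# Euclidean distance from each player to his own cluster center.
distance <- sqrt(rowSums((mavs_std - assigned_center)^2))

prototypes <- data.frame(Name = rownames(mavs), Cluster = fit$cluster,
                         distance = distance, row.names = NULL)

# Smallest distances first within each cluster: the top rows are the prototypes.
prototypes <- prototypes[order(prototypes$Cluster, prototypes$distance), ]
print(prototypes)
```

A useful sanity check is that the squared distances add up to the `tot.withinss` value reported by `kmeans()`, since both measure how far the observations sit from their assigned centers.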
Exercise 9
Which player is closest to the center for Cluster 1?
Are there other players who are close to the center for Cluster 1 that could also be considered prototypes?
Look at the prototype players’ statistics to see if we characterize Cluster 1.
This would be a good opportunity to play highlights of one of the players or show a picture or something to keep people engaged.
prototype_k2c1 <- dist_slice1 %>% select(Name) %>% left_join(usage2)
Joining with `by = join_by(Name)`
prototype_k2c1 %>% flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3) %>%
width(j = c(2:15), width = .5)
Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Khris Middleton | SF | mil | 66 | 66 | 32.4 | 20.1 | 5.4 | 2.9 | 1.2 | 0.6 | 4.8 | 0.3 | 2.4 | 6.8 | 15.5 | 2.5 | 6.6 | 3.9 | 4.4 | 18.19 | 1.298 |
Miles Bridges | SF | cha | 80 | 80 | 35.5 | 20.2 | 3.8 | 1.9 | 0.9 | 1.1 | 5.9 | 0.8 | 2.4 | 7.5 | 15.2 | 1.9 | 5.8 | 3.3 | 4.2 | 17.97 | 1.329 |
Gordon Hayward | SF | cha | 49 | 48 | 31.9 | 15.9 | 3.6 | 1.7 | 1.0 | 0.8 | 3.8 | 0.4 | 1.7 | 5.8 | 12.6 | 1.8 | 4.5 | 2.6 | 3.0 | 15.11 | 1.261 |
Consider Khris Middleton, Miles Bridges, and Gordon Hayward. All three play small forward, a position that allows them to contribute in all areas of the game. There was significant variety in the number of Games Played (49-80), but they Started nearly every game they played and received a lot of playing time. They all played over 30 Minutes per game and scored roughly 16-20 Points a game. Their Rebound, Assist, Block, and Turnover totals vary a little bit, but they are all fairly high. They took and made a similar number of shots per game (12.6-15.5 FGA and 5.8-7.5 FGM).
Let’s move on to cluster 2. First, notice how much smaller the distances from the cluster 2 center are in the table below (about 1.2-1.3, compared with 1.8-2.2 for cluster 1). More observations lie close to cluster 2’s center than cluster 1’s. This is not entirely surprising, as cluster 2 contains more than twice as many players as cluster 1.
Again consider potential prototypes for the second cluster.
dist_slice2 <- distances %>% arrange(distance) %>% select(Name, Cluster, distance) %>% filter(Cluster == 2) %>% slice(1:3)
dist_slice2 %>%
mutate(distance = round(distance, digits = 4)) %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3)
Name | Cluster | distance |
|---|---|---|
Blake Griffin | 2 | 1.1953 |
Torrey Craig | 2 | 1.2661 |
Rudy Gay | 2 | 1.2779 |
Blake Griffin is our prototype player of cluster 2. Torrey Craig and Rudy Gay are also strong representatives of cluster 2.
prototype_k2c2 <- dist_slice2 %>% select(Name) %>% left_join(usage2)
Joining with `by = join_by(Name)`
prototype_k2c2 %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3) %>%
width(j = c(2:15), width = .5)
Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Blake Griffin | PF | bkn | 56 | 24 | 17.1 | 6.4 | 1.9 | 0.6 | 0.5 | 1.1 | 3.0 | 0.3 | 1.7 | 2.4 | 5.6 | 0.7 | 2.6 | 1.0 | 1.4 | 13.77 | 1.147 |
Torrey Craig | SF | ind | 51 | 14 | 20.3 | 6.5 | 1.1 | 0.8 | 0.5 | 1.2 | 2.7 | 0.4 | 1.9 | 2.5 | 5.5 | 0.9 | 2.7 | 0.5 | 0.7 | 10.82 | 1.171 |
Rudy Gay | SF | utah | 55 | 1 | 18.9 | 8.1 | 1.0 | 0.9 | 0.5 | 1.0 | 3.4 | 0.3 | 1.7 | 2.9 | 6.9 | 1.3 | 3.7 | 1.1 | 1.4 | 13.06 | 1.177 |
Once again, the prototypes look like an average NBA player. They each played around 55 Games and Started fewer than half of them. They played about 17.1-20.3 Minutes a game and scored from 6.4-8.1 Points a game. Their Rebound, Assist, Steal, Block, Turnover, and Foul values are fairly low and generally close together. They also don’t take as many shots as cluster 1: only about 6 Field Goal Attempts per game.
Outliers
Now, let’s look through some of the players that fall farthest from the center of their cluster. These players are cluster outliers. In these cases, the clustering least represents the observation. These players are very different from the center. It can be helpful to identify and explain outliers by comparing them to our prototype players. How do they differ? What attributes led to their classification?
Is there a way to only label a few of the points in the visualization
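One possible way to label only a few points is to build the scatterplot by hand and pass a filtered data frame to the text layer. The sketch below is an assumption-laden illustration on the built-in `mtcars` data, not the NBA data: it projects the standardized data onto its first two principal components with `prcomp()` and labels only the three observations farthest from their own cluster center. All object names (`plot_df`, `far_df`) are hypothetical.

```r
library(ggplot2)

# A sketch on built-in data (mtcars), not the NBA data set.
toy <- scale(mtcars[, c("mpg", "hp", "wt", "qsec")])
set.seed(1)
toy_km <- kmeans(toy, centers = 2, nstart = 25)

# Distance from each observation to its own cluster center.
toy_dist <- sqrt(rowSums((toy - toy_km$centers[toy_km$cluster, ])^2))

# Project the standardized data onto the first two principal components.
pcs <- prcomp(toy)$x[, 1:2]
plot_df <- data.frame(
  name = rownames(toy),
  PC1 = pcs[, 1],
  PC2 = pcs[, 2],
  cluster = factor(toy_km$cluster),
  distance = toy_dist
)

# Keep labels only for the three observations farthest from their center.
far_df <- plot_df[order(plot_df$distance, decreasing = TRUE)[1:3], ]

p <- ggplot(plot_df, aes(PC1, PC2, colour = cluster)) +
  geom_point(size = 1) +
  geom_text(data = far_df, aes(label = name), vjust = -0.8, show.legend = FALSE)
p
```

Because the labels come from a separate data frame, the same pattern works for prototypes: filter for the smallest distances instead of the largest.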
dist_slice3 <- distances %>% arrange(desc(distance)) %>% select(Name, Cluster, distance) %>% filter(Cluster == 1) %>% slice(1:2,5)
dist_slice3 %>%
mutate(distance = round(distance, digits = 4)) %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3)
Name | Cluster | distance |
|---|---|---|
Rudy Gobert | 1 | 9.1597 |
Joel Embiid | 1 | 8.8545 |
Myles Turner | 1 | 6.6678 |
outlier_k2c1 <- dist_slice3 %>% select(Name) %>%
add_row(Name = "Khris Middleton") %>% add_row(Name = "Blake Griffin") %>% # want to add centers of clusters for reference
left_join(usage2) %>% arrange(desc(DR))
Joining with `by = join_by(Name)`
outlier_k2c1 %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3) %>%
width(j = c(2:15), width = .5)
Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Rudy Gobert | C | utah | 66 | 66 | 32.1 | 15.6 | 1.1 | 1.8 | 0.7 | 3.7 | 11.0 | 2.1 | 2.7 | 5.5 | 7.7 | 0.0 | 0.1 | 4.6 | 6.7 | 24.76 | 2.022 |
Joel Embiid | C | phi | 68 | 68 | 33.8 | 30.6 | 4.2 | 3.1 | 1.1 | 2.1 | 9.6 | 1.5 | 2.7 | 9.8 | 19.6 | 1.4 | 3.7 | 9.6 | 11.8 | 31.24 | 1.558 |
Myles Turner | C | ind | 42 | 42 | 29.4 | 12.9 | 1.0 | 1.3 | 0.7 | 1.5 | 5.5 | 2.8 | 2.8 | 4.8 | 9.4 | 1.5 | 4.4 | 1.9 | 2.5 | 17.45 | 1.374 |
Khris Middleton | SF | mil | 66 | 66 | 32.4 | 20.1 | 5.4 | 2.9 | 1.2 | 0.6 | 4.8 | 0.3 | 2.4 | 6.8 | 15.5 | 2.5 | 6.6 | 3.9 | 4.4 | 18.19 | 1.298 |
Blake Griffin | PF | bkn | 56 | 24 | 17.1 | 6.4 | 1.9 | 0.6 | 0.5 | 1.1 | 3.0 | 0.3 | 1.7 | 2.4 | 5.6 | 0.7 | 2.6 | 1.0 | 1.4 | 13.77 | 1.147 |
Sometimes, you’ll need to do some digging on the outliers. We chose to show you Khris Middleton and Blake Griffin’s characteristics again for comparison. Rudy Gobert, Joel Embiid, and Myles Turner represent very different kinds of outliers. Embiid and Giannis Antetokounmpo (another cluster 1 player, not shown in the table) are superstars. They finished second and third in the MVP voting in the 2021-2022 season. They are very far from the prototype of cluster 1, but they are even further from the prototype of cluster 2. These are the points near (-10, -5) in the visualization. Gobert is far from the center for a different reason: he is an elite rebounder and shot blocker (11.0 Defensive Rebounds and 2.1 Blocks per game) who attempts only 7.7 Field Goals per game.
Myles Turner, however, possesses some attributes that could be classified as cluster 1 and cluster 2. He played lots of Minutes, Started most games, and had strong Rebounding values. However, his shooting numbers fall right between the clusters, and he doesn’t tally very many Points, Assists, Steals, or Turnovers. This point is likely the (-5, -9) outlier in the visualization. He is a borderline case. Is there a more statistical word for this?
dist_slice4 <- distances %>% arrange(desc(distance)) %>% select(Name, Cluster, distance) %>% filter(Cluster == 2) %>% slice(1:3)
dist_slice4 %>%
mutate(distance = round(distance, digits = 4)) %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3)
Name | Cluster | distance |
|---|---|---|
Robert Williams III | 2 | 7.4501 |
Mitchell Robinson | 2 | 7.3269 |
Clint Capela | 2 | 6.4898 |
outlier_k2c2 <- dist_slice4 %>% select(Name) %>% left_join(usage2)
Joining with `by = join_by(Name)`
outlier_k2c2 %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3) %>%
width(j = c(2:15), width = .5)
Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Robert Williams III | C | bos | 61 | 61 | 29.6 | 10.0 | 2.0 | 1.0 | 0.9 | 3.9 | 5.7 | 2.2 | 2.2 | 4.4 | 6.0 | 0 | 0 | 1.1 | 1.5 | 22.10 | 1.649 |
Mitchell Robinson | C | ny | 72 | 62 | 25.7 | 8.5 | 0.5 | 0.8 | 0.8 | 4.1 | 4.5 | 1.8 | 2.7 | 3.6 | 4.8 | 0 | 0 | 1.2 | 2.5 | 20.78 | 1.778 |
Clint Capela | C | atl | 74 | 73 | 27.6 | 11.1 | 1.2 | 0.6 | 0.7 | 3.8 | 8.1 | 1.3 | 2.2 | 5.0 | 8.2 | 0 | 0 | 1.1 | 2.3 | 21.43 | 1.358 |
These cluster 2 outliers are all similar players. Robert Williams III, Mitchell Robinson, and Clint Capela are all big men. Like Myles Turner, they are players that play a lot of Games and Minutes, get lots of Rebounds and Blocks, but don’t shoot very much. Our data emphasizes shooting a lot and perhaps this leaves players like these without an appropriate cluster. They are borderline candidates that perhaps would benefit from another cluster.
# This below is from code that Dr. Sturdivant sent me. The cluster_analysis function produces a different size clusters than we got from the K-Means function
# set.seed(121)
# res_2means <- cluster_analysis(usage_rm,
# n = 2,
# method = "kmeans")
# # res_2means
# summary(res_2means)
#
# # predict(res_2means) # get clusters
# plot(res_2means)
Now, let’s analyze the strength of K = 2 clusters. For reference, we’ve repeated the visualization below.
usage2fviz
The two clusters possess strong inter-class differences. For only two clusters, cluster 1 and cluster 2 are fairly distinct. The centers are far apart and demonstrate two different classifications of players. Cluster 1 is clearly a sub-population of starting, high-volume players and cluster 2 is a sub-population of bench players. Still, we’ve analyzed the outliers and found some players that could fall in either cluster. There could be some confusion for players like Robert Williams III and Myles Turner. These players seem more similar to each other than to most of the players in their own cluster. These outliers fall around (-2, -7). Check the visualizations again to see the cluster of players near there.
The intra-class similarity is fairly low. The clusters are large and have many outliers in each of the directions. Players like Giannis Antetokounmpo, Khris Middleton, and Myles Turner have little in common, but they are all grouped into cluster 1. Yet, most of cluster 1 produce larger values and most of cluster 2 have smaller numbers.
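The ideas of inter-class difference and intra-class similarity can also be given rough numeric summaries. The sketch below is an illustration on the built-in `mtcars` data, not the NBA data. It reports the share of total variation that lies between clusters (from the `betweenss` and `totss` components of a `kmeans` fit) and the average silhouette width from `cluster::silhouette()`. The helper `avg_sil()` is a hypothetical name introduced here.

```r
library(cluster)  # for silhouette()

# A sketch on built-in data (mtcars), not the NBA data set.
toy <- scale(mtcars[, c("mpg", "hp", "wt", "qsec")])
set.seed(1)
km2 <- kmeans(toy, centers = 2, nstart = 25)
km3 <- kmeans(toy, centers = 3, nstart = 25)

# Share of the total variation explained by the cluster assignment.
# Larger values suggest stronger separation between clusters.
between_share <- c(k2 = km2$betweenss / km2$totss,
                   k3 = km3$betweenss / km3$totss)

# Average silhouette width: values near 1 suggest compact, well-separated
# clusters; values near 0 suggest many observations near a border.
avg_sil <- function(km, x) mean(silhouette(km$cluster, dist(x))[, "sil_width"])
sil <- c(k2 = avg_sil(km2, toy), k3 = avg_sil(km3, toy))

round(rbind(between_share, sil), 3)
```

Neither number replaces looking at prototypes and outliers, but both give a quick way to compare two values of K on the same data.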
K = 3 Clusters - Interactive
# keep in case of reset
set.seed(4)
usage3Means <- kmeans(usageKMeans_prep, centers = 3, nstart = 50)
Now, let’s look at the consensus tests’ most popular number of clusters: K = 3. Here, we’d like you to produce your own analysis of the results. If you need help, look back at the K = 2 example.
As you progress, fill out this table with descriptors of the three clusters. This will be helpful for you as you try to identify their distinctions.
stu_table <- tibble(
Cluster = 1:3,
Description = "")
stu_table %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 2, width = 4)
Cluster | Description |
|---|---|
1 | |
2 | |
3 | |
Once again, let’s first look through the variables in our analysis and see which ones have the most influence on the clustering.
This visualization plots the centers for each variable in a cluster. At a glance, this helps us to understand the characteristics of each cluster. We can see that cluster 2, for example, has high offensive rebounds and blocks per game, but low 3 point attempts and 3 point makes.
It can also tell us which variables are unimportant. If a variable has a similar mean in all three clusters, then the variable does not help us to distinguish between the clusters. If a variable has a large positive value in one cluster and a large negative value in another, then that variable is very useful for classifying our data.
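One rough way to rank variables by how much they separate the clusters is to measure, for each variable, the gap between its largest and smallest standardized cluster center. The sketch below illustrates this on the built-in `mtcars` data, not the NBA data; `center_spread` is a hypothetical name, and the range is only one of several reasonable summaries.

```r
# A sketch on built-in data (mtcars), not the NBA data set.
toy <- scale(mtcars)
set.seed(1)
km3 <- kmeans(toy, centers = 3, nstart = 25)

# For each variable, the gap between the largest and smallest
# standardized cluster center.
center_spread <- apply(km3$centers, 2, function(v) diff(range(v)))

# Variables at the top of this sorted list do the least to separate the clusters.
sort(center_spread)
```

A variable whose spread is close to 0 has nearly the same mean in every cluster, which matches the visual rule described above.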
# creates a dataset of each variable and the standardized center and graphs it
as_tibble(usage3Means$centers, rownames = "cluster") %>%
pivot_longer(cols = c(GP:`SH-EFF`), names_to = "variable") %>%
mutate(variable = factor(variable, usage_levels)) %>%
ggplot(aes(x = variable, y = value, fill = cluster)) +
geom_bar(stat = "identity") +
facet_grid(rows = vars(cluster)) +
theme(axis.text.x=element_text(angle = -45, hjust = 0, size = 10)) +
scale_y_continuous(position = "right") +
labs(title = "Influence on the Cluster Assignment", x = "", y = "Cluster") +
theme(axis.text.y = element_blank(),
legend.position = "none")
Before you analyze, remember that variables with a strong negative value still have large influence. It’s just a negative association with a variable instead of a positive association.
What do you notice about the variables? Which kinds of variables possess significant influence? Some variables have a strong influence in one cluster, but a weak influence in another cluster. Why is this?
After analyzing, would you choose to remove any variables from the data?
Is there a better way to look at the variables and remove the less influential ones?
We chose to remove the Games Played variable, because its influence was close to 0 in all three clusters. All of the other variables had a large effect in some category.
# reproducing K = 3 means without insignificant variables
set.seed(4)
usage3Means <- usageKMeans_prep %>%
select(-GP) %>%
kmeans(centers = 3, nstart = 50)
# creating a second usage without those variables so i don't have to reproduce it 800 million times.
usage3 <- usage %>% select(-GP)
usage_rm3 <- usage_rm %>% select(-GP)
Now that we’ve removed the Games Played variable, let’s see how many observations are within each cluster.
usage3Means$size %>% as_tibble() %>%
rename(Size = value) %>%
mutate(Cluster = 1:n()) %>%
relocate(Cluster, .before = Size) %>%
flextable() %>%
align(align = "center", part = "all")
Cluster | Size |
|---|---|
1 | 102 |
2 | 61 |
3 | 211 |
What do you notice about the cluster size? What could this tell us about the clusters?
The clusters are not identical in size, but the clusters are each large enough that there is no reason to be concerned.
# un-standardizing and calculating the mean
usage3centers <- as_tibble(usage3Means$cluster) %>%
mutate(Name = usage$Name) %>%
rename(Clusters = value) %>% left_join(usage3, by = "Name") %>%
group_by(Clusters) %>%
summarise(
across(where(is.numeric), mean)
) %>%
mutate(across(where(is.numeric), ~ round(.x, digits = 3)))
usage3centers %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = c(2:15), width = .5)
Clusters | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF | SH-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 57.539 | 33.089 | 19.346 | 4.773 | 2.355 | 1.071 | 0.983 | 4.652 | 0.513 | 2.289 | 6.966 | 15.224 | 2.066 | 5.755 | 3.342 | 4.123 | 17.816 | 1.267 | 0.525 |
2 | 36.295 | 22.354 | 9.580 | 1.551 | 1.175 | 0.649 | 2.220 | 4.575 | 0.954 | 2.438 | 3.818 | 6.684 | 0.351 | 1.034 | 1.611 | 2.315 | 18.374 | 1.455 | 0.604 |
3 | 17.185 | 20.735 | 7.992 | 1.773 | 0.902 | 0.667 | 0.709 | 2.519 | 0.333 | 1.669 | 2.924 | 6.708 | 1.139 | 3.254 | 1.005 | 1.304 | 12.231 | 1.193 | 0.520 |
What do you notice about the cluster means? Without looking any further, how would you describe the three clusters? Jot down some notes in your table.
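The table above averages the original values within each cluster. An equivalent route is to back-transform the standardized centers with x = z * sd + mean. The sketch below checks that the two routes agree on the built-in `mtcars` data, not the NBA data. It assumes the data were standardized with base R's `scale()`, which stores the means and standard deviations as attributes; the module's own standardizing function may store them differently.

```r
# A sketch on built-in data (mtcars), not the NBA data set.
raw <- mtcars[, c("mpg", "hp", "wt", "qsec")]
toy <- scale(raw)
set.seed(1)
km3 <- kmeans(toy, centers = 3, nstart = 25)

# Route 1: average the raw values within each cluster.
raw_means <- aggregate(raw, by = list(cluster = km3$cluster), FUN = mean)

# Route 2: back-transform the standardized centers, x = z * sd + mean.
mu <- attr(toy, "scaled:center")
sigma <- attr(toy, "scaled:scale")
back <- sweep(sweep(km3$centers, 2, sigma, "*"), 2, mu, "+")

# The two routes should agree up to floating point error.
all.equal(as.matrix(raw_means[, -1]), back, check.attributes = FALSE)
```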
Now, let’s look at the clusters graphically.
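# plot the same columns that were used to fit the K = 3 model (Games Played was removed)
usage3fviz <- fviz_cluster(usage3Means, usageKMeans_prep %>% select(-GP),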
geom = "point",
show.clust.cent = TRUE, stand = FALSE,
pointsize = 1,
main = "Usage K = 3 Clusters")
usage3fviz
What do you notice about the visualization? Are there a lot of observations that reside on the border? Where are the centers and outliers of each cluster?
Compare the new visualization with the K = 2 visualization. Where did the third cluster come from? What kinds of players?
If you were to create a fourth cluster, what points would you group together?
Let’s look at our prototype and outlier players. We’ve compiled them all into a table for you to compare and contrast.
# standardizing the distances between the players
usage3Means_scale <- as_tibble(usage3Means$centers) %>%
mutate(cluster = 1:3)
# creating appropriate tibble for distance formula
usage_fitted3Means <- usage3Means$cluster %>%
as_tibble() %>%
rename(cluster = value) %>% left_join(usage3Means_scale) %>% select(-cluster)
Joining with `by = join_by(cluster)`
# distance from cluster center
distances <- sqrt(rowSums((usage_rm3 - usage_fitted3Means)^ 2)) %>%
as_tibble() %>%
rename(distance = value) %>%
mutate(
Name = usage$Name,
Cluster = usage3Means$cluster)
# creating a master document with all of the prototypes and all of the outliers.
master_distances <- distances %>%
group_by(Cluster) %>%
mutate(
outlier_rank = order(order(distance, decreasing=TRUE)),
proto_rank = order(order(distance, decreasing = FALSE))) %>%
filter(outlier_rank < 4 | proto_rank < 4) %>%
mutate(
Category = if_else(proto_rank < 4, "Prototype", "Outlier")
) %>%
select(Name, Cluster, Category) %>%
left_join(usage3) %>% arrange(Cluster, desc(Category))
Joining with `by = join_by(Name)`
master_distances %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3) %>%
width(j = 2, width = .8) %>%
width(j = c(4:15), width = .5)
Name | Cluster | Category | POS | Team | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF | SH-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Miles Bridges | 1 | Prototype | SF | cha | 80 | 35.5 | 20.2 | 3.8 | 1.9 | 0.9 | 1.1 | 5.9 | 0.8 | 2.4 | 7.5 | 15.2 | 1.9 | 5.8 | 3.3 | 4.2 | 17.97 | 1.329 | 0.55 |
Malcolm Brogdon | 1 | Prototype | PG | ind | 36 | 33.5 | 19.1 | 5.9 | 2.1 | 0.8 | 0.9 | 4.2 | 0.4 | 2.0 | 6.8 | 15.1 | 1.6 | 5.2 | 4.0 | 4.6 | 18.10 | 1.265 | 0.50 |
Khris Middleton | 1 | Prototype | SF | mil | 66 | 32.4 | 20.1 | 5.4 | 2.9 | 1.2 | 0.6 | 4.8 | 0.3 | 2.4 | 6.8 | 15.5 | 2.5 | 6.6 | 3.9 | 4.4 | 18.19 | 1.298 | 0.52 |
Nikola Jokic | 1 | Outlier | C | den | 74 | 33.5 | 27.1 | 7.9 | 3.8 | 1.5 | 2.8 | 11.0 | 0.9 | 2.6 | 10.3 | 17.7 | 1.3 | 3.9 | 5.1 | 6.3 | 32.94 | 1.529 | 0.62 |
Giannis Antetokounmpo | 1 | Outlier | PF | mil | 67 | 32.9 | 29.9 | 5.8 | 3.3 | 1.1 | 2.0 | 9.6 | 1.4 | 3.2 | 10.3 | 18.6 | 1.1 | 3.6 | 8.3 | 11.4 | 32.12 | 1.608 | 0.58 |
Joel Embiid | 1 | Outlier | C | phi | 68 | 33.8 | 30.6 | 4.2 | 3.1 | 1.1 | 2.1 | 9.6 | 1.5 | 2.7 | 9.8 | 19.6 | 1.4 | 3.7 | 9.6 | 11.8 | 31.24 | 1.558 | 0.53 |
Nic Claxton | 2 | Prototype | PF | bkn | 19 | 20.7 | 8.7 | 0.9 | 0.8 | 0.5 | 1.9 | 3.7 | 1.1 | 2.3 | 3.8 | 5.6 | 0.0 | 0.0 | 1.1 | 2.0 | 18.66 | 1.553 | 0.67 |
Isaiah Roby | 2 | Prototype | PF | okc | 28 | 21.1 | 10.1 | 1.6 | 1.0 | 0.8 | 1.7 | 3.2 | 0.8 | 2.4 | 3.7 | 7.2 | 1.0 | 2.2 | 1.7 | 2.6 | 18.35 | 1.406 | 0.58 |
Richaun Holmes | 2 | Prototype | C | sac | 37 | 23.9 | 10.4 | 1.1 | 1.2 | 0.4 | 2.1 | 5.0 | 0.9 | 2.8 | 4.4 | 6.7 | 0.0 | 0.1 | 1.6 | 2.0 | 17.80 | 1.560 | 0.66 |
Robert Williams III | 2 | Outlier | C | bos | 61 | 29.6 | 10.0 | 2.0 | 1.0 | 0.9 | 3.9 | 5.7 | 2.2 | 2.2 | 4.4 | 6.0 | 0.0 | 0.0 | 1.1 | 1.5 | 22.10 | 1.649 | 0.74 |
Myles Turner | 2 | Outlier | C | ind | 42 | 29.4 | 12.9 | 1.0 | 1.3 | 0.7 | 1.5 | 5.5 | 2.8 | 2.8 | 4.8 | 9.4 | 1.5 | 4.4 | 1.9 | 2.5 | 17.45 | 1.374 | 0.59 |
Rudy Gobert | 2 | Outlier | C | utah | 66 | 32.1 | 15.6 | 1.1 | 1.8 | 0.7 | 3.7 | 11.0 | 2.1 | 2.7 | 5.5 | 7.7 | 0.0 | 0.1 | 4.6 | 6.7 | 24.76 | 2.022 | 0.71 |
Damion Lee | 3 | Prototype | SG | gs | 5 | 20.0 | 7.4 | 1.0 | 0.6 | 0.6 | 0.4 | 2.8 | 0.1 | 1.5 | 2.7 | 6.1 | 1.0 | 3.0 | 1.0 | 1.2 | 10.90 | 1.219 | 0.52 |
Ziaire Williams | 3 | Prototype | SG | mem | 31 | 21.7 | 8.1 | 1.0 | 0.7 | 0.6 | 0.4 | 1.7 | 0.2 | 1.8 | 3.1 | 6.8 | 1.2 | 3.9 | 0.7 | 0.9 | 9.70 | 1.182 | 0.54 |
Rudy Gay | 3 | Prototype | SF | utah | 1 | 18.9 | 8.1 | 1.0 | 0.9 | 0.5 | 1.0 | 3.4 | 0.3 | 1.7 | 2.9 | 6.9 | 1.3 | 3.7 | 1.1 | 1.4 | 13.06 | 1.177 | 0.51 |
Tomas Satoransky | 3 | Outlier | SG | no | 3 | 15.0 | 2.8 | 2.4 | 0.7 | 0.4 | 0.6 | 1.4 | 0.0 | 1.0 | 1.0 | 3.3 | 0.2 | 1.0 | 0.6 | 0.8 | 6.51 | 0.822 | 0.32 |
Robert Covington | 3 | Outlier | PF | por | 40 | 29.8 | 7.6 | 1.4 | 1.2 | 1.5 | 0.9 | 4.9 | 1.3 | 2.8 | 2.7 | 7.0 | 1.6 | 4.8 | 0.6 | 0.8 | 9.98 | 1.086 | 0.50 |
Buddy Hield1 | 3 | Outlier | SG | sac | 6 | 28.6 | 14.4 | 1.9 | 1.6 | 0.9 | 0.8 | 3.2 | 0.3 | 2.1 | 4.8 | 12.6 | 3.3 | 9.0 | 1.5 | 1.7 | 11.96 | 1.143 | 0.51 |
Here is a smaller table that may help you compare the players more easily.
master_distances1 <- distances %>%
group_by(Cluster) %>%
mutate(
outlier_rank = order(order(distance, decreasing=TRUE)),
proto_rank = order(order(distance, decreasing = FALSE))) %>%
filter(outlier_rank < 2 | proto_rank < 2) %>%
mutate(
Category = if_else(proto_rank < 2, "Prototype", "Outlier")
) %>%
select(Name, Cluster, Category) %>%
left_join(usage3) %>% arrange(desc(Category), Cluster)
Joining with `by = join_by(Name)`
master_distances1 %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3) %>%
width(j = 2, width = .8) %>%
width(j = c(4:15), width = .5)
Name | Cluster | Category | POS | Team | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF | SH-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Khris Middleton | 1 | Prototype | SF | mil | 66 | 32.4 | 20.1 | 5.4 | 2.9 | 1.2 | 0.6 | 4.8 | 0.3 | 2.4 | 6.8 | 15.5 | 2.5 | 6.6 | 3.9 | 4.4 | 18.19 | 1.298 | 0.52 |
Isaiah Roby | 2 | Prototype | PF | okc | 28 | 21.1 | 10.1 | 1.6 | 1.0 | 0.8 | 1.7 | 3.2 | 0.8 | 2.4 | 3.7 | 7.2 | 1.0 | 2.2 | 1.7 | 2.6 | 18.35 | 1.406 | 0.58 |
Damion Lee | 3 | Prototype | SG | gs | 5 | 20.0 | 7.4 | 1.0 | 0.6 | 0.6 | 0.4 | 2.8 | 0.1 | 1.5 | 2.7 | 6.1 | 1.0 | 3.0 | 1.0 | 1.2 | 10.90 | 1.219 | 0.52 |
Joel Embiid | 1 | Outlier | C | phi | 68 | 33.8 | 30.6 | 4.2 | 3.1 | 1.1 | 2.1 | 9.6 | 1.5 | 2.7 | 9.8 | 19.6 | 1.4 | 3.7 | 9.6 | 11.8 | 31.24 | 1.558 | 0.53 |
Rudy Gobert | 2 | Outlier | C | utah | 66 | 32.1 | 15.6 | 1.1 | 1.8 | 0.7 | 3.7 | 11.0 | 2.1 | 2.7 | 5.5 | 7.7 | 0.0 | 0.1 | 4.6 | 6.7 | 24.76 | 2.022 | 0.71 |
Tomas Satoransky | 3 | Outlier | SG | no | 3 | 15.0 | 2.8 | 2.4 | 0.7 | 0.4 | 0.6 | 1.4 | 0.0 | 1.0 | 1.0 | 3.3 | 0.2 | 1.0 | 0.6 | 0.8 | 6.51 | 0.822 | 0.32 |
Use the above tables to summarize each of the 6 categories. What kind of players belong in each category? Is there a lot of variation within the prototypes? Is there a lot of variation within the outliers? Which of the outliers are closest to a different cluster? Would you reclassify any of the outliers?
After looking through the clusters, why do you think cluster 2 is so much smaller?
Let’s analyze the overall strength of K = 3 clusters. How does the intra-class similarity compare with K = 2? The inter-class similarity?
# usage3fviz
Comparing K = 2 to K = 3 - Mix
Often, it is interesting to compare the cluster results. Here, we tabulated the cluster assignments between K = 2 and K = 3. This can help us to see how the clustering with K = 2 overlaps with K = 3.
# creating a tibble of the cluster of each player for each K
clusters <- tibble(
player = usage$Name,
Cluster = usage2Means$cluster,
clus3 = usage3Means$cluster,
clus4 = usage4Means$cluster
)
# tabulating K = 2 and K = 3 clusters
compare_K2K3 <- with(clusters, table(Cluster, clus3)) %>%
as_tibble() %>%
pivot_wider(names_from = clus3, values_from = n)
# printing table using flextable
compare_K2K3 %>%
flextable() %>%
align(align = "center", part = "all")
Cluster | 1 | 2 | 3 |
|---|---|---|---|
1 | 102 | 13 | 4 |
2 | 0 | 48 | 207 |
What do you notice about the clustering distribution?
The rows show the K = 2 clusters and the columns show the K = 3 clusters. We can see that most players in cluster 1 from K = 2 (102 of 119) stayed in cluster 1 when K = 3. We identified both of these clusters as the “starters,” so this makes a lot of intuitive sense. Most of cluster 2 from K = 2 (207 of 255) moved into cluster 3 when K = 3. The interesting transition comes with the middle cluster of K = 3. This cluster is full of big men that don’t score a lot. They came from both cluster 1 (13 players) and cluster 2 (48 players) of K = 2. We saw this in our outlier analysis earlier.
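A cross-tabulation like the one above can be supplemented with row proportions and a single agreement score. The sketch below is an illustration on the built-in `mtcars` data, not the NBA data. It uses `adjustedRandIndex()` from the mclust package, which is already loaded in this module; the object names are hypothetical.

```r
library(mclust)  # for adjustedRandIndex()

# A sketch on built-in data (mtcars), not the NBA data set.
toy <- scale(mtcars[, c("mpg", "hp", "wt", "qsec")])
set.seed(1)
km2 <- kmeans(toy, centers = 2, nstart = 25)
km3 <- kmeans(toy, centers = 3, nstart = 25)

# Rows: K = 2 cluster. Columns: K = 3 cluster.
cross <- table(K2 = km2$cluster, K3 = km3$cluster)
cross

# Row proportions show where the members of each K = 2 cluster went.
row_share <- prop.table(cross, margin = 1)
round(row_share, 2)

# Adjusted Rand index: 1 means the two partitions are identical,
# values near 0 mean agreement no better than chance.
ari <- adjustedRandIndex(km2$cluster, km3$cluster)
ari
```

Row proportions answer the question "where did this cluster go?" directly, which is easier to read than raw counts when the clusters differ a lot in size.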
Exercise 10
What are the benefits and costs of both K = 2 and K = 3? Which would you choose?
Part 6: Role Data Set
Now we move on to a second data set and we want to give you a lot more autonomy to test different clusters or outliers yourself. The data set is different, but the process is almost exactly the same. If you have questions, we’ll give you hints or you can look back to the usage data set for a clear example.
Remember the role data set? It contains variables aimed at categorizing the function and specific characteristics of the players. We hope to divide players into sub-groups like scorers, 3-point shooters, and rebounders.
Even though most of our data has been adjusted to “per minute” quantities, it is still very important that we standardize the data first. Otherwise, variables with larger values, like points per minute, will outweigh the effect of variables with smaller values, like blocks per minute. After standardizing, each variable is on the same scale. Standardized data is often difficult to contextualize, so we’ll want to convert the data back for analysis. Below is a small glimpse into what our standardized data looks like.
We could also give a short mini lesson on the importance of standardizing using games started and blocks or something like that.
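A small illustration of why standardizing matters: k-means minimizes a sum of squared distances, so variables with large variances contribute most of that sum. The sketch below uses the built-in `mtcars` data, not the NBA data, and the helper `ss_share()` is a hypothetical name introduced here. It shows each variable's share of the total sum of squares before and after standardizing with base R's `scale()`.

```r
# A sketch on built-in data (mtcars), not the NBA data set.
raw <- mtcars[, c("disp", "hp", "wt", "qsec")]

# The raw standard deviations differ by orders of magnitude.
round(sapply(raw, sd), 2)

# Share of the total sum of squares (about the mean) contributed by each variable.
ss_share <- function(x) {
  ss <- colSums(scale(x, center = TRUE, scale = FALSE)^2)
  round(ss / sum(ss), 3)
}

ss_share(raw)         # almost everything comes from disp and hp
ss_share(scale(raw))  # after standardizing, each variable contributes equally
```

On the raw data, a clustering would be driven almost entirely by the two large-scale variables; after standardizing, all four variables carry the same weight.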
# initializing our datasets a second time in case student decides to remove a variable.
# For some reason, when I round to 3 digits, the elbow plot no longer suggests K = 7. This is very surprising. So I've decided to keep it rounding to 4 digits, because I have done so much work for K = 7.
role <- nba %>%
select(Name, POS, Team, Height, Weight, PTSPerMin, ASTPerMin, TOPerMin, STLPerMin, ORPerMin, DRPerMin, BLKPerMin, PFPerMin, FGP, FGMPerMin, FGAPerMin, `3PP`, `3PMPerMin`, `3PAPerMin`, FTP, FTMPerMin, FTAPerMin)
# standardizing the data for KMeans
roleKMeans_prep <- role %>%
mutate(across(where(is.numeric), standardize))
# displaying the standardized data for student
roleKMeans_prep %>%
slice(1:5) %>%
mutate(across(where(is.numeric), ~ round(.x, digits = 3))) %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3) %>%
width(j = c(2:5), width = .6) %>%
width(j = c(6:12), width = .95)
Name | POS | Team | Height | Weight | PTSPerMin | ASTPerMin | TOPerMin | STLPerMin | ORPerMin | DRPerMin | BLKPerMin | PFPerMin | FGP | FGMPerMin | FGAPerMin | 3PP | 3PMPerMin | 3PAPerMin | FTP | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Trae Young | PG | atl | -1.655 | -1.496 | 2.824 | 3.171 | 2.751 | -0.499 | -0.698 | -0.944 | -1.083 | -1.288 | -0.085 | 2.237 | 2.300 | 0.525 | 1.354 | 1.164 | 1.376 | 3.112 | 2.499 |
John Collins | PF | atl | 0.842 | 0.774 | 0.620 | -0.708 | -0.773 | -1.019 | 0.269 | 1.024 | 0.734 | 0.474 | 0.820 | 0.857 | 0.313 | 0.357 | -0.356 | -0.443 | 0.259 | 0.360 | 0.275 |
Bogdan Bogdanovic | SG | atl | -0.094 | 0.155 | 0.539 | 0.129 | -0.692 | 0.469 | -0.780 | -0.392 | -0.840 | -0.457 | -0.483 | 0.426 | 0.757 | 0.394 | 1.468 | 1.427 | 0.762 | -0.405 | -0.529 |
De'Andre Hunter | SF | atl | 0.530 | 0.362 | 0.036 | -0.970 | -0.420 | -0.689 | -0.788 | -0.852 | -0.435 | 0.471 | -0.332 | -0.069 | 0.069 | 0.497 | -0.081 | -0.219 | -0.022 | 0.343 | 0.344 |
Kevin Huerter | SG | atl | 0.218 | -1.083 | -0.277 | -0.129 | -0.558 | -0.676 | -0.877 | -0.719 | -0.429 | 0.005 | -0.168 | -0.118 | -0.078 | 0.590 | 0.857 | 0.637 | 0.410 | -1.193 | -1.304 |
# finishing prepping data for KMeans procedure
roleKMeans_prep <- roleKMeans_prep %>%
column_to_rownames(var = "Name") %>%
select(-Team, -POS)
Let’s check our Elbow plot to get an idea of the clustering.
# removing text for visualizations and standardizing
role_rm <- role %>%
select(-Name, -POS, -Team) %>%
mutate(across(where(is.numeric), standardize))
fviz_nbclust(role_rm, kmeans, method = "wss", k.max = 24) +
theme_minimal() +
labs(title = "The Elbow Method")
Exercise 11
a) What do you see from the Elbow plot? At what point do the returns diminish?
b) How many clusters does the Elbow plot suggest?
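To make "diminishing returns" concrete, the quantity behind an elbow plot can be computed by hand: the total within-cluster sum of squares (WSS) for each K, and the percentage drop from adding one more cluster. The sketch below is an illustration on the built-in `mtcars` data, not the NBA data, with hypothetical object names.

```r
# A sketch on built-in data (mtcars), not the NBA data set.
toy <- scale(mtcars)

set.seed(1)
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Percent drop in WSS gained by adding each additional cluster.
pct_drop <- round(100 * -diff(wss) / head(wss, -1), 1)
data.frame(k = k_values[-1], wss = round(wss[-1], 1), pct_drop = pct_drop)

# A basic elbow plot.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

The elbow is roughly where `pct_drop` stops being large; that judgment remains subjective, which is why the module also consults the consensus methods.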
# creates consensus clusters
roleClust <- n_clusters(role_rm,
package = c("easystats", "NbClust"),
standardize = FALSE, n_max = 10)
plot(roleClust) +
labs(title = "Optimal Number of Clusters", x = "")
There’s a lot of variation in the preferred number of clusters. How many clusters would you choose to analyze? How many values of K would you like to analyze? This is totally up to you. Feel free to move back and forth through this section to analyze the data as much as you like.
Exercise 12 (Maybe a final analysis for them to do?)
We will be using K = 7 for the trade scenario portion, so we recommend that you work through K = 7.
give them space to choose
# assume that they want K = 7.
stu_cluster <- 7
Ok, you’ve chosen K = 7. Here is an empty table for you to describe each of the clusters. As you grow in understanding of each of the clusters, fill it out with a few distinguishing words. Make sure you can glance at the table and understand what separates one cluster from another.
stu_role_table <- tibble(
Cluster = 1:stu_cluster,
Description = "")
stu_role_table %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 2, width = 4)
Cluster | Description |
|---|---|
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
We’ll begin by looking at the mean for each variable of a cluster. Remember, this can help us identify variables that are not useful and get a general understanding of the characteristics of each cluster.
There may be a lot of variables, so we flipped the coordinates of the plot to make it easier to read. A bar to the right indicates a positive association and a bar to the left indicates a negative association.
set.seed(100)
roleKMeans <- kmeans(roleKMeans_prep, centers = stu_cluster, nstart = 50)
# creating factor levels for role
role_levels <- colnames(role)
# creates a dataset of each variable and the standardized center and graphs it
as_tibble(roleKMeans$centers, rownames = "cluster") %>%
pivot_longer(cols = c(Height:FTAPerMin), names_to = "variable") %>%
mutate(variable = factor(variable, role_levels)) %>%
ggplot(aes(x = variable, y = value, fill = cluster)) +
geom_bar(stat = "identity") +
coord_flip() +
geom_hline(yintercept = 0) +
facet_grid(cols = vars(cluster), switch = "both") +
labs(title = "Influence on the Cluster Assignment", x = "", y = "Cluster") +
theme(axis.text.x = element_blank(),
legend.position = "none")
Sift through the variables to see if any are unused throughout the clusters. If so, this indicates that the variable does not help differentiate the data into clusters. You can remove it here:
# if the student wants to remove a variable enter it here
role_var_rm <- 0
# reproducing roleKMeans without the removed variables
set.seed(100)
roleKMeans <- roleKMeans_prep %>%
select(-all_of(role_var_rm)) %>%
kmeans(centers = stu_cluster, nstart = 50)
# all_of() avoids the tidyselect deprecation warning about external vectors
role <- role %>% select(-all_of(role_var_rm))
role_rm <- role_rm %>% select(-all_of(role_var_rm))
If you chose a large number of clusters, it may be difficult to use this visualization to remove unimportant variables. Instead, you should be able to see some of the important attributes of each of the clusters. Be thinking of identifiers for each cluster. Which variables are important throughout?
Let’s begin to analyze the numeric values of the centers. Look through each cluster’s characteristics. What sticks out to you?
role_summary <- role %>% summarise(
across(where(is.numeric), mean)) %>%
mutate(
Clusters = "Data Average"
) %>% relocate(Clusters)
roleKcenters <- as_tibble(roleKMeans$cluster) %>%
mutate(Name = role$Name) %>%
rename(Clusters = value) %>% left_join(role, by = "Name") %>%
group_by(Clusters) %>%
summarise(
across(where(is.numeric), mean)
) %>%
mutate(
Clusters = as.character(Clusters)
) %>% bind_rows(role_summary) %>%
mutate(across(where(is.numeric), ~ round(.x, digits = 3)),
Height = round(Height, digits = 1),
Weight = round(Weight, digits = 1))
roleKcenters %>%
reactable(
defaultColDef = colDef(
cell = color_tiles(.)))
Which clusters are scorers? Which are rebounders? Which have higher assist numbers? Higher 3-point shooting? Are any two clusters similar? What differentiates them?
At this point, give a short descriptor of each cluster. Each cluster should be uniquely described.
Let’s look at the size of each cluster.
roleKMeans$size %>% as_tibble() %>%
rename(Size = value) %>%
mutate(Cluster = 1:n()) %>%
relocate(Cluster, .before = Size) %>%
flextable() %>%
align(align = "center", part = "all")
Cluster | Size |
|---|---|
1 | 51 |
2 | 45 |
3 | 96 |
4 | 26 |
5 | 20 |
6 | 38 |
7 | 98 |
Does this surprise you? Which clusters are large and small? Does this fit with your perception of the makeup of NBA teams?
Let’s look at the distribution of the players.
rolefviz <- fviz_cluster(roleKMeans, roleKMeans_prep,
geom = "point",
show.clust.cent = TRUE, stand = FALSE,
pointsize = 1,
main = "Role K Clusters")
rolefviz
What do you notice from the visualization? Remember, the dimensions cannot represent all the data, so we may have clusters that overlap. Imagine that there is a third dimension “Z” that explains another 30%-40% of the data.
Where are the cluster centers and outliers? Which clusters seem to be the closest together? Furthest away? Are any clusters more isolated than others? Is this supported by your previous analysis?
If you had to add another cluster, where would it be? If you had to remove a cluster, which one would it be?
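The caveat above, that two plotted dimensions cannot represent all the data, can be quantified. As far as we know, `fviz_cluster()` draws the first two principal components when the data have more than two variables, so the share of variance those two components capture tells you how much the plot leaves out. The sketch below is an illustration on the built-in `iris` measurements rather than `roleKMeans_prep`; `toy`, `pca`, and `var_explained` are names invented for the example.

```r
# Illustration on built-in data; object names are hypothetical, not from the module.
toy <- scale(iris[, 1:4])                 # standardized numeric columns

pca <- prcomp(toy)

# Proportion of total variance carried by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Share of the variance that a two-dimensional cluster plot can show.
sum(var_explained[1:2])
```

If the first two components cover only part of the variance, clusters that overlap on the plot may still be well separated along the dimensions that are not drawn.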
Let’s look at our prototype and outlier analysis.
First, we need to verify that our prototypes and outliers really are prototypes and outliers. Now that we can change the number of clusters, it’s possible that you have some pretty small clusters. With a smaller sample size, we want to ensure that all our prototypes are indeed close to the cluster center and that all our outliers are indeed far away. In our K = 2 usage analysis, our prototypes were about 1-2.3 units away from the center, and our outliers were about 6-8.5 units away. However, as K increases, the outlier distances should fall. Let’s look at the distances from the center of our top 3 prototypes and outliers from each cluster to see how they compare.
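The distance in question is the Euclidean distance, in standardized units, between a player and the center of the cluster that player was assigned to. Here is a minimal base-R sketch of that calculation on the built-in `mtcars` data; the names `toy`, `km`, `own_center`, and `dist_to_center` are assumptions made for the illustration and are not objects from this module.

```r
# Illustration on built-in data; object names are hypothetical, not from the module.
toy <- scale(mtcars)

set.seed(100)
km <- kmeans(toy, centers = 3, nstart = 50)

# Row i of own_center is the center of the cluster that observation i belongs to.
own_center <- km$centers[km$cluster, ]

# Euclidean distance from each observation to its own cluster center.
dist_to_center <- sqrt(rowSums((toy - own_center)^2))

# Sanity check: the squared distances add up to the total within-cluster
# sum of squares reported by kmeans().
all.equal(sum(dist_to_center^2), km$tot.withinss)
```

Small distances mark candidate prototypes and large distances mark candidate outliers, which is the ranking used in the table that follows.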
# cluster centers (in standardized units), labeled with their cluster number
roleKMeans_scale <- as_tibble(roleKMeans$centers) %>%
mutate(cluster = 1:n())
# creating appropriate tibble for distance formula
role_fittedKMeans <- roleKMeans$cluster %>%
as_tibble() %>%
rename(cluster = value) %>% left_join(roleKMeans_scale) %>% select(-cluster)
Joining with `by = join_by(cluster)`
# distance from cluster center
distances <- sqrt(rowSums((role_rm - role_fittedKMeans)^ 2)) %>%
as_tibble() %>%
rename(distance = value) %>%
mutate(
Name = role$Name,
Cluster = roleKMeans$cluster)
master_distances <- distances %>%
group_by(Cluster) %>%
mutate(
outlier_rank = order(order(distance, decreasing=TRUE)),
proto_rank = order(order(distance, decreasing = FALSE))) %>%
filter(outlier_rank < 4 | proto_rank < 4) %>%
mutate(
Category = if_else(proto_rank < 4, "Prototype", "Outlier")
) %>%
arrange(Cluster, distance) %>%
select(-outlier_rank, -proto_rank) %>%
relocate(distance, .after = Category) %>%
relocate(Name, .after = Category)
master_distances %>%
mutate(distance = round(distance, digits = 4)) %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 3, width = 1.3)
Cluster | Category | Name | distance |
|---|---|---|---|
1 | Prototype | Trendon Watford | 1.8523 |
1 | Prototype | Isaiah Roby | 1.8794 |
1 | Prototype | John Collins | 2.0773 |
1 | Outlier | Isaiah Jackson | 5.3644 |
1 | Outlier | Tristan Thompson | 6.8757 |
1 | Outlier | Jakob Poeltl | 7.1035 |
2 | Prototype | Eric Bledsoe | 1.6470 |
2 | Prototype | Marcus Smart | 1.6596 |
2 | Prototype | Raul Neto | 1.7083 |
2 | Outlier | Josh Giddey | 3.7481 |
2 | Outlier | Jose Alvarado | 3.8303 |
2 | Outlier | Draymond Green | 5.0848 |
3 | Prototype | Coby White | 1.3340 |
3 | Prototype | Saddiq Bey | 1.4581 |
3 | Prototype | Lonnie Walker IV | 1.4612 |
3 | Outlier | Mike Muscala | 4.0470 |
3 | Outlier | Klay Thompson | 4.1396 |
3 | Outlier | Kevin Love | 4.4379 |
4 | Prototype | Ivica Zubac | 1.9336 |
4 | Prototype | Bismack Biyombo | 1.9389 |
4 | Prototype | Nic Claxton | 2.3272 |
4 | Outlier | Rudy Gobert | 4.6189 |
4 | Outlier | JaVale McGee | 4.6530 |
4 | Outlier | Thaddeus Young | 5.0444 |
5 | Prototype | Karl-Anthony Towns | 2.0831 |
5 | Prototype | Pascal Siakam | 2.4719 |
5 | Prototype | Jonas Valanciunas | 2.5886 |
5 | Outlier | Giannis Antetokounmpo | 5.5937 |
5 | Outlier | Joel Embiid | 5.9766 |
5 | Outlier | DeMarcus Cousins | 6.0388 |
6 | Prototype | Khris Middleton | 1.5916 |
6 | Prototype | Bradley Beal | 1.6040 |
6 | Prototype | Jaylen Brown | 1.8622 |
6 | Outlier | James Harden | 4.0727 |
6 | Outlier | Luka Doncic | 4.0863 |
6 | Outlier | Trae Young | 4.1980 |
7 | Prototype | Torrey Craig | 1.3072 |
7 | Prototype | Torrey Craig1 | 1.6834 |
7 | Prototype | CJ Elleby | 1.7221 |
7 | Outlier | Xavier Tillman | 4.2227 |
7 | Outlier | Thaddeus Young1 | 4.2333 |
7 | Outlier | Gary Payton II | 5.0684 |
Which prototypes are the strongest prototypes? Which prototypes do you trust the most? Which are the strongest outliers? Would you disqualify any outliers or prototypes from the analysis (e.g., a supposed outlier that is not far enough from the center, or a labeled prototype that is too far from it)?
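The labels in the table come from ranking distances within each cluster: the smallest ranks are prototypes and the largest are outliers. The module does this with `order(order(distance))`, which gives the same result as a rank when there are no ties. The toy example below uses made-up distances and hypothetical column names to show the idea in base R, keeping only the single closest and single farthest observation per cluster.

```r
# Made-up distances for six observations in two clusters (illustration only).
toy <- data.frame(
  name     = c("a", "b", "c", "d", "e", "f"),
  cluster  = c(1, 1, 1, 2, 2, 2),
  distance = c(0.5, 2.0, 1.2, 3.1, 0.9, 1.7)
)

# Rank distances within each cluster: 1 = closest to the center.
toy$proto_rank <- ave(toy$distance, toy$cluster, FUN = rank)

# Rank in the opposite direction: 1 = farthest from the center.
toy$outlier_rank <- ave(-toy$distance, toy$cluster, FUN = rank)

# Label the closest observation a prototype and the farthest an outlier.
toy$category <- ifelse(toy$proto_rank == 1, "Prototype",
                       ifelse(toy$outlier_rank == 1, "Outlier", NA))

toy[!is.na(toy$category), ]
```

In a very small cluster the same observation could rank near the top of both lists, which is one reason to check the actual distances before trusting a label.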
Is this too long? I could remove the two long outlier tables and use only the shorter one.
If you wish to disqualify a player from analysis, do it here:
Provide a space for the student to remove players from the analysis. Assume the student disqualifies Nic Claxton, just for the heck of it.
disqualify <- c("Nic Claxton")
roleKMeans$size %>% as_tibble() %>%
rename(Size = value) %>%
mutate(Cluster = 1:n()) %>%
relocate(Cluster, .before = Size) %>%
flextable() %>%
align(align = "center", part = "all")
Cluster | Size |
|---|---|
1 | 51 |
2 | 45 |
3 | 96 |
4 | 26 |
5 | 20 |
6 | 38 |
7 | 98 |
Look again at the size of each cluster. Does this help explain any of your findings?
These outliers can be very different from each other. We’ll need to look into them to see what kind of players they are. Once again, we’ll show you the top 3 of each category first, and afterward a smaller table with only the top player.
# creating a master document with all of the prototypes and all of the outliers.
mast_dist_slice <- distances %>%
group_by(Cluster) %>%
mutate(
outlier_rank = order(order(distance, decreasing=TRUE)),
proto_rank = order(order(distance, decreasing = FALSE))) %>%
filter(outlier_rank < 4 | proto_rank < 4) %>%
mutate(
Category = if_else(proto_rank < 4, "Prototype", "Outlier")
) %>%
select(Name, Cluster, Category) %>%
left_join(role) %>% arrange(Cluster, desc(Category)) %>%
filter(!Name %in% disqualify)
Joining with `by = join_by(Name)`
mast_dist_slice %>%
mutate(across(where(is.numeric), ~round(.x, digits = 3))) %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3) %>%
width(j = c(4:7), width = .6) %>%
width(j = c(8:14), width = .95)
Name | Cluster | Category | POS | Team | Height | Weight | PTSPerMin | ASTPerMin | TOPerMin | STLPerMin | ORPerMin | DRPerMin | BLKPerMin | PFPerMin | FGP | FGMPerMin | FGAPerMin | 3PP | 3PMPerMin | 3PAPerMin | FTP | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
John Collins | 1 | Prototype | PF | atl | 81 | 235 | 0.526 | 0.058 | 0.036 | 0.019 | 0.055 | 0.198 | 0.032 | 0.097 | 0.526 | 0.205 | 0.386 | 0.364 | 0.039 | 0.107 | 0.793 | 0.081 | 0.101 |
Isaiah Roby | 1 | Prototype | PF | okc | 80 | 230 | 0.479 | 0.076 | 0.047 | 0.038 | 0.081 | 0.152 | 0.038 | 0.114 | 0.514 | 0.175 | 0.341 | 0.444 | 0.047 | 0.104 | 0.672 | 0.081 | 0.123 |
Trendon Watford | 1 | Prototype | PF | por | 81 | 240 | 0.420 | 0.094 | 0.050 | 0.028 | 0.066 | 0.166 | 0.033 | 0.133 | 0.532 | 0.166 | 0.309 | 0.237 | 0.011 | 0.044 | 0.755 | 0.083 | 0.110 |
Isaiah Jackson | 1 | Outlier | F | ind | 82 | 205 | 0.553 | 0.020 | 0.073 | 0.047 | 0.113 | 0.167 | 0.093 | 0.173 | 0.563 | 0.213 | 0.380 | 0.313 | 0.007 | 0.027 | 0.682 | 0.113 | 0.160 |
Tristan Thompson | 1 | Outlier | C | sac | 81 | 254 | 0.408 | 0.039 | 0.066 | 0.026 | 0.158 | 0.197 | 0.026 | 0.112 | 0.503 | 0.164 | 0.329 | 1.000 | 0.000 | 0.000 | 0.533 | 0.072 | 0.132 |
Jakob Poeltl | 1 | Outlier | C | sa | 85 | 245 | 0.466 | 0.097 | 0.055 | 0.024 | 0.134 | 0.190 | 0.059 | 0.107 | 0.618 | 0.207 | 0.338 | 1.000 | 0.000 | 0.000 | 0.495 | 0.048 | 0.097 |
Marcus Smart | 2 | Prototype | PG | bos | 75 | 220 | 0.375 | 0.183 | 0.068 | 0.053 | 0.019 | 0.099 | 0.009 | 0.071 | 0.418 | 0.130 | 0.313 | 0.331 | 0.053 | 0.158 | 0.793 | 0.062 | 0.077 |
Eric Bledsoe | 2 | Prototype | SG | lac | 73 | 214 | 0.393 | 0.167 | 0.083 | 0.052 | 0.020 | 0.115 | 0.016 | 0.063 | 0.421 | 0.143 | 0.345 | 0.313 | 0.036 | 0.119 | 0.761 | 0.063 | 0.087 |
Raul Neto | 2 | Prototype | PG | wsh | 73 | 180 | 0.383 | 0.158 | 0.056 | 0.041 | 0.010 | 0.087 | 0.000 | 0.077 | 0.463 | 0.148 | 0.321 | 0.292 | 0.026 | 0.087 | 0.769 | 0.061 | 0.077 |
Draymond Green | 2 | Outlier | PF | gs | 78 | 230 | 0.260 | 0.242 | 0.104 | 0.045 | 0.035 | 0.218 | 0.038 | 0.104 | 0.525 | 0.100 | 0.194 | 0.296 | 0.010 | 0.042 | 0.659 | 0.045 | 0.069 |
Jose Alvarado | 2 | Outlier | PG | no | 72 | 179 | 0.396 | 0.182 | 0.045 | 0.084 | 0.032 | 0.091 | 0.006 | 0.091 | 0.446 | 0.156 | 0.351 | 0.291 | 0.039 | 0.130 | 0.679 | 0.045 | 0.065 |
Josh Giddey | 2 | Outlier | SG | okc | 80 | 205 | 0.397 | 0.203 | 0.102 | 0.029 | 0.057 | 0.190 | 0.013 | 0.051 | 0.419 | 0.165 | 0.394 | 0.263 | 0.032 | 0.124 | 0.709 | 0.032 | 0.048 |
Coby White | 3 | Prototype | PG | chi | 77 | 195 | 0.462 | 0.105 | 0.040 | 0.018 | 0.011 | 0.098 | 0.007 | 0.080 | 0.433 | 0.167 | 0.385 | 0.385 | 0.080 | 0.211 | 0.857 | 0.047 | 0.055 |
Saddiq Bey | 3 | Prototype | SF | det | 79 | 215 | 0.488 | 0.085 | 0.036 | 0.027 | 0.039 | 0.124 | 0.006 | 0.048 | 0.396 | 0.167 | 0.421 | 0.346 | 0.079 | 0.224 | 0.827 | 0.079 | 0.094 |
Lonnie Walker IV | 3 | Prototype | G | sa | 76 | 204 | 0.526 | 0.096 | 0.043 | 0.026 | 0.013 | 0.100 | 0.013 | 0.061 | 0.407 | 0.191 | 0.474 | 0.314 | 0.070 | 0.217 | 0.784 | 0.074 | 0.091 |
Kevin Love | 3 | Outlier | PF | cle | 80 | 251 | 0.604 | 0.098 | 0.058 | 0.018 | 0.053 | 0.271 | 0.009 | 0.062 | 0.430 | 0.196 | 0.458 | 0.392 | 0.111 | 0.284 | 0.838 | 0.098 | 0.120 |
Klay Thompson | 3 | Outlier | SG | gs | 78 | 215 | 0.694 | 0.095 | 0.044 | 0.017 | 0.017 | 0.116 | 0.017 | 0.058 | 0.429 | 0.262 | 0.609 | 0.385 | 0.122 | 0.316 | 0.902 | 0.048 | 0.054 |
Mike Muscala | 3 | Outlier | C | okc | 82 | 240 | 0.580 | 0.036 | 0.022 | 0.029 | 0.036 | 0.181 | 0.043 | 0.094 | 0.456 | 0.188 | 0.420 | 0.429 | 0.116 | 0.275 | 0.842 | 0.080 | 0.094 |
Ivica Zubac | 4 | Prototype | C | lac | 84 | 240 | 0.422 | 0.066 | 0.061 | 0.020 | 0.119 | 0.230 | 0.041 | 0.111 | 0.626 | 0.168 | 0.266 | 0.000 | 0.000 | 0.000 | 0.727 | 0.090 | 0.123 |
Bismack Biyombo | 4 | Prototype | C | phx | 80 | 255 | 0.411 | 0.043 | 0.050 | 0.021 | 0.128 | 0.206 | 0.050 | 0.135 | 0.593 | 0.170 | 0.284 | 0.000 | 0.000 | 0.000 | 0.535 | 0.078 | 0.142 |
JaVale McGee | 4 | Outlier | C | phx | 84 | 270 | 0.582 | 0.038 | 0.082 | 0.019 | 0.139 | 0.285 | 0.070 | 0.152 | 0.629 | 0.247 | 0.392 | 0.222 | 0.000 | 0.006 | 0.699 | 0.089 | 0.127 |
Thaddeus Young | 4 | Outlier | PF | sa | 80 | 235 | 0.430 | 0.162 | 0.085 | 0.063 | 0.106 | 0.141 | 0.021 | 0.106 | 0.578 | 0.197 | 0.345 | 0.000 | 0.000 | 0.014 | 0.455 | 0.028 | 0.056 |
Rudy Gobert | 4 | Outlier | C | utah | 85 | 258 | 0.486 | 0.034 | 0.056 | 0.022 | 0.115 | 0.343 | 0.065 | 0.084 | 0.713 | 0.171 | 0.240 | 0.000 | 0.000 | 0.003 | 0.690 | 0.143 | 0.209 |
Karl-Anthony Towns | 5 | Prototype | C | min | 83 | 248 | 0.737 | 0.108 | 0.093 | 0.030 | 0.078 | 0.216 | 0.033 | 0.108 | 0.529 | 0.260 | 0.491 | 0.410 | 0.060 | 0.147 | 0.822 | 0.156 | 0.189 |
Jonas Valanciunas | 5 | Prototype | C | no | 83 | 265 | 0.587 | 0.086 | 0.079 | 0.020 | 0.102 | 0.274 | 0.026 | 0.109 | 0.544 | 0.228 | 0.419 | 0.361 | 0.026 | 0.069 | 0.820 | 0.106 | 0.129 |
Pascal Siakam | 5 | Prototype | PF | tor | 81 | 230 | 0.602 | 0.140 | 0.071 | 0.034 | 0.050 | 0.174 | 0.016 | 0.087 | 0.494 | 0.232 | 0.470 | 0.344 | 0.029 | 0.084 | 0.749 | 0.111 | 0.148 |
DeMarcus Cousins | 5 | Outlier | C | den | 82 | 270 | 0.640 | 0.122 | 0.158 | 0.043 | 0.115 | 0.281 | 0.029 | 0.216 | 0.456 | 0.216 | 0.475 | 0.324 | 0.058 | 0.173 | 0.736 | 0.151 | 0.201 |
Giannis Antetokounmpo | 5 | Outlier | PF | mil | 83 | 242 | 0.909 | 0.176 | 0.100 | 0.033 | 0.061 | 0.292 | 0.043 | 0.097 | 0.553 | 0.313 | 0.565 | 0.293 | 0.033 | 0.109 | 0.722 | 0.252 | 0.347 |
Joel Embiid | 5 | Outlier | C | phi | 84 | 280 | 0.905 | 0.124 | 0.092 | 0.033 | 0.062 | 0.284 | 0.044 | 0.080 | 0.499 | 0.290 | 0.580 | 0.371 | 0.041 | 0.109 | 0.814 | 0.284 | 0.349 |
Jaylen Brown | 6 | Prototype | SG | bos | 78 | 223 | 0.702 | 0.104 | 0.080 | 0.033 | 0.024 | 0.158 | 0.009 | 0.074 | 0.473 | 0.259 | 0.548 | 0.358 | 0.074 | 0.208 | 0.758 | 0.110 | 0.143 |
Khris Middleton | 6 | Prototype | SF | mil | 79 | 222 | 0.620 | 0.167 | 0.090 | 0.037 | 0.019 | 0.148 | 0.009 | 0.074 | 0.443 | 0.210 | 0.478 | 0.373 | 0.077 | 0.204 | 0.890 | 0.120 | 0.136 |
Bradley Beal | 6 | Prototype | SG | wsh | 75 | 207 | 0.644 | 0.183 | 0.094 | 0.025 | 0.028 | 0.106 | 0.011 | 0.067 | 0.451 | 0.242 | 0.536 | 0.300 | 0.044 | 0.147 | 0.833 | 0.117 | 0.142 |
Trae Young | 6 | Outlier | PG | atl | 73 | 180 | 0.814 | 0.278 | 0.115 | 0.026 | 0.020 | 0.089 | 0.003 | 0.049 | 0.460 | 0.269 | 0.582 | 0.382 | 0.089 | 0.229 | 0.904 | 0.189 | 0.209 |
James Harden | 6 | Outlier | SG | bkn | 77 | 220 | 0.608 | 0.276 | 0.130 | 0.035 | 0.027 | 0.189 | 0.019 | 0.065 | 0.414 | 0.178 | 0.432 | 0.332 | 0.062 | 0.189 | 0.869 | 0.186 | 0.216 |
Luka Doncic | 6 | Outlier | PG | dal | 79 | 230 | 0.802 | 0.246 | 0.127 | 0.034 | 0.025 | 0.234 | 0.017 | 0.062 | 0.457 | 0.280 | 0.610 | 0.353 | 0.088 | 0.249 | 0.744 | 0.158 | 0.212 |
Torrey Craig | 7 | Prototype | SF | ind | 79 | 221 | 0.320 | 0.054 | 0.039 | 0.025 | 0.059 | 0.133 | 0.020 | 0.094 | 0.456 | 0.123 | 0.271 | 0.333 | 0.044 | 0.133 | 0.771 | 0.025 | 0.034 |
Torrey Craig1 | 7 | Prototype | SF | phx | 79 | 221 | 0.332 | 0.058 | 0.048 | 0.038 | 0.048 | 0.159 | 0.029 | 0.101 | 0.450 | 0.130 | 0.284 | 0.323 | 0.053 | 0.173 | 0.706 | 0.019 | 0.029 |
CJ Elleby | 7 | Prototype | SG | por | 78 | 200 | 0.287 | 0.074 | 0.050 | 0.030 | 0.054 | 0.139 | 0.015 | 0.099 | 0.393 | 0.104 | 0.262 | 0.294 | 0.030 | 0.109 | 0.714 | 0.050 | 0.069 |
Gary Payton II | 7 | Outlier | SG | gs | 75 | 195 | 0.403 | 0.051 | 0.034 | 0.080 | 0.057 | 0.142 | 0.017 | 0.102 | 0.616 | 0.170 | 0.273 | 0.358 | 0.034 | 0.097 | 0.603 | 0.028 | 0.045 |
Xavier Tillman | 7 | Outlier | C | mem | 80 | 245 | 0.364 | 0.091 | 0.045 | 0.068 | 0.091 | 0.136 | 0.023 | 0.091 | 0.454 | 0.136 | 0.311 | 0.204 | 0.015 | 0.068 | 0.648 | 0.068 | 0.098 |
Thaddeus Young1 | 7 | Outlier | PF | tor | 80 | 235 | 0.344 | 0.093 | 0.044 | 0.066 | 0.082 | 0.158 | 0.022 | 0.093 | 0.465 | 0.142 | 0.301 | 0.395 | 0.038 | 0.093 | 0.481 | 0.027 | 0.055 |
Below is the smaller table.
mast_dist_slice1 <- distances %>%
group_by(Cluster) %>%
mutate(
outlier_rank = order(order(distance, decreasing=TRUE)),
proto_rank = order(order(distance, decreasing = FALSE))) %>%
filter(outlier_rank < 2 | proto_rank < 2) %>%
mutate(
Category = if_else(proto_rank < 2, "Prototype", "Outlier")
) %>%
select(Name, Cluster, Category) %>%
left_join(role) %>% arrange(desc(Category), Cluster) %>%
filter(!Name %in% disqualify)
Joining with `by = join_by(Name)`
mast_dist_slice1 %>%
mutate(across(where(is.numeric), ~round(.x, digits = 3))) %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3) %>%
width(j = c(4:7), width = .6) %>%
width(j = c(8:14), width = .95)
Name | Cluster | Category | POS | Team | Height | Weight | PTSPerMin | ASTPerMin | TOPerMin | STLPerMin | ORPerMin | DRPerMin | BLKPerMin | PFPerMin | FGP | FGMPerMin | FGAPerMin | 3PP | 3PMPerMin | 3PAPerMin | FTP | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Trendon Watford | 1 | Prototype | PF | por | 81 | 240 | 0.420 | 0.094 | 0.050 | 0.028 | 0.066 | 0.166 | 0.033 | 0.133 | 0.532 | 0.166 | 0.309 | 0.237 | 0.011 | 0.044 | 0.755 | 0.083 | 0.110 |
Eric Bledsoe | 2 | Prototype | SG | lac | 73 | 214 | 0.393 | 0.167 | 0.083 | 0.052 | 0.020 | 0.115 | 0.016 | 0.063 | 0.421 | 0.143 | 0.345 | 0.313 | 0.036 | 0.119 | 0.761 | 0.063 | 0.087 |
Coby White | 3 | Prototype | PG | chi | 77 | 195 | 0.462 | 0.105 | 0.040 | 0.018 | 0.011 | 0.098 | 0.007 | 0.080 | 0.433 | 0.167 | 0.385 | 0.385 | 0.080 | 0.211 | 0.857 | 0.047 | 0.055 |
Ivica Zubac | 4 | Prototype | C | lac | 84 | 240 | 0.422 | 0.066 | 0.061 | 0.020 | 0.119 | 0.230 | 0.041 | 0.111 | 0.626 | 0.168 | 0.266 | 0.000 | 0.000 | 0.000 | 0.727 | 0.090 | 0.123 |
Karl-Anthony Towns | 5 | Prototype | C | min | 83 | 248 | 0.737 | 0.108 | 0.093 | 0.030 | 0.078 | 0.216 | 0.033 | 0.108 | 0.529 | 0.260 | 0.491 | 0.410 | 0.060 | 0.147 | 0.822 | 0.156 | 0.189 |
Khris Middleton | 6 | Prototype | SF | mil | 79 | 222 | 0.620 | 0.167 | 0.090 | 0.037 | 0.019 | 0.148 | 0.009 | 0.074 | 0.443 | 0.210 | 0.478 | 0.373 | 0.077 | 0.204 | 0.890 | 0.120 | 0.136 |
Torrey Craig | 7 | Prototype | SF | ind | 79 | 221 | 0.320 | 0.054 | 0.039 | 0.025 | 0.059 | 0.133 | 0.020 | 0.094 | 0.456 | 0.123 | 0.271 | 0.333 | 0.044 | 0.133 | 0.771 | 0.025 | 0.034 |
Jakob Poeltl | 1 | Outlier | C | sa | 85 | 245 | 0.466 | 0.097 | 0.055 | 0.024 | 0.134 | 0.190 | 0.059 | 0.107 | 0.618 | 0.207 | 0.338 | 1.000 | 0.000 | 0.000 | 0.495 | 0.048 | 0.097 |
Draymond Green | 2 | Outlier | PF | gs | 78 | 230 | 0.260 | 0.242 | 0.104 | 0.045 | 0.035 | 0.218 | 0.038 | 0.104 | 0.525 | 0.100 | 0.194 | 0.296 | 0.010 | 0.042 | 0.659 | 0.045 | 0.069 |
Kevin Love | 3 | Outlier | PF | cle | 80 | 251 | 0.604 | 0.098 | 0.058 | 0.018 | 0.053 | 0.271 | 0.009 | 0.062 | 0.430 | 0.196 | 0.458 | 0.392 | 0.111 | 0.284 | 0.838 | 0.098 | 0.120 |
Thaddeus Young | 4 | Outlier | PF | sa | 80 | 235 | 0.430 | 0.162 | 0.085 | 0.063 | 0.106 | 0.141 | 0.021 | 0.106 | 0.578 | 0.197 | 0.345 | 0.000 | 0.000 | 0.014 | 0.455 | 0.028 | 0.056 |
DeMarcus Cousins | 5 | Outlier | C | den | 82 | 270 | 0.640 | 0.122 | 0.158 | 0.043 | 0.115 | 0.281 | 0.029 | 0.216 | 0.456 | 0.216 | 0.475 | 0.324 | 0.058 | 0.173 | 0.736 | 0.151 | 0.201 |
Trae Young | 6 | Outlier | PG | atl | 73 | 180 | 0.814 | 0.278 | 0.115 | 0.026 | 0.020 | 0.089 | 0.003 | 0.049 | 0.460 | 0.269 | 0.582 | 0.382 | 0.089 | 0.229 | 0.904 | 0.189 | 0.209 |
Gary Payton II | 7 | Outlier | SG | gs | 75 | 195 | 0.403 | 0.051 | 0.034 | 0.080 | 0.057 | 0.142 | 0.017 | 0.102 | 0.616 | 0.170 | 0.273 | 0.358 | 0.034 | 0.097 | 0.603 | 0.028 | 0.045 |
Look through the prototypes and outliers. Compare their results with your previous findings. Do the prototypes of each cluster match up with your summary of the cluster? How do the outliers fit in? Two outliers can be very different. Pick a few outliers and determine their closest two clusters.
rolefviz
Analyze the K = 7 clusters as a whole. Are the clusters good? Do they have high intra-class similarity? What about low inter-class similarity? If you were to do the analysis again, would you choose the same number of clusters?
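Two numbers stored in every `kmeans()` result give a rough, quantitative handle on these questions. The total within-cluster sum of squares measures how tightly members sit around their centers (intra-class similarity), and the ratio of between-cluster to total sum of squares measures how far apart the clusters are. The sketch below uses the built-in `mtcars` data and invented names (`toy`, `km`, `separation`); it illustrates the idea and is not the module's own evaluation method.

```r
# Illustration on built-in data; object names are hypothetical, not from the module.
toy <- scale(mtcars)

set.seed(100)
km <- kmeans(toy, centers = 3, nstart = 50)

# Tightness: total within-cluster sum of squares (smaller = more similar members).
km$tot.withinss

# Separation: share of the total variation that lies between clusters
# (closer to 1 = clusters are further apart relative to their spread).
separation <- km$betweenss / km$totss
separation

# The within and between pieces add up to the total sum of squares.
all.equal(km$tot.withinss + km$betweenss, km$totss)
```

The ratio rises whenever K increases, so it is most useful for comparing solutions with the same number of clusters or for spotting where the gains level off.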
Compare lots of Ks
Select two values of K (between 2 and 10) to compare. This table can become very complex. Remember, the rows are the cluster assignment with the first value of K and the columns are the cluster assignment with the second value. Isolate and analyze one row or column at a time.
# let's say the student wants to compare K = 3 and K = 7
stu_clus1 <- 7
stu_clus2 <- 3
# ensures that the first chosen cluster is lower.
if(stu_clus1 > stu_clus2) {
space = stu_clus1
stu_clus1 = stu_clus2
stu_clus2 = space
}
set.seed(100)
roleKMeans <- kmeans(roleKMeans_prep, centers = stu_clus1, nstart = 50)
set.seed(100)
roleK2Means <- kmeans(roleKMeans_prep, centers = stu_clus2, nstart = 50)
# creating a tibble of the cluster of each player for each K
clusters <- tibble(
player = role$Name,
Cluster = roleKMeans$cluster,
clusK2 = roleK2Means$cluster
)
compare_table <- with(clusters, table(Cluster, clusK2)) %>%
as_tibble() %>%
pivot_wider(names_from = clusK2, values_from = n)
# tabulating clusters
compare_table %>%
flextable() %>%
align(align = "center", part = "all")
Cluster | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
1 | 0 | 9 | 17 | 0 | 8 | 38 | 0 |
2 | 1 | 35 | 79 | 0 | 0 | 0 | 90 |
3 | 50 | 1 | 0 | 26 | 12 | 0 | 8 |
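A table like the one above is easier to read one column at a time once the counts are turned into proportions. The base-R sketch below builds the same kind of cross-tabulation on the built-in `iris` measurements, with K = 2 against K = 4, and then shows what share of each larger-K cluster comes from each smaller-K cluster. The data set and the names `km_small`, `km_large`, `xtab`, and `col_share` are assumptions made for the illustration.

```r
# Illustration on built-in data; object names are hypothetical, not from the module.
toy <- scale(iris[, 1:4])

set.seed(100)
km_small <- kmeans(toy, centers = 2, nstart = 50)
set.seed(100)
km_large <- kmeans(toy, centers = 4, nstart = 50)

# Rows: cluster under the smaller K. Columns: cluster under the larger K.
xtab <- table(small_K = km_small$cluster, large_K = km_large$cluster)
xtab

# Within each column, the share of members coming from each row.
col_share <- prop.table(xtab, margin = 2)
round(col_share, 2)
```

A column dominated by one row means that larger-K cluster was carved out of a single smaller-K cluster; a column split across rows means the finer solution regrouped players from several broad groups.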
Part 7: GM of Dallas Mavericks
Let’s return to the Dallas Mavericks and take a look at how the Mavericks players were clustered in our role dataset. We’ll use K = 7. If you did not analyze K = 7 earlier, it is worth a look.
Below are a few visual reminders of each cluster’s characteristics.
# initializing our datasets a third time in case student decided to remove a variable
role <- nba %>%
select(Name, POS, Team, Height, Weight, FGP, `3PP`, FTP, PTSPerMin, ORPerMin, DRPerMin, ASTPerMin, STLPerMin, BLKPerMin, TOPerMin, PFPerMin, FGMPerMin, FGAPerMin, `3PMPerMin`, `3PAPerMin`, FTMPerMin, FTAPerMin) %>%
mutate(across(where(is.numeric), ~ round(.x, digits = 4)))
# standardizing the data for KMeans
roleKMeans_prep <- role %>%
mutate(across(where(is.numeric), standardize)) %>%
column_to_rownames(var = "Name") %>%
select(-Team, -POS)
# creating K = 7 K-Means
set.seed(100)
role7Means <- kmeans(roleKMeans_prep, centers = 7, nstart = 50)
# bar graph of centers
as_tibble(role7Means$centers, rownames = "cluster") %>%
pivot_longer(cols = c(Height:FTAPerMin), names_to = "variable") %>%
mutate(variable = factor(variable, role_levels)) %>%
ggplot(aes(x = variable, y = value, fill = cluster)) +
geom_bar(stat = "identity") +
geom_hline(yintercept = 0) +
coord_flip() +
facet_grid(cols = vars(cluster), switch = "both") +
labs(title = "Influence on the Cluster Assignment", x = "", y = "Cluster") +
theme(axis.text.x = element_blank(),
legend.position = "none")
# creating tibble of all the centers
role7centers <- as_tibble(role7Means$cluster) %>%
mutate(Name = role$Name) %>%
rename(Clusters = value) %>% left_join(role, by = "Name") %>%
group_by(Clusters) %>%
summarise(
across(where(is.numeric), mean)) %>%
mutate(Clusters = as.character(Clusters)) %>%
bind_rows(role_summary) %>%
mutate(across(where(is.numeric), ~ round(.x, digits = 3)),
Height = round(Height, digits = 1),
Weight = round(Weight, digits = 1))
# printing conditional formatting table
role7centers %>%
reactable(
defaultColDef = colDef(
cell = color_tiles(.)
))
Before moving on, fill out this table to describe each cluster. Write a few descriptive words that distinguish each cluster. This will help you to organize your thoughts on each cluster. If you already completed this for K = 7 in the role dataset, then you are free to proceed.
stu_table <- tibble(
Cluster = 1:7,
Description = "")
stu_table %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 2, width = 4)
Cluster | Description |
|---|---|
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
Caleb’s estimation of 7 clusters. I’d like to provide them a blank table to fill out somehow. Like a text file table with two columns.
Caleb_table <- tibble(
Cluster = 1:7,
Description = c("big men, mediocre scorers, kinda shoot deep",
"small point guards, facilitators",
"meh players, 3 point shooters",
"big men, can't shoot deep at all",
"high-volume players, generally tall",
"high-volume players, average height",
"low production, very mediocre, likely corner 3 players"))
Caleb_table %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 2, width = 4)
Cluster | Description |
|---|---|
1 | big men, mediocre scorers, kinda shoot deep |
2 | small point guards, facilitators |
3 | meh players, 3 point shooters |
4 | big men, can't shoot deep at all |
5 | high-volume players, generally tall |
6 | high-volume players, average height |
7 | low production, very mediocre, likely corner 3 players |
Mavericks Offseason Analysis
Now, let’s look at the cluster assignments of our ten Dallas Mavericks players.
role7Means_players <- role7Means$cluster %>%
as_tibble() %>%
rename(Cluster = value) %>%
mutate(
Name = role$Name
) %>%
left_join(role, by = "Name") %>%
left_join(usage %>% select(Name, MIN), by = "Name") %>%
relocate(Cluster, .after = Name) %>%
relocate(MIN, .after = POS) %>%
arrange(Cluster)
dallas_role2022 <- role7Means_players %>%
filter(Team == "dal") %>%
select(-Team)
dallas_role2022 %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3) %>%
width(j = c(2:10), width = .6) %>%
width(j = c(11:14), width = .95)
Name | Cluster | POS | MIN | Height | Weight | FGP | 3PP | FTP | PTSPerMin | ORPerMin | DRPerMin | ASTPerMin | STLPerMin | BLKPerMin | TOPerMin | PFPerMin | FGMPerMin | FGAPerMin | 3PMPerMin | 3PAPerMin | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Dwight Powell | 1 | C | 21.9 | 82 | 240 | 0.671 | 0.351 | 0.783 | 0.3973 | 0.0959 | 0.1279 | 0.0548 | 0.0228 | 0.0228 | 0.0365 | 0.1233 | 0.1507 | 0.2237 | 0.0091 | 0.0228 | 0.0913 | 0.1187 |
Jalen Brunson | 2 | PG | 31.9 | 73 | 190 | 0.502 | 0.373 | 0.840 | 0.5110 | 0.0157 | 0.1066 | 0.1505 | 0.0251 | 0.0000 | 0.0502 | 0.0596 | 0.2006 | 0.4013 | 0.0376 | 0.1003 | 0.0721 | 0.0846 |
Tim Hardaway Jr. | 3 | SF | 29.6 | 77 | 205 | 0.394 | 0.336 | 0.757 | 0.4797 | 0.0101 | 0.1149 | 0.0743 | 0.0304 | 0.0034 | 0.0270 | 0.0608 | 0.1689 | 0.4257 | 0.0811 | 0.2432 | 0.0642 | 0.0845 |
Kristaps Porzingis | 5 | C | 29.5 | 87 | 240 | 0.451 | 0.283 | 0.865 | 0.6508 | 0.0644 | 0.1966 | 0.0678 | 0.0237 | 0.0576 | 0.0542 | 0.0881 | 0.2271 | 0.5051 | 0.0475 | 0.1729 | 0.1458 | 0.1695 |
Luka Doncic | 6 | PG | 35.4 | 79 | 230 | 0.457 | 0.353 | 0.744 | 0.8023 | 0.0254 | 0.2345 | 0.2458 | 0.0339 | 0.0169 | 0.1271 | 0.0621 | 0.2797 | 0.6102 | 0.0876 | 0.2486 | 0.1582 | 0.2119 |
Dorian Finney-Smith | 7 | PF | 33.1 | 79 | 220 | 0.471 | 0.395 | 0.675 | 0.3323 | 0.0453 | 0.0967 | 0.0574 | 0.0332 | 0.0151 | 0.0302 | 0.0695 | 0.1239 | 0.2628 | 0.0665 | 0.1631 | 0.0211 | 0.0302 |
Reggie Bullock | 7 | SF | 28.0 | 78 | 205 | 0.401 | 0.360 | 0.833 | 0.3071 | 0.0179 | 0.1107 | 0.0429 | 0.0214 | 0.0071 | 0.0214 | 0.0571 | 0.1071 | 0.2643 | 0.0750 | 0.2071 | 0.0214 | 0.0250 |
Maxi Kleber | 7 | PF | 24.6 | 82 | 240 | 0.398 | 0.325 | 0.708 | 0.2846 | 0.0488 | 0.1911 | 0.0488 | 0.0203 | 0.0407 | 0.0325 | 0.0935 | 0.0976 | 0.2439 | 0.0569 | 0.1748 | 0.0325 | 0.0447 |
Josh Green | 7 | SG | 15.5 | 77 | 200 | 0.508 | 0.359 | 0.689 | 0.3097 | 0.0516 | 0.1032 | 0.0774 | 0.0452 | 0.0129 | 0.0452 | 0.1097 | 0.1226 | 0.2452 | 0.0258 | 0.0774 | 0.0323 | 0.0452 |
Sterling Brown | 7 | SF | 12.8 | 77 | 219 | 0.381 | 0.304 | 0.933 | 0.2578 | 0.0391 | 0.1953 | 0.0547 | 0.0234 | 0.0078 | 0.0391 | 0.0859 | 0.0937 | 0.2500 | 0.0469 | 0.1484 | 0.0234 | 0.0234 |
What do you notice about the player assignments? How many clusters do the Mavericks have represented? Which cluster is the most common on the Mavericks team?
Why is cluster 7 the most common? What kind of player is in cluster 7?
The Mavericks experienced a bit of turnover in the 2022 offseason. They had already traded away C Kristaps Porzingis for PG Spencer Dinwiddie during the 2022 season, and they lost productive PG Jalen Brunson to free agency. They also traded away SF Sterling Brown and other assets for C Christian Wood in the summer of 2022.
Let’s assess the offseason moves of the Dallas Mavericks by looking at the opening day roster for 2023 and its cluster distribution. Below are the eleven players on the Dallas Mavericks roster at Game 1 of the 2023 season, a loss against the Phoenix Suns.
dallas_role2023 <- role7Means_players %>%
filter(Name %in% c("JaVale McGee", "Reggie Bullock", "Dorian Finney-Smith", "Spencer Dinwiddie", "Luka Doncic", "Tim Hardaway Jr.", "Maxi Kleber", "Christian Wood", "Josh Green", "Dwight Powell", "Davis Bertans")) %>%
select(-Team) %>%
arrange(Cluster)
dallas_role2023 %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3) %>%
width(j = c(2:10), width = .6) %>%
width(j = c(11:14), width = .95)
Name | Cluster | POS | MIN | Height | Weight | FGP | 3PP | FTP | PTSPerMin | ORPerMin | DRPerMin | ASTPerMin | STLPerMin | BLKPerMin | TOPerMin | PFPerMin | FGMPerMin | FGAPerMin | 3PMPerMin | 3PAPerMin | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Dwight Powell | 1 | C | 21.9 | 82 | 240 | 0.671 | 0.351 | 0.783 | 0.3973 | 0.0959 | 0.1279 | 0.0548 | 0.0228 | 0.0228 | 0.0365 | 0.1233 | 0.1507 | 0.2237 | 0.0091 | 0.0228 | 0.0913 | 0.1187 |
Tim Hardaway Jr. | 3 | SF | 29.6 | 77 | 205 | 0.394 | 0.336 | 0.757 | 0.4797 | 0.0101 | 0.1149 | 0.0743 | 0.0304 | 0.0034 | 0.0270 | 0.0608 | 0.1689 | 0.4257 | 0.0811 | 0.2432 | 0.0642 | 0.0845 |
Spencer Dinwiddie | 3 | PG | 30.2 | 77 | 215 | 0.376 | 0.310 | 0.811 | 0.4172 | 0.0265 | 0.1291 | 0.1921 | 0.0199 | 0.0066 | 0.0563 | 0.0795 | 0.1391 | 0.3709 | 0.0530 | 0.1689 | 0.0861 | 0.1093 |
Davis Bertans | 3 | SF | 14.7 | 82 | 225 | 0.351 | 0.319 | 0.933 | 0.3878 | 0.0136 | 0.1088 | 0.0340 | 0.0204 | 0.0136 | 0.0272 | 0.1088 | 0.1224 | 0.3401 | 0.0952 | 0.2857 | 0.0544 | 0.0612 |
JaVale McGee | 4 | C | 15.8 | 84 | 270 | 0.629 | 0.222 | 0.699 | 0.5823 | 0.1392 | 0.2848 | 0.0380 | 0.0190 | 0.0696 | 0.0823 | 0.1519 | 0.2468 | 0.3924 | 0.0000 | 0.0063 | 0.0886 | 0.1266 |
Christian Wood | 5 | C | 30.8 | 82 | 214 | 0.501 | 0.390 | 0.623 | 0.5812 | 0.0519 | 0.2760 | 0.0747 | 0.0260 | 0.0325 | 0.0617 | 0.0812 | 0.2110 | 0.4188 | 0.0617 | 0.1591 | 0.0974 | 0.1591 |
Luka Doncic | 6 | PG | 35.4 | 79 | 230 | 0.457 | 0.353 | 0.744 | 0.8023 | 0.0254 | 0.2345 | 0.2458 | 0.0339 | 0.0169 | 0.1271 | 0.0621 | 0.2797 | 0.6102 | 0.0876 | 0.2486 | 0.1582 | 0.2119 |
Dorian Finney-Smith | 7 | PF | 33.1 | 79 | 220 | 0.471 | 0.395 | 0.675 | 0.3323 | 0.0453 | 0.0967 | 0.0574 | 0.0332 | 0.0151 | 0.0302 | 0.0695 | 0.1239 | 0.2628 | 0.0665 | 0.1631 | 0.0211 | 0.0302 |
Reggie Bullock | 7 | SF | 28.0 | 78 | 205 | 0.401 | 0.360 | 0.833 | 0.3071 | 0.0179 | 0.1107 | 0.0429 | 0.0214 | 0.0071 | 0.0214 | 0.0571 | 0.1071 | 0.2643 | 0.0750 | 0.2071 | 0.0214 | 0.0250 |
Maxi Kleber | 7 | PF | 24.6 | 82 | 240 | 0.398 | 0.325 | 0.708 | 0.2846 | 0.0488 | 0.1911 | 0.0488 | 0.0203 | 0.0407 | 0.0325 | 0.0935 | 0.0976 | 0.2439 | 0.0569 | 0.1748 | 0.0325 | 0.0447 |
Josh Green | 7 | SG | 15.5 | 77 | 200 | 0.508 | 0.359 | 0.689 | 0.3097 | 0.0516 | 0.1032 | 0.0774 | 0.0452 | 0.0129 | 0.0452 | 0.1097 | 0.1226 | 0.2452 | 0.0258 | 0.0774 | 0.0323 | 0.0452 |
The roster looks somewhat similar, but what classification of player did the Mavericks have in the 2022 season but lack in the 2023 season? What classification of player did they gain for the 2023 season?
Answer: On net, they lost a cluster 2 player and a cluster 7 player, and gained two cluster 3 players and a cluster 4 player.
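The comparison can be checked by counting players per cluster in each roster and subtracting. The sketch below copies the cluster assignments from the two roster tables above into plain vectors; the vector names are invented for the illustration, and in the module itself the same counts could be taken from `dallas_role2022` and `dallas_role2023`.

```r
# Cluster assignments copied from the two roster tables above.
clusters_2022 <- c(1, 2, 3, 5, 6, 7, 7, 7, 7, 7)
clusters_2023 <- c(1, 3, 3, 3, 4, 5, 6, 7, 7, 7, 7)

# A factor with all seven levels keeps clusters with zero players in the count.
count_2022 <- table(factor(clusters_2022, levels = 1:7))
count_2023 <- table(factor(clusters_2023, levels = 1:7))

# Positive = more players of that cluster in 2023; negative = fewer.
net_change <- as.vector(count_2023 - count_2022)
names(net_change) <- 1:7
net_change
```

Cluster 5 shows no net change because Kristaps Porzingis left and Christian Wood, also a cluster 5 player, arrived.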
What kind of player is in cluster 2? What would losing this kind of player do to a team?
Dallas Mavericks Trade
Let’s say you’re the GM of the Dallas Mavericks after game 1 of the 2022-2023 season. Which players would you consider trading and what cluster of player would you hope to acquire? Which players are you willing to give up?
Answer: I think the correct answer here is to give up any cluster 3 or cluster 7 player for a cluster 2 player. Maxi Kleber is the most expendable because he has some features of clusters 1, 4, and 5 and some of cluster 7, and the team has an excess of these players.
Select four players you are willing to trade and one cluster that you are looking for.
# let's say the student is smart and chooses
trading <- c("Davis Bertans", "Spencer Dinwiddie", "Maxi Kleber", "Dwight Powell")
# and is looking for a player in cluster...
looking <- 2
looking_clus <- role7Means_players %>%
filter(Cluster == looking)
looking_clus %>%
flextable() %>%
align(align = "center", part = "all") %>%
width(j = 1, width = 1.3) %>%
width(j = c(2:5), width = .6) %>%
width(j = c(6:12), width = .95)
Name | Cluster | POS | MIN | Team | Height | Weight | FGP | 3PP | FTP | PTSPerMin | ORPerMin | DRPerMin | ASTPerMin | STLPerMin | BLKPerMin | TOPerMin | PFPerMin | FGMPerMin | FGAPerMin | 3PMPerMin | 3PAPerMin | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Lou Williams | 2 | SG | 14.3 | atl | 73 | 175 | 0.391 | 0.363 | 0.859 | 0.4406 | 0.0210 | 0.0909 | 0.1329 | 0.0350 | 0.0070 | 0.0559 | 0.0629 | 0.1538 | 0.3986 | 0.0490 | 0.1259 | 0.0839 | 0.0979 |
Dennis Schroder | 2 | PG | 29.2 | bos | 75 | 172 | 0.440 | 0.349 | 0.848 | 0.4932 | 0.0205 | 0.0959 | 0.1438 | 0.0274 | 0.0034 | 0.0719 | 0.0822 | 0.1781 | 0.4075 | 0.0479 | 0.1336 | 0.0856 | 0.1027 |
Marcus Smart | 2 | PG | 32.3 | bos | 75 | 220 | 0.418 | 0.331 | 0.793 | 0.3746 | 0.0186 | 0.0991 | 0.1827 | 0.0526 | 0.0093 | 0.0681 | 0.0712 | 0.1300 | 0.3127 | 0.0526 | 0.1579 | 0.0619 | 0.0774 |
Ish Smith | 2 | PG | 13.8 | cha | 72 | 175 | 0.395 | 0.400 | 0.632 | 0.3261 | 0.0217 | 0.0870 | 0.1884 | 0.0362 | 0.0217 | 0.0725 | 0.0652 | 0.1449 | 0.3623 | 0.0217 | 0.0507 | 0.0217 | 0.0362 |
Lonzo Ball | 2 | PG | 34.6 | chi | 78 | 190 | 0.423 | 0.423 | 0.750 | 0.3757 | 0.0289 | 0.1272 | 0.1474 | 0.0520 | 0.0260 | 0.0665 | 0.0694 | 0.1329 | 0.3150 | 0.0896 | 0.2139 | 0.0173 | 0.0231 |
Alex Caruso | 2 | SG | 28.0 | chi | 76 | 186 | 0.398 | 0.333 | 0.795 | 0.2643 | 0.0286 | 0.1000 | 0.1429 | 0.0607 | 0.0143 | 0.0500 | 0.0929 | 0.0893 | 0.2214 | 0.0357 | 0.1107 | 0.0500 | 0.0643 |
Ricky Rubio | 2 | PG | 28.5 | cle | 75 | 190 | 0.363 | 0.339 | 0.854 | 0.4596 | 0.0140 | 0.1298 | 0.2316 | 0.0491 | 0.0070 | 0.0912 | 0.0772 | 0.1544 | 0.4246 | 0.0596 | 0.1789 | 0.0912 | 0.1053 |
Brandon Goodwin | 2 | G | 13.9 | cle | 72 | 180 | 0.416 | 0.345 | 0.632 | 0.3453 | 0.0288 | 0.1079 | 0.1799 | 0.0504 | 0.0000 | 0.0719 | 0.0791 | 0.1295 | 0.3094 | 0.0360 | 0.1079 | 0.0504 | 0.0791 |
Jalen Brunson | 2 | PG | 31.9 | dal | 73 | 190 | 0.502 | 0.373 | 0.840 | 0.5110 | 0.0157 | 0.1066 | 0.1505 | 0.0251 | 0.0000 | 0.0502 | 0.0596 | 0.2006 | 0.4013 | 0.0376 | 0.1003 | 0.0721 | 0.0846 |
Facundo Campazzo | 2 | PG | 18.2 | den | 70 | 195 | 0.361 | 0.301 | 0.769 | 0.2802 | 0.0220 | 0.0769 | 0.1868 | 0.0549 | 0.0220 | 0.0549 | 0.1044 | 0.0879 | 0.2527 | 0.0495 | 0.1648 | 0.0495 | 0.0659 |
Cory Joseph | 2 | PG | 24.6 | det | 75 | 200 | 0.445 | 0.414 | 0.885 | 0.3252 | 0.0163 | 0.0894 | 0.1463 | 0.0244 | 0.0122 | 0.0528 | 0.0935 | 0.1098 | 0.2520 | 0.0407 | 0.0976 | 0.0610 | 0.0691 |
Killian Hayes | 2 | PG | 25.0 | det | 77 | 195 | 0.383 | 0.263 | 0.770 | 0.2760 | 0.0200 | 0.1040 | 0.1680 | 0.0480 | 0.0200 | 0.0680 | 0.1120 | 0.1080 | 0.2800 | 0.0280 | 0.1000 | 0.0360 | 0.0440 |
Saben Lee | 2 | PG | 16.3 | det | 74 | 183 | 0.390 | 0.233 | 0.789 | 0.3436 | 0.0307 | 0.1166 | 0.1779 | 0.0613 | 0.0184 | 0.0613 | 0.0736 | 0.1166 | 0.2945 | 0.0245 | 0.0982 | 0.0920 | 0.1166 |
Draymond Green | 2 | PF | 28.9 | gs | 78 | 230 | 0.525 | 0.296 | 0.659 | 0.2595 | 0.0346 | 0.2180 | 0.2422 | 0.0450 | 0.0381 | 0.1038 | 0.1038 | 0.1003 | 0.1938 | 0.0104 | 0.0415 | 0.0450 | 0.0692 |
Kevin Porter Jr. | 2 | SG | 31.3 | hou | 76 | 203 | 0.415 | 0.375 | 0.642 | 0.4984 | 0.0224 | 0.1182 | 0.1981 | 0.0351 | 0.0128 | 0.0990 | 0.0831 | 0.1757 | 0.4217 | 0.0799 | 0.2173 | 0.0639 | 0.1022 |
Josh Christopher | 2 | SG | 18.0 | hou | 77 | 215 | 0.448 | 0.296 | 0.735 | 0.4389 | 0.0389 | 0.1000 | 0.1111 | 0.0500 | 0.0111 | 0.0833 | 0.0722 | 0.1667 | 0.3778 | 0.0444 | 0.1444 | 0.0611 | 0.0833 |
D.J. Augustin | 2 | G | 15.0 | hou | 71 | 183 | 0.404 | 0.406 | 0.868 | 0.3600 | 0.0133 | 0.0667 | 0.1467 | 0.0200 | 0.0000 | 0.0867 | 0.0333 | 0.1067 | 0.2667 | 0.0733 | 0.1867 | 0.0667 | 0.0733 |
Tyrese Haliburton | 2 | PG | 36.1 | ind | 77 | 185 | 0.502 | 0.416 | 0.849 | 0.4848 | 0.0222 | 0.0970 | 0.2659 | 0.0499 | 0.0166 | 0.0886 | 0.0526 | 0.1717 | 0.3435 | 0.0609 | 0.1468 | 0.0776 | 0.0914 |
T.J. McConnell | 2 | PG | 24.1 | ind | 73 | 190 | 0.481 | 0.303 | 0.826 | 0.3527 | 0.0290 | 0.1079 | 0.2033 | 0.0456 | 0.0166 | 0.0456 | 0.0830 | 0.1535 | 0.3195 | 0.0166 | 0.0498 | 0.0290 | 0.0373 |
Keifer Sykes | 2 | G | 17.7 | ind | 71 | 167 | 0.363 | 0.300 | 0.882 | 0.3164 | 0.0169 | 0.0678 | 0.1073 | 0.0226 | 0.0056 | 0.0565 | 0.0904 | 0.1243 | 0.3333 | 0.0452 | 0.1582 | 0.0282 | 0.0282 |
Eric Bledsoe | 2 | SG | 25.2 | lac | 73 | 214 | 0.421 | 0.313 | 0.761 | 0.3929 | 0.0198 | 0.1151 | 0.1667 | 0.0516 | 0.0159 | 0.0833 | 0.0635 | 0.1429 | 0.3452 | 0.0357 | 0.1190 | 0.0635 | 0.0873 |
De'Anthony Melton | 2 | SG | 22.7 | mem | 74 | 200 | 0.404 | 0.374 | 0.750 | 0.4758 | 0.0396 | 0.1586 | 0.1189 | 0.0617 | 0.0220 | 0.0661 | 0.0793 | 0.1674 | 0.4185 | 0.0837 | 0.2247 | 0.0529 | 0.0705 |
Tyus Jones | 2 | PG | 21.2 | mem | 72 | 196 | 0.451 | 0.390 | 0.818 | 0.4104 | 0.0094 | 0.1038 | 0.2075 | 0.0425 | 0.0000 | 0.0283 | 0.0189 | 0.1604 | 0.3585 | 0.0519 | 0.1321 | 0.0330 | 0.0425 |
Kyle Lowry | 2 | PG | 33.9 | mia | 72 | 196 | 0.440 | 0.377 | 0.851 | 0.3953 | 0.0147 | 0.1180 | 0.2212 | 0.0324 | 0.0088 | 0.0796 | 0.0826 | 0.1298 | 0.2950 | 0.0678 | 0.1799 | 0.0678 | 0.0826 |
Gabe Vincent | 2 | PG | 23.4 | mia | 75 | 200 | 0.417 | 0.368 | 0.815 | 0.3718 | 0.0128 | 0.0641 | 0.1325 | 0.0385 | 0.0085 | 0.0598 | 0.0983 | 0.1325 | 0.3205 | 0.0769 | 0.2051 | 0.0256 | 0.0342 |
Jrue Holiday | 2 | PG | 32.9 | mil | 75 | 205 | 0.501 | 0.411 | 0.761 | 0.5562 | 0.0304 | 0.1064 | 0.2067 | 0.0486 | 0.0122 | 0.0821 | 0.0608 | 0.2158 | 0.4316 | 0.0608 | 0.1459 | 0.0608 | 0.0821 |
Patrick Beverley | 2 | PG | 25.4 | min | 73 | 180 | 0.406 | 0.343 | 0.722 | 0.3622 | 0.0433 | 0.1220 | 0.1811 | 0.0472 | 0.0354 | 0.0512 | 0.1181 | 0.1220 | 0.2953 | 0.0551 | 0.1654 | 0.0669 | 0.0906 |
Jordan McLaughlin | 2 | PG | 14.5 | min | 71 | 185 | 0.440 | 0.318 | 0.750 | 0.2621 | 0.0276 | 0.0828 | 0.2000 | 0.0621 | 0.0138 | 0.0414 | 0.0621 | 0.0966 | 0.2207 | 0.0276 | 0.0966 | 0.0345 | 0.0414 |
Jose Alvarado | 2 | PG | 15.4 | no | 72 | 179 | 0.446 | 0.291 | 0.679 | 0.3961 | 0.0325 | 0.0909 | 0.1818 | 0.0844 | 0.0065 | 0.0455 | 0.0909 | 0.1558 | 0.3506 | 0.0390 | 0.1299 | 0.0455 | 0.0649 |
Josh Giddey | 2 | SG | 31.5 | okc | 80 | 205 | 0.419 | 0.263 | 0.709 | 0.3968 | 0.0571 | 0.1905 | 0.2032 | 0.0286 | 0.0127 | 0.1016 | 0.0508 | 0.1651 | 0.3937 | 0.0317 | 0.1238 | 0.0317 | 0.0476 |
Theo Maledon | 2 | PG | 17.8 | okc | 76 | 175 | 0.375 | 0.293 | 0.790 | 0.3989 | 0.0225 | 0.1236 | 0.1236 | 0.0337 | 0.0112 | 0.0730 | 0.0730 | 0.1292 | 0.3483 | 0.0506 | 0.1629 | 0.0843 | 0.1124 |
Jalen Suggs | 2 | SG | 27.2 | orl | 76 | 205 | 0.361 | 0.214 | 0.773 | 0.4338 | 0.0184 | 0.1103 | 0.1618 | 0.0441 | 0.0147 | 0.1103 | 0.1103 | 0.1507 | 0.4191 | 0.0331 | 0.1507 | 0.0956 | 0.1250 |
R.J. Hampton | 2 | PG | 21.9 | orl | 76 | 175 | 0.383 | 0.350 | 0.641 | 0.3470 | 0.0183 | 0.1233 | 0.1142 | 0.0320 | 0.0091 | 0.0639 | 0.0731 | 0.1233 | 0.3242 | 0.0457 | 0.1324 | 0.0548 | 0.0822 |
Chris Paul | 2 | PG | 32.9 | phx | 72 | 175 | 0.493 | 0.317 | 0.837 | 0.4468 | 0.0091 | 0.1216 | 0.3283 | 0.0578 | 0.0091 | 0.0729 | 0.0638 | 0.1702 | 0.3435 | 0.0304 | 0.0942 | 0.0790 | 0.0942 |
Cameron Payne | 2 | PG | 22.0 | phx | 73 | 183 | 0.409 | 0.336 | 0.843 | 0.4909 | 0.0182 | 0.1182 | 0.2227 | 0.0318 | 0.0136 | 0.0818 | 0.0955 | 0.1864 | 0.4591 | 0.0545 | 0.1636 | 0.0591 | 0.0682 |
Dennis Smith Jr. | 2 | PG | 17.3 | por | 74 | 205 | 0.418 | 0.222 | 0.656 | 0.3237 | 0.0289 | 0.1040 | 0.2081 | 0.0694 | 0.0173 | 0.0809 | 0.0809 | 0.1214 | 0.2948 | 0.0116 | 0.0405 | 0.0636 | 0.0983 |
Tyrese Haliburton1 | 2 | PG | 34.5 | sac | 77 | 185 | 0.457 | 0.413 | 0.837 | 0.4145 | 0.0232 | 0.0899 | 0.2145 | 0.0493 | 0.0203 | 0.0667 | 0.0406 | 0.1536 | 0.3333 | 0.0580 | 0.1420 | 0.0493 | 0.0580 |
Davion Mitchell | 2 | PG | 27.7 | sac | 74 | 205 | 0.418 | 0.316 | 0.659 | 0.4152 | 0.0144 | 0.0650 | 0.1516 | 0.0253 | 0.0108 | 0.0542 | 0.0686 | 0.1697 | 0.4043 | 0.0469 | 0.1552 | 0.0253 | 0.0397 |
Derrick White1 | 2 | PG | 30.3 | sa | 76 | 190 | 0.426 | 0.314 | 0.869 | 0.4752 | 0.0165 | 0.0990 | 0.1848 | 0.0330 | 0.0297 | 0.0594 | 0.0792 | 0.1650 | 0.3828 | 0.0561 | 0.1749 | 0.0924 | 0.1089 |
Tre Jones | 2 | PG | 16.6 | sa | 73 | 185 | 0.490 | 0.196 | 0.780 | 0.3614 | 0.0241 | 0.1084 | 0.2048 | 0.0361 | 0.0060 | 0.0422 | 0.0663 | 0.1446 | 0.2952 | 0.0060 | 0.0422 | 0.0602 | 0.0783 |
Malachi Flynn | 2 | PG | 12.2 | tor | 73 | 175 | 0.393 | 0.333 | 0.625 | 0.3525 | 0.0164 | 0.0984 | 0.1311 | 0.0410 | 0.0082 | 0.0246 | 0.0820 | 0.1311 | 0.3443 | 0.0574 | 0.1639 | 0.0246 | 0.0410 |
Mike Conley | 2 | PG | 28.6 | utah | 73 | 175 | 0.435 | 0.408 | 0.796 | 0.4790 | 0.0245 | 0.0839 | 0.1853 | 0.0455 | 0.0105 | 0.0594 | 0.0699 | 0.1678 | 0.3846 | 0.0804 | 0.2028 | 0.0629 | 0.0804 |
Ish Smith1 | 2 | PG | 22.0 | wsh | 72 | 175 | 0.457 | 0.357 | 0.600 | 0.3909 | 0.0227 | 0.1136 | 0.2364 | 0.0455 | 0.0227 | 0.0682 | 0.0727 | 0.1818 | 0.4000 | 0.0227 | 0.0682 | 0.0045 | 0.0091 |
Raul Neto | 2 | PG | 19.6 | wsh | 73 | 180 | 0.463 | 0.292 | 0.769 | 0.3827 | 0.0102 | 0.0867 | 0.1582 | 0.0408 | 0.0000 | 0.0561 | 0.0765 | 0.1480 | 0.3214 | 0.0255 | 0.0867 | 0.0612 | 0.0765 |
Aaron Holiday | 2 | G | 16.2 | wsh | 72 | 185 | 0.467 | 0.343 | 0.800 | 0.3765 | 0.0123 | 0.0864 | 0.1173 | 0.0370 | 0.0123 | 0.0617 | 0.0926 | 0.1481 | 0.3210 | 0.0370 | 0.0988 | 0.0432 | 0.0556 |
From the list, choose a player you like from a team that has several players of this type. A team with that kind of depth is more likely to part with one of them. Assess the strengths of the relevant players and propose a trade! How does it look?
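To find teams with several players of the target type, it can help to count cluster members by team. The sketch below is a minimal example using a few rows copied from the cluster 2 table above. The data frame `looking_sample` and the columns `n_players`, `team_depth`, and `candidates` are names introduced here for illustration; in the module you would apply the same steps to the full `looking_clus` data frame.

```r
library(dplyr)

# A few rows copied from the cluster 2 table above (stand-in for looking_clus).
looking_sample <- data.frame(
  Name = c("Jalen Brunson", "Cory Joseph", "Killian Hayes", "Saben Lee",
           "Chris Paul", "Cameron Payne", "Mike Conley"),
  Team = c("dal", "det", "det", "det", "phx", "phx", "utah"),
  MIN  = c(31.9, 24.6, 25.0, 16.3, 32.9, 22.0, 28.6)
)

# How many players of this type does each team have?
team_depth <- looking_sample %>%
  count(Team, name = "n_players", sort = TRUE)

# Players on other teams that have at least two players of this type.
candidates <- looking_sample %>%
  add_count(Team, name = "n_players") %>%
  filter(n_players >= 2, Team != "dal") %>%
  arrange(desc(n_players), desc(MIN))

team_depth
candidates
```

The threshold of two players per team is an arbitrary choice for this small sample; with the full table you may want to require three or more.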
Feel free to make the trades as complex as you wish, but try to choose something that the opposing team would agree to.
Defend your proposed trade using the cluster information. You may add in some basketball knowledge if you like.
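When defending a trade, it may help to show where the target player sits relative to other members of the same cluster. The sketch below is one possible approach, again using a few rows copied from the cluster 2 table above. The names `cluster2_sample`, `target`, `profile`, `ASTtoTO`, `PTS_vs_mean`, and `AST_vs_mean` are introduced here for illustration, and the assist-to-turnover ratio is just one convenient summary rather than a statistic used elsewhere in the module.

```r
library(dplyr)

# Per-minute statistics copied from the cluster 2 table above.
cluster2_sample <- data.frame(
  Name      = c("Cory Joseph", "Killian Hayes", "Saben Lee",
                "Chris Paul", "Cameron Payne", "Mike Conley"),
  PTSPerMin = c(0.3252, 0.2760, 0.3436, 0.4468, 0.4909, 0.4790),
  ASTPerMin = c(0.1463, 0.1680, 0.1779, 0.3283, 0.2227, 0.1853),
  TOPerMin  = c(0.0528, 0.0680, 0.0613, 0.0729, 0.0818, 0.0594)
)

# Hypothetical trade target chosen for this example.
target <- "Cory Joseph"

profile <- cluster2_sample %>%
  mutate(
    ASTtoTO     = ASTPerMin / TOPerMin,             # assists per turnover
    PTS_vs_mean = PTSPerMin - mean(PTSPerMin),      # scoring vs. sample mean
    AST_vs_mean = ASTPerMin - mean(ASTPerMin)       # passing vs. sample mean
  ) %>%
  arrange(desc(ASTtoTO))

# Full ranking, then the row for the target player.
profile
profile %>% filter(Name == target)
```

Because the means here come from six players only, the differences describe this sample and not cluster 2 as a whole.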
What do you think of this process? What are the strengths and weaknesses of evaluating a team based on cluster membership?