K-Means Clustering with NBA Data

Author

Caleb Skinner, Tony Munoz, Michael P.B. Gallaugher, Rodney X. Sturdivant

Overview

Cluster analysis is a statistical tool that partitions the observations in a data set into sub-populations with similar characteristics. This process is useful because similar observations often behave and respond to stimuli in similar ways, so identifying clusters allows researchers to predict and draw conclusions about the behavior of particular groups. Cluster analysis appears in many applied fields, including risk analysis, marketing, real estate, insurance, medical research, and earthquake studies.

In this module, we’ll use the clustering of NBA players as an example. Suppose you were an NBA General Manager interested in constructing a high-quality team. The best teams use many different kinds of players to achieve their goals. Golden State Warriors guard Stephen Curry is an incredible shooter and ball-handler, but the Warriors need other kinds of players, too. A team composed entirely of Stephen Curry and his clones would struggle to defend or rebound the ball, and it would struggle to give each Stephen Curry the playing time and shots he has come to expect. Instead, General Managers can separate potential players into groups, which helps them identify their team's needs. This is where cluster analysis proves useful.

For this exercise, imagine that you are the General Manager of the Dallas Mavericks. You are tasked with creating a strong, balanced team. Later in the module, you will have an opportunity to create hypothetical trade scenarios that could benefit the team.

Getting Started

Required Packages

We will be using the following packages in this module. Take the time now to make sure these packages are installed and loaded on your computer.

library("parameters")
library("factoextra")
library("NbClust")
library("cluster")
library("formatR")
library("tidyverse"); theme_set(theme_minimal())
library("ClusterR")
library("mclust")
library("easystats")
library("here")
library("knitr")
library("kableExtra")
library("condformat")
library("formattable")
library("reactablefmtr")
library("scales")
library("plotly")
library("flextable")

The Data

Our data for this exercise comes from the 2021-2022 NBA Season. This season, the Mavericks finished 4th in the Western Conference with 52 wins and 30 losses under coach Jason Kidd. They exceeded expectations and made the Western Conference Finals.

Our data includes 374 players. Each player fulfilled our requirements of appearing in at least 25 games and averaging at least 12 minutes (a complete game is 48) in those games. Because of midseason trades or acquisitions, some players appear in our data twice: they fulfilled the playing-time requirements for two different teams in the same season. The second entry for such a player is marked with a 1 following his name (i.e. Smith becomes Smith1). We’ve divided the variables into two data sets.

The first set of variables are focused on determining the influence a player has on the game. Some of these variables are the players’ minutes per game, total games played and started, points and rebounds per game, and field goal attempts per game. This will be helpful in clustering the players into groups of stars, average starters, and reserves. We’ve termed this data set “usage”. Below is a data dictionary for the first set of variables.

Variable  Explanation                             Example
Name      NBA player's first and last name        Trae Young or Trae Young1
POS       playing position                        PG (point guard), SG (shooting guard), SF (small forward), PF (power forward), C (center)
Team      abbreviation of city of player's team   atl (Atlanta), bos (Boston), etc.
GP        total games played                      46, 70, etc.
GS        total games started                     7, 56, etc.
MIN       minutes per game                        18.2, 30.2, etc.
PTS       points per game                         6.8, 14.9, etc.
AST       assists per game                        1.1, 3.5, etc.
TO        turnovers per game                      0.8, 1.7, etc.
STL       steals per game                         0.5, 1.1, etc.
OR        offensive rebounds per game             0.5, 1.4, etc.
DR        defensive rebounds per game             2.3, 4.1, etc.
BLK       blocks per game                         0.2, 0.6, etc.
PF        personal fouls per game                 1.5, 2.4, etc.
FGM       field goals made per game               2.6, 5.5, etc.
FGA       field goals attempted per game          5.4, 12.2, etc.
3PM       3-point field goals made per game       0.6, 1.9, etc.
3PA       3-point field goals attempted per game  1.9, 5.2, etc.
FTM       free throws made per game               0.8, 2.2, etc.
FTA       free throws attempted per game          1.1, 2.8, etc.
PER       player efficiency rating metric         11.74, 17.27, etc.
SC-EFF    scoring efficiency                      1.162, 1.332, etc.
SH-EFF    shooting efficiency                     0.48, 0.56, etc.

And here is a small slice of the usage data set.

Name               POS  Team  GP  GS  MIN   PTS   AST  TO   STL  OR   DR   BLK  PF   FGM  FGA   3PM  3PA  FTM  FTA  PER    SC-EFF  SH-EFF
Trae Young         PG   atl   76  76  34.9  28.4  9.7  4.0  0.9  0.7  3.1  0.1  1.7  9.4  20.3  3.1  8.0  6.6  7.3  25.48  1.396   0.54
John Collins       PF   atl   54  53  30.8  16.2  1.8  1.1  0.6  1.7  6.1  1.0  3.0  6.3  11.9  1.2  3.3  2.5  3.1  18.75  1.360   0.58
Bogdan Bogdanovic  SG   atl   63  27  29.3  15.1  3.1  1.1  1.1  0.5  3.5  0.2  2.1  5.4  12.6  2.7  7.3  1.5  1.8  15.49  1.196   0.54
De'Andre Hunter    SF   atl   53  52  29.8  13.4  1.3  1.3  0.7  0.5  2.8  0.4  2.9  4.8  10.8  1.4  3.7  2.4  3.1  10.66  1.233   0.51
Kevin Huerter      SG   atl   74  60  29.6  12.1  2.7  1.2  0.7  0.4  3.0  0.4  2.5  4.7  10.3  2.2  5.6  0.6  0.7  11.91  1.174   0.56

The second set of variables is helpful in determining a player’s role or function in the game. These variables include Field Goal Percentage, Height, and Weight. Many of the common counting statistics have been converted to per-minute values in order to isolate their frequency. These players will be divided into sub-groups like scorers, big men, and wings. We’ve termed this data set “role”. Below is a data dictionary for the second set of variables.

Variable   Explanation                               Example
Name       NBA player's first and last name          Trae Young or Trae Young1
POS        playing position                          PG (point guard), SG (shooting guard), SF (small forward), PF (power forward), C (center)
Team       abbreviation of city of player's team     atl (Atlanta), bos (Boston), etc.
Height     height in inches                          76, 81, etc.
Weight     weight in pounds                          200, 234, etc.
PTSPerMin  points per minute                         0.356, 0.515, etc.
ASTPerMin  assists per minute                        0.055, 0.133, etc.
TOPerMin   turnovers per minute                      0.036, 0.065, etc.
STLPerMin  steals per minute                         0.023, 0.038, etc.
ORPerMin   offensive rebounds per minute             0.022, 0.066, etc.
DRPerMin   defensive rebounds per minute             0.101, 0.175, etc.
BLKPerMin  blocks per minute                         0.009, 0.027, etc.
PFPerMin   fouls per minute                          0.064, 0.099, etc.
FGP        field goal percentage                     0.417, 0.496, etc.
FGMPerMin  field goals made per minute               0.131, 0.192, etc.
FGAPerMin  field goals attempted per minute          0.284, 0.419, etc.
3PP        3-point percentage                        0.306, 0.379, etc.
3PMPerMin  3-point field goals made per minute       0.029, 0.072, etc.
3PAPerMin  3-point field goals attempted per minute  0.094, 0.192, etc.
FTP        free throw percentage                     0.709, 0.842, etc.
FTMPerMin  free throws made per minute               0.039, 0.087, etc.
FTAPerMin  free throws attempted per minute          0.053, 0.112, etc.

And here is a small slice of the role data set.

Name               POS  Team  Height  Weight  PTSPerMin  ASTPerMin  TOPerMin  STLPerMin  ORPerMin  DRPerMin  BLKPerMin  PFPerMin  FGP    FGMPerMin  FGAPerMin  3PP    3PMPerMin  3PAPerMin  FTP    FTMPerMin  FTAPerMin
Trae Young         PG   atl   73      180     0.814      0.278      0.115     0.026      0.020     0.089     0.003      0.049     0.460  0.269      0.582      0.382  0.089      0.229      0.904  0.189      0.209
John Collins       PF   atl   81      235     0.526      0.058      0.036     0.019      0.055     0.198     0.032      0.097     0.526  0.205      0.386      0.364  0.039      0.107      0.793  0.081      0.101
Bogdan Bogdanovic  SG   atl   78      220     0.515      0.106      0.038     0.038      0.017     0.119     0.007      0.072     0.431  0.184      0.430      0.368  0.092      0.249      0.843  0.051      0.061
De'Andre Hunter    SF   atl   80      225     0.450      0.044      0.044     0.023      0.017     0.094     0.013      0.097     0.442  0.161      0.362      0.379  0.047      0.124      0.765  0.081      0.104
Kevin Huerter      SG   atl   79      190     0.409      0.091      0.041     0.024      0.014     0.101     0.014      0.084     0.454  0.159      0.348      0.389  0.074      0.189      0.808  0.020      0.024

Part 1: The Idea of Similarity/Distance - Interactive

Below is a set of ten Dallas Mavericks players from 2021-2022 that met our playing-time restrictions. Kristaps Porzingis was traded in the middle of the season, but he still met our playing-time qualifications for the Dallas Mavericks. For this example, we’ve combined a few of the variables from both the usage and role data sets. Consider the players Sterling Brown, Maxi Kleber, Dwight Powell, and Josh Green.

Name                 Height  Weight  MIN   PTS   OR   DR   AST  STL  BLK  TO   2PA   2PP    3PA  3PP    3PAPerMin  ORPerMin
Luka Doncic          79      230     35.4  28.4  0.9  8.3  8.7  1.2  0.6  4.5  12.8  0.528  8.8  0.353  0.249      0.025
Kristaps Porzingis   87      240     29.5  19.2  1.9  5.8  2.0  0.7  1.7  1.6   9.9  0.537  5.1  0.283  0.173      0.064
Jalen Brunson        73      190     31.9  16.3  0.5  3.4  4.8  0.8  0.0  1.6   9.6  0.545  3.2  0.373  0.100      0.016
Tim Hardaway Jr.     77      205     29.6  14.2  0.3  3.4  2.2  0.9  0.1  0.8   5.4  0.473  7.2  0.336  0.243      0.010
Dorian Finney-Smith  79      220     33.1  11.0  1.5  3.2  1.9  1.1  0.5  1.0   3.2  0.599  5.4  0.395  0.163      0.045
Dwight Powell        82      240     21.9   8.7  2.1  2.8  1.2  0.5  0.5  0.8   4.4  0.703  0.5  0.351  0.023      0.096
Reggie Bullock       78      205     28.0   8.6  0.5  3.1  1.2  0.6  0.2  0.6   1.6  0.550  5.8  0.360  0.207      0.018
Maxi Kleber          82      240     24.6   7.0  1.2  4.7  1.2  0.5  1.0  0.8   1.7  0.586  4.3  0.325  0.175      0.049
Josh Green           77      200     15.5   4.8  0.8  1.6  1.2  0.7  0.2  0.7   2.7  0.573  1.2  0.359  0.077      0.052
Sterling Brown       77      219     12.8   3.3  0.5  2.5  0.7  0.3  0.1  0.5   1.3  0.492  1.9  0.304  0.148      0.039

Exercise 1

  1. For these four players, compare their available statistics.

  2. Which of the four players are most similar kinds of players? Which variables make them similar?

  3. Which variables do they most differ? Which of the four players are the most “different”? Which variables differentiate them the most? Are they similar in any of the categories?

One common and effective way to measure the similarity of two points (or, in this case, players) is the Euclidean distance. In two dimensions, the distance between points \((x_1, y_1)\) and \((x_2, y_2)\) is:

\(d = \sqrt{(x_{2} - x_{1})^{2} + (y_{2} - y_{1})^{2}}\)

You can visualize this as drawing the shortest possible line between two points and then measuring it. Right now, our variables are in different units (inches, pounds, points, percentages, etc.), so we’ll standardize each variable (more on this later) to put them on a common scale. This gives each variable equal weight in the distance calculation.

Below is a table of the distances between each of the players. Match up the player in the column with the player in the row and you’ll find the distance between them. The smaller the value, the more similar the players are.

               Dwight Powell Maxi Kleber Josh Green
Maxi Kleber         4.269475                       
Josh Green          4.554980    4.270914           
Sterling Brown      5.846063    3.940473   3.102775
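A distance matrix like the one above can be reproduced with base R's scale() and dist(). The sketch below uses only three of the variables from the Mavericks table (MIN, PTS, and OR per game), so the resulting distances are illustrative and will not match the matrix above, which was computed on the full set of standardized variables.

```r
# Minutes, points, and offensive rebounds per game for four Mavericks
# (values from the table above); only 3 variables, for illustration.
stats <- rbind(
  "Dwight Powell"  = c(MIN = 21.9, PTS = 8.7, OR = 2.1),
  "Maxi Kleber"    = c(MIN = 24.6, PTS = 7.0, OR = 1.2),
  "Josh Green"     = c(MIN = 15.5, PTS = 4.8, OR = 0.8),
  "Sterling Brown" = c(MIN = 12.8, PTS = 3.3, OR = 0.5)
)

# Standardize each column to mean 0 and sd 1 so the units are comparable,
# then compute the pairwise Euclidean distances.
scaled <- scale(stats)
Distance <- dist(scaled)
round(Distance, 3)
```

The dist object returned here is the same kind of object that fviz_dist() visualizes below.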

Below is a visualization of the distances. As the distances increase, the color changes from red to blue. Players matched with themselves will be dark red, because their distance is 0.

fviz_dist(Distance, gradient = list(low = "indianred3",mid = "white", high = "dodgerblue3"))

Exercise 2

  1. Do the tabulated results agree with your previous assessment?

  2. Which is more accurate: your original assessment or the similarity metric?

Part 2: Performing a Cluster Analysis

Calculating the distance between points is the first step in a distance-based cluster analysis. The players with the smallest distance (or with the most similarity) between them are naturally placed in a cluster together.

How does the clustering actually work? As an illustration, we’ll use a basic plot of the Offensive Rebounds and 3-Point Shooting of our Dallas Mavericks players. Both statistics have been adjusted to per-minute values to account for differences in playing time.

Exercise 3

  1. What do you notice about the data? How would you group the players?

  2. How would you describe these groupings?

  3. In a cluster analysis, every point needs to belong to a cluster. Do any points not seem to have a cluster?

Cluster analysis is the process of partitioning the data into sub-populations or clusters. This is done so that observations in the same cluster are more similar to each other than observations in a different group. These clusters then can be analyzed.

One common, distance-based method for dividing the data into clusters is the K-Means algorithm. K-Means is unsupervised: the clusters are found by the algorithm rather than predetermined by the researcher. In the NBA example, we cannot determine our clusters beforehand; the algorithm may confirm our original intuition, but this is not guaranteed.

The K-Means algorithm assigns observations to clusters so that the sum of squared distances between each observation and the center (or mean) of its cluster is minimized. At the end, the variance of all the points within each cluster is as small as possible. One downside of the K-Means algorithm is that users must specify the number of clusters in advance; this is the parameter K. Let’s say we want to separate our data into K = 2 clusters. The K-Means algorithm goes through four basic steps:

  1. Randomly select two initial cluster centers.
  2. Assign each observation to the closest center.
  3. Calculate the mean of all the observations within each cluster. These cluster means become the new center of each cluster.
  4. Repeat steps 2-3 until no further changes are made.

As these steps are followed, the clusters will move closer and closer to their final positions. Since the first step is to randomly assign cluster centers, the K-Means approach can occasionally yield different results. It’s worth trying it a few different times with different starting points.
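The four steps above can be sketched from scratch in a few lines of R. This is a minimal illustration of the algorithm on made-up two-dimensional data; in practice you would use the built-in kmeans() function, as we do below.

```r
# A from-scratch sketch of the four K-Means steps (Lloyd's algorithm).
simple_kmeans <- function(x, k, iter_max = 100) {
  x <- as.matrix(x)
  # Step 1: randomly pick k observations as the initial centers.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assign <- rep(0L, nrow(x))
  for (i in seq_len(iter_max)) {
    # Step 2: assign each observation to its closest center.
    d <- as.matrix(dist(rbind(centers, x)))[-seq_len(k), seq_len(k), drop = FALSE]
    new_assign <- max.col(-d)
    # Step 4: stop when no assignment changes.
    if (all(new_assign == assign)) break
    assign <- new_assign
    # Step 3: each cluster's mean becomes its new center.
    for (j in seq_len(k)) {
      centers[j, ] <- colMeans(x[assign == j, , drop = FALSE])
    }
  }
  list(cluster = assign, centers = centers)
}

# Two made-up blobs of 20 points each, then cluster them with k = 2.
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Because the initial centers are random, the built-in kmeans() takes an nstart argument that reruns the algorithm from several starting points and keeps the best result.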

Before you look below, provide your estimation of the two clusters of our Dallas Mavericks players. Where would you anticipate the cluster centers to be located?

The code below runs the k-means algorithm. In the kmeans function, the first argument is the data, the second is the number of clusters to be fit (i.e. \(k\)) and nstart is the number of random starting points to use for the algorithm.

set.seed(321)
dallasKMeans_prep <- dallas %>%
  select(Name, `3PAPerMin`, ORPerMin) %>%
  column_to_rownames(var = "Name")

dallas2Means <- kmeans(dallasKMeans_prep, centers = 2, nstart = 50)

dallas2fviz <- fviz_cluster(dallas2Means, dallasKMeans_prep,
                            show.clust.cent = TRUE, stand = FALSE,
                            labelsize = 7, pointsize = 1,
                            main = "Mavericks K = 2 Clusters",
                            xlab = "3 Point Attempts Per Minute",
                            ylab = "Offensive Rebounds Per Minute")
dallas2fviz

Exercise 4

  1. Is this how you would have grouped the players?

  2. Notice the large points in the middle of each cluster. These are the cluster centers. Are they where you expected?

  3. How do you think the groupings will change with three clusters?

We can easily tell K-Means to randomly assign three centers instead, and the process of assigning points to cluster means will continue exactly as before.

set.seed(3)
dallas3Means <- kmeans(dallasKMeans_prep, centers = 3, nstart = 50)

dallas3fviz <- fviz_cluster(dallas3Means, dallasKMeans_prep,
                            show.clust.cent = TRUE, stand = FALSE,
                            labelsize = 7, pointsize = 1,
                            main = "Mavericks K = 3 Clusters",
                            xlab = "3 Point Attempts Per Minute",
                            ylab = "Offensive Rebounds Per Minute")
dallas3fviz

Or four clusters?

set.seed(22329)
dallas4Means <- kmeans(dallasKMeans_prep, centers = 4, nstart = 50)

dallas4fviz <- fviz_cluster(dallas4Means, dallasKMeans_prep,
                            show.clust.cent = TRUE, stand = FALSE,
                            labelsize = 7, pointsize = 1,
                            main = "Mavericks K = 4 Clusters",
                            xlab = "3 Point Attempts Per Minute",
                            ylab = "Offensive Rebounds Per Minute")
dallas4fviz

Exercise 5

  1. What happens to Dwight Powell when we increase K to 4?

  2. Would Dwight be considered an outlier? Why? Is this helpful from a clustering perspective?

Now consider five clusters.

set.seed(102)
dallas5Means <- kmeans(dallasKMeans_prep, centers = 5, nstart = 50)

dallas5fviz <- fviz_cluster(dallas5Means, dallasKMeans_prep,
                            show.clust.cent = TRUE, stand = FALSE,
                            labelsize = 7, pointsize = 1,
                            main = "Mavericks K = 5 Clusters",
                            xlab = "3 Point Attempts Per Minute",
                            ylab = "Offensive Rebounds Per Minute")
dallas5fviz

At some point, the power of clustering the points begins to fade. Does Dwight Powell deserve to be in a cluster of his own? Possibly. Does Reggie Bullock? Definitely not.

Exercise 6

  1. Which of the four values of K did you find most useful or accurate?

  2. Were there ever too few or too many clusters?

Part 3: Choosing the Number of Clusters

So, how can we choose the optimal number of clusters?

It’s helpful to evaluate the effectiveness of the clusters for each value of K. There are plenty of ways to test this effectiveness, but we’ll walk through a common one called the Elbow Method. The Elbow Method totals the squared distances between each observation and the center of its cluster; this total is called the Total Within-Cluster Sum of Squares (TWSS). As K increases and more clusters are added to the model, the TWSS decreases, but eventually each additional cluster adds less and less value. The Elbow Method plots TWSS against K, and the user looks for the point where increasing the number of clusters no longer proves useful. Often, this point looks like an elbow.

fviz_nbclust(dallasKMeans_prep, kmeans, method = "wss", k.max = 9) +
  theme_minimal() +
  labs(title = "The Elbow Method")

The graph demonstrates that the value of each additional cluster decreases as more clusters are added, and the bend in the curve suggests that clusters beyond four add little. Despite being common, the Elbow Method is often ambiguous and difficult to interpret: looking for the bend in this plot, K = 2, K = 3, and K = 4 all seem like reasonable choices.
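The elbow curve can also be built by hand: run kmeans() once for each candidate K and record its tot.withinss component, which is exactly the TWSS described above. The data here is made up for illustration.

```r
# Two made-up blobs of 30 points each.
set.seed(42)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 5), ncol = 2))

# TWSS for K = 1..9; nstart = 25 guards against bad random starts.
wss <- sapply(1:9, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

plot(1:9, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares (TWSS)")
```

Since the toy data has two true groups, this curve drops sharply from K = 1 to K = 2 and then flattens, which is the elbow shape described above.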

The Elbow plot is just one test to determine the optimal number of clusters. Two other popular methods are the Average Silhouette Method and the Gap Statistic Method. In all, there are dozens of methods to determine the ideal number of clusters and they often disagree. We’ll take a consensus of 27 methods and proceed from there.

dallasClust <- n_clusters(dallasKMeans_prep,
                          package = c("easystats", "NbClust"),
                          standardize = FALSE, n_max = 5)

plot(dallasClust)

The tests give varied estimates for the optimal number of clusters, but it is ultimately up to the user to decide how many clusters to include in the K-Means algorithm. It’s common practice to choose several values and compare the results of each.

From there, we would conduct our analysis of each cluster and examine the results.

After the clustering is completed, how can we analyze our clustering solution?

We want to reduce the Total Within-Cluster Sum of Squares (TWSS), the distance from each observation to its cluster mean, but we also want to keep the total number of clusters small.

Two helpful measurements to summarize these preferences for our clusters are intra-class similarity and inter-class similarity.

Intra-class similarity tests the relationship between observations of the same cluster. We want this similarity to be high. We want all the observations in a cluster to exhibit similar features.

Inter-class similarity tests the relationship between different clusters. We want this relationship to be low. Ideally, each cluster is distinct and the observations within can be clearly assigned to a cluster.

As we increase the number of clusters K, the intra-class similarity will increase, because observations are assigned to smaller clusters that are more representative. However, the inter-class similarity will also increase, because the cluster centers are now closer together. This is why it is impractical to choose a large value of K.
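One common statistic that balances these two notions is the silhouette width, which compares each observation's average distance to its own cluster (intra-class) against its average distance to the nearest other cluster (inter-class). The cluster package's silhouette() function computes this; the from-scratch sketch below, on made-up data, shows the idea.

```r
# Average silhouette width computed by hand: for each point, a = mean
# distance to its own cluster, b = mean distance to the nearest other
# cluster, and the silhouette is (b - a) / max(a, b).
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  s <- sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    a <- mean(d[i, own & seq_len(nrow(d)) != i])       # within own cluster
    b <- min(tapply(d[i, !own], cluster[!own], mean))  # nearest other cluster
    (b - a) / max(a, b)
  })
  mean(s)
}

# Two tight, well-separated made-up blobs: silhouette should be near 1.
set.seed(7)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.5), ncol = 2))
fit <- kmeans(toy, centers = 2, nstart = 25)
avg_silhouette(toy, fit$cluster)
```

Values near 1 mean high intra-class and low inter-class similarity; values near 0 mean the clusters overlap.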

Recall our clustering for the Dallas Mavericks players.

dallas2fviz
dallas3fviz +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank())
dallas4fviz +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank())

Exercise 7

  1. Which value of K has the highest intra-class similarity?

  2. Which cluster specifically?

  3. Which value of K has the highest inter-class similarity?

Part 4: A Larger Dataset

Let’s focus now on our larger data set with many more variables and observations. It seems like this would be more complicated, but the process is almost exactly the same. One important distinction to remember is that the large number of dimensions makes the data difficult to visualize; there are methods that aid in this visualization. We’ll walk you through the usage data set and demonstrate appropriate analysis, and then allow you to work through the role data set.

Remember the usage data set? It contains variables aimed at categorizing the workload and skill of the players. We hope to divide players into sub-groups like stars and bench players.

It is very important that we standardize the data first. Many of our variables have different units: Games Played and Blocks per game are hard to compare without scaling, and without standardizing, variables with large values, like Games Started or Games Played, would exert too much influence on the clustering. After standardizing, each value is described relative to the other observations. Trae Young’s standardized assist value is 3.656, so we know he records far more assists than the average player in our data set. Standardized data is often difficult to contextualize, so we’ll convert it back to the original scale for analysis. Below is a small glimpse into what our standardized data looks like.

Name               POS  Team  GP      GS      MIN    PTS    AST     TO      STL     OR      DR      BLK     PF      FGM    FGA    3PM     3PA     FTM     FTA     PER     SC-EFF  SH-EFF
Trae Young         PG   atl    1.212   1.690  1.478  2.819   3.656   3.186   0.346  -0.422  -0.191  -0.961  -0.433  2.418  2.446   2.049   1.883   3.457   2.948   2.481   0.910   0.085
John Collins       PF   atl   -0.221   0.820  0.903  0.802  -0.386  -0.291  -0.479   0.857   1.514   1.296   1.704  0.984  0.623  -0.071  -0.116   0.540   0.502   0.924   0.676   0.797
Bogdan Bogdanovic  SG   atl    0.365  -0.163  0.692  0.620   0.279  -0.291   0.896  -0.678   0.036  -0.710   0.224  0.568  0.775   1.603   1.585  -0.172  -0.255   0.170  -0.389   0.085
De'Andre Hunter    SF   atl   -0.286   0.782  0.762  0.339  -0.642  -0.052  -0.204  -0.678  -0.362  -0.209   1.539  0.290  0.385   0.152   0.054   0.469   0.502  -0.948  -0.149  -0.449
Kevin Huerter      SG   atl    1.082   1.085  0.734  0.124   0.074  -0.172  -0.204  -0.806  -0.248  -0.209   0.882  0.244  0.276   1.045   0.862  -0.812  -0.895  -0.658  -0.532   0.441
Let’s begin by taking a look at the Elbow plot of the usage dataset.

usage_rm <- usage %>%
  select(-Name, -POS, -Team) %>%
  mutate(across(where(is.numeric), standardize))

# the same standardized data, with player names kept as rownames for kmeans
usageKMeans_prep <- usage %>%
  select(-POS, -Team) %>%
  mutate(across(where(is.numeric), standardize)) %>%
  column_to_rownames(var = "Name")

fviz_nbclust(usage_rm, kmeans, method = "wss", k.max = 24) +
  theme_minimal() +
  labs(title = "The Elbow Method")

The Elbow plot shows that the algorithm experiences diminishing returns after K = 2 or K = 3. From the Elbow plot, we would expect the consensus to lie somewhere between 2 and 5 clusters. Now consider the multiple methods for selecting K.

The tests favor three clusters. Some tests also prefer two and four clusters, so those models are worth a look.

set.seed(121)
usage2Means <- kmeans(usageKMeans_prep, centers = 2, nstart = 50)
set.seed(4)
usage3Means <- kmeans(usageKMeans_prep, centers = 3, nstart = 50)
set.seed(1210)
usage4Means <- kmeans(usageKMeans_prep, centers = 4, nstart = 50)

K = 2 Clusters

Let’s start simple and begin with K = 2 clusters.

But before we begin, let’s first look through the variables in our analysis and see which ones have the most influence on the clustering. If some have little or no influence, we can simplify our analysis by removing them.

The visualization below demonstrates the differences between our two clusters. The variables that have large differences are important in the clustering assignment. They greatly influence the assignment of an observation.

as_tibble(usage2Means$centers, rownames = "cluster") %>%
  pivot_longer(cols = c(GP:`SH-EFF`), names_to = "variable") %>%
  group_by(variable) %>%
  summarise(Influence = abs(mean(value))) %>%
  mutate(
    variable = factor(variable, levels = usage_levels)
  ) %>% 
  ggplot(aes(x = variable, y = Influence)) +
  geom_bar(stat = "identity", fill = "cadetblue3") +
  labs(title = "Influence on Cluster Assignment", x = "", y = "") +
  theme(axis.text.y = element_blank(),
        legend.position = "none",
        axis.text.x = element_text(angle = -45, size = 9))

This type of exercise is essential in cluster analysis, because it shows which variables matter most when classifying an observation.

This visualization scales the centers of the variables for each cluster and contrasts them. Variables with large positive or negative values have a large influence on the clustering. These variables help differentiate the cluster. Variables with an influence close to 0 have less importance.

We see a great diversity in the variables that possess significant influence on the clustering.

Exercise 8

  1. Which variables seem to contribute the most to the clustering result?

  2. Which variables contribute the least to the clustering result?

Scoring Efficiency and Shooting Efficiency both lack influence, and Games Played, Offensive Rebounds, and Blocks also contribute little to the clustering. We chose to remove only Shooting Efficiency and keep the other four, though we could just as easily have removed all five.


set.seed(121)
usage2Means <- usageKMeans_prep %>%
  select(-`SH-EFF`) %>%
  kmeans(centers = 2, nstart = 50)

usage2 <- usage %>% select(-`SH-EFF`)
usage_rm2 <- usage_rm %>% select(-`SH-EFF`)

Now that we’ve removed a variable, let’s see how many observations are within each cluster.

Cluster  Size
1         119
2         255

The clusters differ noticeably in size, enough that we should keep an eye on it. It’s important to verify that each cluster contains a meaningful number of observations, although, as we saw with Dwight Powell earlier, small clusters can sometimes tell us valuable information about the observations they contain.

The K-Means algorithm assigns each observation to a cluster, and we can then compute descriptive statistics for each cluster. This gives us a good idea of what makes up each cluster. Here we have converted the values back to the original, unstandardized units.

usage2centers <- as_tibble(usage2Means$cluster) %>%
  mutate(Name = usage$Name) %>%
  rename(Clusters = value) %>%
  left_join(usage2, by = "Name") %>%
  group_by(Clusters) %>%
  summarise(across(where(is.numeric), mean)) %>%
  mutate(across(where(is.numeric), \(x) round(x, digits = 3)))
usage2centers %>% flextable() %>% align(align = "center", part = "all") %>%
  width(j = c(2:15), width = .5)

Clusters  GP      GS      MIN     PTS     AST    TO     STL    OR     DR     BLK    PF     FGM    FGA     3PM    3PA    FTM    FTA    PER     SC-EFF
1         60.689  56.706  32.572  18.656  4.417  2.266  1.035  1.141  4.845  0.582  2.337  6.773  14.592  1.926  5.386  3.178  3.969  17.824  1.281
2         55.847  19.455  20.540   7.937  1.685  0.913  0.652  0.978  2.779  0.437  1.789  2.958   6.429  0.954  2.728  1.071  1.429  13.324  1.244

Generally, it looks like cluster 1 contains starter-caliber players and cluster 2 contains the bench players. This helps to explain why cluster 1 is a bit smaller than cluster 2.

Now, let’s look at the clusters graphically. This can help us see how different the clusters really are from each other. The graph is created by projecting the values of all the variables onto two dimensions in a visually understandable way, through a process called Principal Component Analysis (PCA).
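fviz_cluster() obtains its two plotting axes by projecting the data onto the first two principal components. A base R sketch of that projection, using prcomp() on made-up data, looks like this:

```r
# 40 made-up observations with 5 variables, standing in for the real data.
set.seed(11)
toy <- matrix(rnorm(200), ncol = 5)

# PCA: rotate the data so the first axes capture the most variance.
pca <- prcomp(toy, scale. = TRUE)

# The first two principal component scores are the 2-D plot coordinates.
scores <- pca$x[, 1:2]

# Proportion of total variance explained by each of the first two components.
summary(pca)$importance[2, 1:2]
```

The first two components capture the largest share of the variance, which is why a two-dimensional PCA plot is a reasonable summary of a clustering performed on many variables.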

usage2fviz <- fviz_cluster(usage2Means, select(usageKMeans_prep, -`SH-EFF`),
                           geom = "point",
                           show.clust.cent = TRUE, stand = FALSE,
                           pointsize = 1,
                           main = "Usage K = 2 Clusters")
usage2fviz

Many of the observations in both clusters lie close to the border. This indicates that the division between the clusters was close, and some observations could plausibly have been placed in either cluster. The centers are fairly close together, located at roughly (-3, 0) and (2, 0).

There are several large outliers, especially in the lower portion of the visualization in both clusters and in the left portion of cluster 1.

Prototypes

To help us understand the clusters better, let’s look at some players that fall very close to the cluster center. We’ll call the players that represent the cluster well prototype players.

usage2Means_scale <- as_tibble(usage2Means$centers) %>%
  mutate(cluster = 1:2)


usage_fitted2Means <- usage2Means$cluster %>%
  as_tibble() %>%
  rename(cluster = value) %>%
  left_join(usage2Means_scale, by = "cluster") %>%
  select(-cluster)
distances <- sqrt(rowSums((usage_rm2 - usage_fitted2Means)^ 2)) %>%
  as_tibble() %>%
  rename(distance = value) %>% 
  mutate(
    Name = usage$Name,
    Cluster = usage2Means$cluster
  )
dist_slice1 <- distances %>%
  arrange(distance) %>%
  select(Name, Cluster, distance) %>%
  filter(Cluster == 1) %>% slice(1:3)

dist_slice1 %>%
  mutate(distance = round(distance, digits = 4)) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3)

Name             Cluster  distance
Khris Middleton  1        1.8408
Miles Bridges    1        1.9950
Gordon Hayward   1        2.2011

Exercise 9

  1. Which player is closest to the center for Cluster 1?

  2. Are there other players who are close to the center for Cluster 1 that could also be considered prototypes?

  3. Look at the prototype players’ statistics to see how we might characterize Cluster 1.


prototype_k2c1 <- dist_slice1 %>%
  select(Name) %>%
  left_join(usage2, by = "Name")
prototype_k2c1 %>% flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:15), width = .5)

| Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Khris Middleton | SF | mil | 66 | 66 | 32.4 | 20.1 | 5.4 | 2.9 | 1.2 | 0.6 | 4.8 | 0.3 | 2.4 | 6.8 | 15.5 | 2.5 | 6.6 | 3.9 | 4.4 | 18.19 | 1.298 |
| Miles Bridges | SF | cha | 80 | 80 | 35.5 | 20.2 | 3.8 | 1.9 | 0.9 | 1.1 | 5.9 | 0.8 | 2.4 | 7.5 | 15.2 | 1.9 | 5.8 | 3.3 | 4.2 | 17.97 | 1.329 |
| Gordon Hayward | SF | cha | 49 | 48 | 31.9 | 15.9 | 3.6 | 1.7 | 1.0 | 0.8 | 3.8 | 0.4 | 1.7 | 5.8 | 12.6 | 1.8 | 4.5 | 2.6 | 3.0 | 15.11 | 1.261 |

Consider Khris Middleton, Miles Bridges, and Gordon Hayward. All three play a similar position, one that allows them to contribute in all areas of the game. Their Games Played totals vary, but they started nearly every game and received a lot of playing time: each averaged over 30 Minutes and roughly 16-20 Points per game. Their Rebound, Assist, Block, and Turnover numbers vary a little, but all are fairly high. They also attempted and made a similar number of shots per game (12.6-15.5 FGA and 5.8-7.5 FGM).

Let’s move on to cluster 2. First, notice how much smaller the distances are from the cluster 2 center. More observations lie close to cluster 2’s center than to cluster 1’s. This is not entirely surprising, as cluster 2 contains almost 100 more players than cluster 1.

Again consider potential prototypes for the second cluster.

dist_slice2 <- distances %>% arrange(distance) %>% select(Name, Cluster, distance) %>% filter(Cluster == 2) %>% slice(1:3)

dist_slice2 %>%
  mutate(distance = round(distance, digits = 4)) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3)

| Name | Cluster | distance |
|---|---|---|
| Blake Griffin | 2 | 1.1953 |
| Torrey Craig | 2 | 1.2661 |
| Rudy Gay | 2 | 1.2779 |

Blake Griffin is our prototype player for cluster 2. Torrey Craig and Rudy Gay are also strong representatives of cluster 2.

prototype_k2c2 <- dist_slice2 %>% select(Name) %>% left_join(usage2)
Joining with `by = join_by(Name)`
prototype_k2c2 %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:15), width = .5)

| Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Blake Griffin | PF | bkn | 56 | 24 | 17.1 | 6.4 | 1.9 | 0.6 | 0.5 | 1.1 | 3.0 | 0.3 | 1.7 | 2.4 | 5.6 | 0.7 | 2.6 | 1.0 | 1.4 | 13.77 | 1.147 |
| Torrey Craig | SF | ind | 51 | 14 | 20.3 | 6.5 | 1.1 | 0.8 | 0.5 | 1.2 | 2.7 | 0.4 | 1.9 | 2.5 | 5.5 | 0.9 | 2.7 | 0.5 | 0.7 | 10.82 | 1.171 |
| Rudy Gay | SF | utah | 55 | 1 | 18.9 | 8.1 | 1.0 | 0.9 | 0.5 | 1.0 | 3.4 | 0.3 | 1.7 | 2.9 | 6.9 | 1.3 | 3.7 | 1.1 | 1.4 | 13.06 | 1.177 |

Once again, the prototypes look like average NBA players. They each played around 55 Games but started comparatively few of them. They averaged about 17-20 Minutes and 6.4-8.1 Points per game. Their Rebound, Assist, Steal, Block, Turnover, and Foul values are fairly low and close together. They also don’t shoot as much as cluster 1: only about 6 Field Goal Attempts per game.

Outliers

Now, let’s look through some of the players that fall farthest from the center of their cluster. These players are cluster outliers. In these cases, the clustering least represents the observation. These players are very different from the center. It can be helpful to identify and explain outliers by comparing them to our prototype players. How do they differ? What attributes led to their classification?


dist_slice3 <- distances %>% arrange(desc(distance)) %>% select(Name, Cluster, distance) %>% filter(Cluster == 1) %>% slice(1:2,5)

dist_slice3 %>%
  mutate(distance = round(distance, digits = 4)) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3)

| Name | Cluster | distance |
|---|---|---|
| Rudy Gobert | 1 | 9.1597 |
| Joel Embiid | 1 | 8.8545 |
| Myles Turner | 1 | 6.6678 |

outlier_k2c1 <- dist_slice3 %>% select(Name) %>%
  add_row(Name = "Khris Middleton") %>% add_row(Name = "Blake Griffin") %>% # want to add centers of clusters for reference
  left_join(usage2) %>% arrange(desc(DR))
Joining with `by = join_by(Name)`
outlier_k2c1 %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:15), width = .5)

| Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rudy Gobert | C | utah | 66 | 66 | 32.1 | 15.6 | 1.1 | 1.8 | 0.7 | 3.7 | 11.0 | 2.1 | 2.7 | 5.5 | 7.7 | 0.0 | 0.1 | 4.6 | 6.7 | 24.76 | 2.022 |
| Joel Embiid | C | phi | 68 | 68 | 33.8 | 30.6 | 4.2 | 3.1 | 1.1 | 2.1 | 9.6 | 1.5 | 2.7 | 9.8 | 19.6 | 1.4 | 3.7 | 9.6 | 11.8 | 31.24 | 1.558 |
| Myles Turner | C | ind | 42 | 42 | 29.4 | 12.9 | 1.0 | 1.3 | 0.7 | 1.5 | 5.5 | 2.8 | 2.8 | 4.8 | 9.4 | 1.5 | 4.4 | 1.9 | 2.5 | 17.45 | 1.374 |
| Khris Middleton | SF | mil | 66 | 66 | 32.4 | 20.1 | 5.4 | 2.9 | 1.2 | 0.6 | 4.8 | 0.3 | 2.4 | 6.8 | 15.5 | 2.5 | 6.6 | 3.9 | 4.4 | 18.19 | 1.298 |
| Blake Griffin | PF | bkn | 56 | 24 | 17.1 | 6.4 | 1.9 | 0.6 | 0.5 | 1.1 | 3.0 | 0.3 | 1.7 | 2.4 | 5.6 | 0.7 | 2.6 | 1.0 | 1.4 | 13.77 | 1.147 |

Sometimes, you’ll need to do some digging on the outliers. We chose to show you Khris Middleton and Blake Griffin’s characteristics again for comparison. Rudy Gobert, Joel Embiid, and Myles Turner represent two very different kinds of outliers. Embiid is a superstar who finished second in MVP voting in the 2021-2022 season, and Gobert is a three-time Defensive Player of the Year who dominates the glass. Both are very far from the prototype of cluster 1, but they are even further from the prototype of cluster 2. These are the points near (-10, -5) in the visualization.

Myles Turner, however, possesses attributes of both cluster 1 and cluster 2. He played lots of Minutes, started most games, and posted strong Rebounding values. However, his shooting numbers fall right between the clusters, and he doesn’t tally many Points, Assists, Steals, or Turnovers. His point is likely the (-5, -9) outlier in the visualization. He is a borderline case.

dist_slice4 <- distances %>% arrange(desc(distance)) %>% select(Name, Cluster, distance) %>% filter(Cluster == 2) %>% slice(1:3)

dist_slice4 %>%
  mutate(distance = round(distance, digits = 4)) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3)

| Name | Cluster | distance |
|---|---|---|
| Robert Williams III | 2 | 7.4501 |
| Mitchell Robinson | 2 | 7.3269 |
| Clint Capela | 2 | 6.4898 |

outlier_k2c2 <- dist_slice4 %>% select(Name) %>% left_join(usage2)
Joining with `by = join_by(Name)`
outlier_k2c2 %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:15), width = .5)

| Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Robert Williams III | C | bos | 61 | 61 | 29.6 | 10.0 | 2.0 | 1.0 | 0.9 | 3.9 | 5.7 | 2.2 | 2.2 | 4.4 | 6.0 | 0 | 0 | 1.1 | 1.5 | 22.10 | 1.649 |
| Mitchell Robinson | C | ny | 72 | 62 | 25.7 | 8.5 | 0.5 | 0.8 | 0.8 | 4.1 | 4.5 | 1.8 | 2.7 | 3.6 | 4.8 | 0 | 0 | 1.2 | 2.5 | 20.78 | 1.778 |
| Clint Capela | C | atl | 74 | 73 | 27.6 | 11.1 | 1.2 | 0.6 | 0.7 | 3.8 | 8.1 | 1.3 | 2.2 | 5.0 | 8.2 | 0 | 0 | 1.1 | 2.3 | 21.43 | 1.358 |

These cluster 2 outliers are all similar players. Robert Williams III, Mitchell Robinson, and Clint Capela are all big men. Like Myles Turner, they play a lot of Games and Minutes and grab lots of Rebounds and Blocks, but they don’t shoot very much. Our data emphasizes shooting, and perhaps this leaves players like these without an appropriate cluster. They are borderline candidates who might benefit from an additional cluster.


Now, let’s analyze the strength of K = 2 clusters. For reference, we’ve repeated the visualization below.

usage2fviz

The two clusters possess strong inter-class differences. For only two clusters, cluster 1 and cluster 2 are fairly distinct. The centers are far apart and demonstrate two different classifications of players. Cluster 1 is clearly a sub-population of starting, high-volume players and cluster 2 is a sub-population of bench players. Still, we’ve analyzed the outliers and found some players that could fall in either cluster. There could be some confusion for players like Robert Williams and Myles Turner. These players seem more similar to each other than most of the players in their own cluster. These outliers fall around (-2, -7). Check the visualizations again to see the cluster of players near there.

The intra-class similarity is fairly low. The clusters are large and contain outliers in every direction. Players like Giannis Antetokounmpo, Khris Middleton, and Myles Turner have little in common, yet they are all grouped into cluster 1. Still, most players in cluster 1 post larger values, and most players in cluster 2 post smaller ones.
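One standard way to put a single number on this intra-class/inter-class trade-off is the average silhouette width from the cluster package (one of the packages loaded at the start of the module). The sketch below uses simulated data rather than the usage data set:

```r
library(cluster)  # provides silhouette()

# simulated data: two clearly separated groups of 20 points
set.seed(2)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 25)

# for each point: (b - a) / max(a, b), where a = mean distance to its own
# cluster and b = mean distance to the nearest other cluster;
# values near 1 indicate tight, well-separated clusters
sil <- silhouette(km$cluster, dist(x))
mean(sil[, "sil_width"])
```

A low average silhouette width on the real usage data would confirm the visual impression of weak intra-class similarity.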

K = 3 Clusters - Interactive

# keep in case of reset
set.seed(4)
usage3Means <- kmeans(usageKMeans_prep, centers = 3, nstart = 50)

Now, let’s look at the consensus tests’ most popular number of clusters: K = 3. Here, we’d like you to produce your own analysis of the results. If you need help, look back at the K = 2 example.

As you progress, fill out this table with descriptors of the three clusters. This will be helpful for you as you try to identify their distinctions.

stu_table <- tibble(
  Cluster = 1:3,
  Description = "")

stu_table %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 2, width = 4)

| Cluster | Description |
|---|---|
| 1 | |
| 2 | |
| 3 | |

Once again, let’s first look through the variables in our analysis and see which ones have the most influence on the clustering.

This visualization plots the centers for each variable in a cluster. At a glance, this helps us to understand the characteristics of each cluster. We can see that cluster 2, for example, has high offensive rebounds and blocks per game, but low 3 point attempts and 3 point makes.

It can also tell us which variables are unimportant. If a variable has a similar mean across all three clusters, it does not help us distinguish between the clusters. If a variable has a large positive value in one cluster and a large negative value in another, then that variable is very useful for classifying our data.
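That rule of thumb can also be checked numerically. This is a sketch on simulated data (on the real data, `usage3Means$centers` would play the same role): the range of a variable’s standardized centers across clusters measures how strongly it separates them.

```r
# simulated standardized data: 'a' separates the groups, 'b' is pure noise
set.seed(3)
x <- data.frame(a = c(rnorm(15, -2), rnorm(15, 2)),
                b = rnorm(30))
km <- kmeans(scale(x), centers = 2, nstart = 25)

# spread of each variable's center values across clusters;
# a range near 0 marks a variable that barely distinguishes the clusters
influence <- apply(km$centers, 2, function(v) diff(range(v)))
sort(influence)
```

Variables at the bottom of this ranking are candidates for removal, just as Games Played is removed below.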

# creates a dataset of each variable and the standardized center and graphs it
as_tibble(usage3Means$centers, rownames = "cluster") %>%
  pivot_longer(cols = c(GP:`SH-EFF`), names_to = "variable") %>%
  mutate(variable = factor(variable, usage_levels)) %>%
  ggplot(aes(x = variable, y = value, fill = cluster)) +
  geom_bar(stat = "identity") +
  facet_grid(rows = vars(cluster)) +
  theme(axis.text.x=element_text(angle = -45, hjust = 0, size = 10)) +
  scale_y_continuous(position = "right") +
  labs(title = "Influence on the Cluster Assignment", x = "", y = "Cluster") +
  theme(axis.text.y = element_blank(),
        legend.position = "none")

Before you analyze, remember that variables with a strong negative value still have a large influence; the cluster is simply associated with low values of that variable rather than high ones.

What do you notice about the variables? Which kinds of variables possess significant influence? Some variables have a strong influence in one cluster, but a weak influence in another cluster. Why is this?

After analyzing, would you choose to remove any variables from the data?


We chose to remove the Games Played variable, because its influence was close to 0 in all three clusters. All of the other variables had a large effect in some category.

# reproducing K = 3 means without insignificant variables
set.seed(4)
usage3Means <- usageKMeans_prep %>%
  select(-GP) %>%
  kmeans(centers = 3, nstart = 50)

# creating a second usage dataset without those variables so it does not have to be recomputed each time
usage3 <- usage %>% select(-GP)
usage_rm3 <- usage_rm %>% select(-GP)

Now that we’ve removed some variables, let’s see how many observations are within each cluster.

usage3Means$size %>% as_tibble() %>%
  rename(Size = value) %>% 
  mutate(Cluster = 1:n()) %>%
  relocate(Cluster, .before = Size) %>%
  flextable() %>%
  align(align = "center", part = "all")

| Cluster | Size |
|---|---|
| 1 | 102 |
| 2 | 61 |
| 3 | 211 |

What do you notice about the cluster size? What could this tell us about the clusters?

The clusters are not identical in size, but each is large enough that there is no reason for concern.

# un-standardizing and calculating the mean
usage3centers <- as_tibble(usage3Means$cluster) %>%
  mutate(Name = usage$Name) %>%
  rename(Clusters = value) %>% left_join(usage3, by = "Name") %>%
  group_by(Clusters) %>%
  summarise(
    across(where(is.numeric), mean)
  ) %>%
  mutate(across(where(is.numeric), round, digits = 3))

usage3centers %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = c(2:15), width = .5)

| Clusters | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF | SH-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 57.539 | 33.089 | 19.346 | 4.773 | 2.355 | 1.071 | 0.983 | 4.652 | 0.513 | 2.289 | 6.966 | 15.224 | 2.066 | 5.755 | 3.342 | 4.123 | 17.816 | 1.267 | 0.525 |
| 2 | 36.295 | 22.354 | 9.580 | 1.551 | 1.175 | 0.649 | 2.220 | 4.575 | 0.954 | 2.438 | 3.818 | 6.684 | 0.351 | 1.034 | 1.611 | 2.315 | 18.374 | 1.455 | 0.604 |
| 3 | 17.185 | 20.735 | 7.992 | 1.773 | 0.902 | 0.667 | 0.709 | 2.519 | 0.333 | 1.669 | 2.924 | 6.708 | 1.139 | 3.254 | 1.005 | 1.304 | 12.231 | 1.193 | 0.520 |

What do you notice about the cluster means? Without looking any further, how would you describe the three clusters? Jot down some notes in your table.

Now, let’s look at the clusters graphically.

usage3fviz <- fviz_cluster(usage3Means, usageKMeans_prep,
                           geom = "point",
                           show.clust.cent = TRUE, stand = FALSE,
                           pointsize = 1,
                           main = "Usage K = 3 Clusters")
usage3fviz

What do you notice about the visualization? Are there a lot of observations that reside on the border? Where are the centers and outliers of each cluster?

Compare the new visualization with the K = 2 visualization. Where did the third cluster come from? What kinds of players?

If you were to create a fourth cluster, what points would you group together?

Let’s look at our prototype and outlier players. We’ve compiled them all into a table for you to compare and contrast.

# standardizing the distances between the players
usage3Means_scale <- as_tibble(usage3Means$centers) %>%
  mutate(cluster = 1:3)

# creating appropriate tibble for distance formula
usage_fitted3Means <- usage3Means$cluster %>%
  as_tibble() %>%
  rename(cluster = value) %>% left_join(usage3Means_scale) %>% select(-cluster)
Joining with `by = join_by(cluster)`
# distance from cluster center
distances <- sqrt(rowSums((usage_rm3 - usage_fitted3Means)^ 2)) %>%
  as_tibble() %>%
  rename(distance = value) %>% 
  mutate(
    Name = usage$Name,
    Cluster = usage3Means$cluster)

# creating a master document with all of the prototypes and all of the outliers.
master_distances <- distances %>%
  group_by(Cluster) %>%
  mutate(
    outlier_rank = order(order(distance, decreasing=TRUE)),
    proto_rank = order(order(distance, decreasing = FALSE))) %>%
  filter(outlier_rank < 4 | proto_rank < 4) %>%
  mutate(
    Category = if_else(proto_rank < 4, "Prototype", "Outlier")
  ) %>% 
  select(Name, Cluster, Category) %>%
  left_join(usage3) %>% arrange(Cluster, desc(Category))
Joining with `by = join_by(Name)`
master_distances %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = 2, width = .8) %>%
  width(j = c(4:15), width = .5)

| Name | Cluster | Category | POS | Team | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF | SH-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Miles Bridges | 1 | Prototype | SF | cha | 80 | 35.5 | 20.2 | 3.8 | 1.9 | 0.9 | 1.1 | 5.9 | 0.8 | 2.4 | 7.5 | 15.2 | 1.9 | 5.8 | 3.3 | 4.2 | 17.97 | 1.329 | 0.55 |
| Malcolm Brogdon | 1 | Prototype | PG | ind | 36 | 33.5 | 19.1 | 5.9 | 2.1 | 0.8 | 0.9 | 4.2 | 0.4 | 2.0 | 6.8 | 15.1 | 1.6 | 5.2 | 4.0 | 4.6 | 18.10 | 1.265 | 0.50 |
| Khris Middleton | 1 | Prototype | SF | mil | 66 | 32.4 | 20.1 | 5.4 | 2.9 | 1.2 | 0.6 | 4.8 | 0.3 | 2.4 | 6.8 | 15.5 | 2.5 | 6.6 | 3.9 | 4.4 | 18.19 | 1.298 | 0.52 |
| Nikola Jokic | 1 | Outlier | C | den | 74 | 33.5 | 27.1 | 7.9 | 3.8 | 1.5 | 2.8 | 11.0 | 0.9 | 2.6 | 10.3 | 17.7 | 1.3 | 3.9 | 5.1 | 6.3 | 32.94 | 1.529 | 0.62 |
| Giannis Antetokounmpo | 1 | Outlier | PF | mil | 67 | 32.9 | 29.9 | 5.8 | 3.3 | 1.1 | 2.0 | 9.6 | 1.4 | 3.2 | 10.3 | 18.6 | 1.1 | 3.6 | 8.3 | 11.4 | 32.12 | 1.608 | 0.58 |
| Joel Embiid | 1 | Outlier | C | phi | 68 | 33.8 | 30.6 | 4.2 | 3.1 | 1.1 | 2.1 | 9.6 | 1.5 | 2.7 | 9.8 | 19.6 | 1.4 | 3.7 | 9.6 | 11.8 | 31.24 | 1.558 | 0.53 |
| Nic Claxton | 2 | Prototype | PF | bkn | 19 | 20.7 | 8.7 | 0.9 | 0.8 | 0.5 | 1.9 | 3.7 | 1.1 | 2.3 | 3.8 | 5.6 | 0.0 | 0.0 | 1.1 | 2.0 | 18.66 | 1.553 | 0.67 |
| Isaiah Roby | 2 | Prototype | PF | okc | 28 | 21.1 | 10.1 | 1.6 | 1.0 | 0.8 | 1.7 | 3.2 | 0.8 | 2.4 | 3.7 | 7.2 | 1.0 | 2.2 | 1.7 | 2.6 | 18.35 | 1.406 | 0.58 |
| Richaun Holmes | 2 | Prototype | C | sac | 37 | 23.9 | 10.4 | 1.1 | 1.2 | 0.4 | 2.1 | 5.0 | 0.9 | 2.8 | 4.4 | 6.7 | 0.0 | 0.1 | 1.6 | 2.0 | 17.80 | 1.560 | 0.66 |
| Robert Williams III | 2 | Outlier | C | bos | 61 | 29.6 | 10.0 | 2.0 | 1.0 | 0.9 | 3.9 | 5.7 | 2.2 | 2.2 | 4.4 | 6.0 | 0.0 | 0.0 | 1.1 | 1.5 | 22.10 | 1.649 | 0.74 |
| Myles Turner | 2 | Outlier | C | ind | 42 | 29.4 | 12.9 | 1.0 | 1.3 | 0.7 | 1.5 | 5.5 | 2.8 | 2.8 | 4.8 | 9.4 | 1.5 | 4.4 | 1.9 | 2.5 | 17.45 | 1.374 | 0.59 |
| Rudy Gobert | 2 | Outlier | C | utah | 66 | 32.1 | 15.6 | 1.1 | 1.8 | 0.7 | 3.7 | 11.0 | 2.1 | 2.7 | 5.5 | 7.7 | 0.0 | 0.1 | 4.6 | 6.7 | 24.76 | 2.022 | 0.71 |
| Damion Lee | 3 | Prototype | SG | gs | 5 | 20.0 | 7.4 | 1.0 | 0.6 | 0.6 | 0.4 | 2.8 | 0.1 | 1.5 | 2.7 | 6.1 | 1.0 | 3.0 | 1.0 | 1.2 | 10.90 | 1.219 | 0.52 |
| Ziaire Williams | 3 | Prototype | SG | mem | 31 | 21.7 | 8.1 | 1.0 | 0.7 | 0.6 | 0.4 | 1.7 | 0.2 | 1.8 | 3.1 | 6.8 | 1.2 | 3.9 | 0.7 | 0.9 | 9.70 | 1.182 | 0.54 |
| Rudy Gay | 3 | Prototype | SF | utah | 1 | 18.9 | 8.1 | 1.0 | 0.9 | 0.5 | 1.0 | 3.4 | 0.3 | 1.7 | 2.9 | 6.9 | 1.3 | 3.7 | 1.1 | 1.4 | 13.06 | 1.177 | 0.51 |
| Tomas Satoransky | 3 | Outlier | SG | no | 3 | 15.0 | 2.8 | 2.4 | 0.7 | 0.4 | 0.6 | 1.4 | 0.0 | 1.0 | 1.0 | 3.3 | 0.2 | 1.0 | 0.6 | 0.8 | 6.51 | 0.822 | 0.32 |
| Robert Covington | 3 | Outlier | PF | por | 40 | 29.8 | 7.6 | 1.4 | 1.2 | 1.5 | 0.9 | 4.9 | 1.3 | 2.8 | 2.7 | 7.0 | 1.6 | 4.8 | 0.6 | 0.8 | 9.98 | 1.086 | 0.50 |
| Buddy Hield | 3 | Outlier | SG | sac | 6 | 28.6 | 14.4 | 1.9 | 1.6 | 0.9 | 0.8 | 3.2 | 0.3 | 2.1 | 4.8 | 12.6 | 3.3 | 9.0 | 1.5 | 1.7 | 11.96 | 1.143 | 0.51 |

Here is a smaller table that may help you compare the players more easily.

master_distances1 <- distances %>%
  group_by(Cluster) %>%
  mutate(
    outlier_rank = order(order(distance, decreasing=TRUE)),
    proto_rank = order(order(distance, decreasing = FALSE))) %>%
  filter(outlier_rank < 2 | proto_rank < 2) %>%
  mutate(
    Category = if_else(proto_rank < 2, "Prototype", "Outlier")
  ) %>% 
  select(Name, Cluster, Category) %>%
  left_join(usage3) %>% arrange(desc(Category), Cluster)
Joining with `by = join_by(Name)`
master_distances1 %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = 2, width = .8) %>%
  width(j = c(4:15), width = .5)

| Name | Cluster | Category | POS | Team | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF | SH-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Khris Middleton | 1 | Prototype | SF | mil | 66 | 32.4 | 20.1 | 5.4 | 2.9 | 1.2 | 0.6 | 4.8 | 0.3 | 2.4 | 6.8 | 15.5 | 2.5 | 6.6 | 3.9 | 4.4 | 18.19 | 1.298 | 0.52 |
| Isaiah Roby | 2 | Prototype | PF | okc | 28 | 21.1 | 10.1 | 1.6 | 1.0 | 0.8 | 1.7 | 3.2 | 0.8 | 2.4 | 3.7 | 7.2 | 1.0 | 2.2 | 1.7 | 2.6 | 18.35 | 1.406 | 0.58 |
| Damion Lee | 3 | Prototype | SG | gs | 5 | 20.0 | 7.4 | 1.0 | 0.6 | 0.6 | 0.4 | 2.8 | 0.1 | 1.5 | 2.7 | 6.1 | 1.0 | 3.0 | 1.0 | 1.2 | 10.90 | 1.219 | 0.52 |
| Joel Embiid | 1 | Outlier | C | phi | 68 | 33.8 | 30.6 | 4.2 | 3.1 | 1.1 | 2.1 | 9.6 | 1.5 | 2.7 | 9.8 | 19.6 | 1.4 | 3.7 | 9.6 | 11.8 | 31.24 | 1.558 | 0.53 |
| Rudy Gobert | 2 | Outlier | C | utah | 66 | 32.1 | 15.6 | 1.1 | 1.8 | 0.7 | 3.7 | 11.0 | 2.1 | 2.7 | 5.5 | 7.7 | 0.0 | 0.1 | 4.6 | 6.7 | 24.76 | 2.022 | 0.71 |
| Tomas Satoransky | 3 | Outlier | SG | no | 3 | 15.0 | 2.8 | 2.4 | 0.7 | 0.4 | 0.6 | 1.4 | 0.0 | 1.0 | 1.0 | 3.3 | 0.2 | 1.0 | 0.6 | 0.8 | 6.51 | 0.822 | 0.32 |

Use the above tables to summarize each of the 6 categories. What kind of players belong in each category? Is there a lot of variation within the prototypes? Is there a lot of variation within the outliers? Which of the outliers are closest to a different cluster? Would you reclassify any of the outliers?

After looking through the clusters, why do you think cluster 2 is so much smaller?

Let’s analyze the overall strength of K = 3 clusters. How does the intra-class similarity compare with K = 2? The inter-class similarity?

# usage3fviz

Comparing K = 2 to K = 3 - Mix

Often, it is interesting to compare the cluster results. Here, we tabulated the cluster assignments between K = 2 and K = 3. This can help us to see how the clustering with K = 2 overlaps with K = 3.

# creating a tibble of the cluster of each player for each K
clusters <- tibble(
  player = usage$Name,
  Cluster = usage2Means$cluster,
  clus3 = usage3Means$cluster,
  clus4 = usage4Means$cluster
)

# tabulating K = 2 and K = 3 clusters
compare_K2K3 <- with(clusters, table(Cluster, clus3)) %>%
  as_tibble() %>%
  pivot_wider(names_from = clus3, values_from = n)

# printing table using flextable
compare_K2K3 %>%
  flextable() %>%
  align(align = "center", part = "all")

| K = 2 Cluster | K = 3 Cluster 1 | K = 3 Cluster 2 | K = 3 Cluster 3 |
|---|---|---|---|
| 1 | 102 | 13 | 4 |
| 2 | 0 | 48 | 207 |

What do you notice about the clustering distribution?

We can see that most players in cluster 1 from K = 2 stayed in cluster 1 when K = 3. We identified both of these clusters as the “starters,” so this makes a lot of intuitive sense. Most of cluster 2 from K = 2 moved into cluster 3 when K = 3. The interesting transition comes with the middle cluster of K = 3. This cluster is full of big men that don’t score a lot. They came from both cluster 1 and cluster 2 of K = 2. We saw this in our outlier analysis earlier.

Exercise 10

What are the benefits and costs of both K = 2 and K = 3? Which would you choose?

Part 6: Role Data Set

Now we move on to a second data set and we want to give you a lot more autonomy to test different clusters or outliers yourself. The data set is different, but the process is almost exactly the same. If you have questions, we’ll give you hints or you can look back to the usage data set for a clear example.

Remember the role data set? It contains variables aimed at categorizing the function and specific characteristics of the players. We hope to divide players into sub-groups like scorers, 3-point shooters, and rebounders.

Even though most of our data has already been adjusted to “per minute” quantities, it is still very important that we standardize the data first. Otherwise, common values like points per minute will outweigh the effect of less common characteristics like blocks per minute. After standardizing, each variable is on the same scale. The standardized data is often difficult to contextualize, so we’ll convert the data back for analysis. Below is a small glimpse of what our standardized data looks like.
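To see why standardizing matters, here is a minimal base-R sketch with made-up numbers for three hypothetical players: without standardizing, the points column dominates the Euclidean distance, and a large difference in blocks barely registers.

```r
# three hypothetical players: points and blocks per game (made-up values)
m <- rbind(p1 = c(pts = 25, blk = 0.2),
           p2 = c(pts = 24, blk = 2.0),   # elite shot-blocker
           p3 = c(pts = 20, blk = 0.2))

# raw distances: the blk difference of 1.8 is dwarfed by pts differences
dist(m)

# after standardizing, each column has mean 0 and sd 1,
# so blocks and points contribute on the same scale
dist(scale(m))
```

The `standardize()` step in the chunk below plays the same role as `scale()` here.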


# initializing our datasets a second time in case the student decides to remove a variable.
# Note: rounding these values to 3 digits changes the elbow plot's suggested K,
# so we keep 4 digits for consistency with the K = 7 analysis below.
role <- nba %>%
  select(Name, POS, Team, Height, Weight, PTSPerMin, ASTPerMin, TOPerMin, STLPerMin, ORPerMin, DRPerMin, BLKPerMin, PFPerMin, FGP, FGMPerMin, FGAPerMin, `3PP`, `3PMPerMin`, `3PAPerMin`, FTP, FTMPerMin, FTAPerMin)

# standardizing the data for KMeans
roleKMeans_prep <- role %>%
  mutate(across(where(is.numeric), standardize))

# displaying the standardized data for student
roleKMeans_prep %>%
  slice(1:5) %>%
  mutate(across(where(is.numeric), round, digits = 3)) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:5), width = .6) %>%
  width(j = c(6:12), width = .95)

| Name | POS | Team | Height | Weight | PTSPerMin | ASTPerMin | TOPerMin | STLPerMin | ORPerMin | DRPerMin | BLKPerMin | PFPerMin | FGP | FGMPerMin | FGAPerMin | 3PP | 3PMPerMin | 3PAPerMin | FTP | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Trae Young | PG | atl | -1.655 | -1.496 | 2.824 | 3.171 | 2.751 | -0.499 | -0.698 | -0.944 | -1.083 | -1.288 | -0.085 | 2.237 | 2.300 | 0.525 | 1.354 | 1.164 | 1.376 | 3.112 | 2.499 |
| John Collins | PF | atl | 0.842 | 0.774 | 0.620 | -0.708 | -0.773 | -1.019 | 0.269 | 1.024 | 0.734 | 0.474 | 0.820 | 0.857 | 0.313 | 0.357 | -0.356 | -0.443 | 0.259 | 0.360 | 0.275 |
| Bogdan Bogdanovic | SG | atl | -0.094 | 0.155 | 0.539 | 0.129 | -0.692 | 0.469 | -0.780 | -0.392 | -0.840 | -0.457 | -0.483 | 0.426 | 0.757 | 0.394 | 1.468 | 1.427 | 0.762 | -0.405 | -0.529 |
| De'Andre Hunter | SF | atl | 0.530 | 0.362 | 0.036 | -0.970 | -0.420 | -0.689 | -0.788 | -0.852 | -0.435 | 0.471 | -0.332 | -0.069 | 0.069 | 0.497 | -0.081 | -0.219 | -0.022 | 0.343 | 0.344 |
| Kevin Huerter | SG | atl | 0.218 | -1.083 | -0.277 | -0.129 | -0.558 | -0.676 | -0.877 | -0.719 | -0.429 | 0.005 | -0.168 | -0.118 | -0.078 | 0.590 | 0.857 | 0.637 | 0.410 | -1.193 | -1.304 |

# finishing prepping data for KMeans procedure
roleKMeans_prep <- roleKMeans_prep %>%
  column_to_rownames(var = "Name") %>%
  select(-Team, -POS)

Let’s check our Elbow plot to get an idea of the clustering.

# removing text for visualizations and standardizing
role_rm <- role %>%
  select(-Name, -POS, -Team) %>%
  mutate(across(where(is.numeric), standardize))

fviz_nbclust(role_rm, kmeans, method = "wss", k.max = 24) +
  theme_minimal() +
  labs(title = "The Elbow Method")

Exercise 11

a) What do you see from the Elbow plot? At what point do the returns diminish?

b) How many clusters does the Elbow plot suggest?

# creates consensus clusters
roleClust <- n_clusters(role_rm,
                        package = c("easystats", "NbClust"),
                        standardize = FALSE, n_max = 10)
plot(roleClust) +
  labs(title = "Optimal Number of Clusters", x = "")

There’s a lot of variation in the preferred number of clusters. How many clusters, and how many different values of K, would you like to analyze? This is totally up to you. Feel free to move back and forth through this section and explore the data as much as you like.

Exercise 12

We will be using K = 7 for the trade scenario portion, so we recommend you review through K = 7.


# assume that they want K = 7.
stu_cluster <- 7

Ok, you’ve chosen K = 7. Here is an empty table for you to describe each of the clusters. As you grow in understanding of each of the clusters, fill it out with a few distinguishing words. Make sure you can glance at the table and understand what separates one cluster from another.

stu_role_table <- tibble(
  Cluster = 1:stu_cluster,
  Description = "")

stu_role_table %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 2, width = 4)

| Cluster | Description |
|---|---|
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |

We’ll begin by looking at the mean for each variable of a cluster. Remember, this can help us identify variables that are not useful and get a general understanding of the characteristics of each cluster.

There may be a lot of variables, so we flipped the coordinates of the plot to make it easier to read. A bar to the right indicates a positive association and a bar to the left indicates a negative association.

set.seed(100)
roleKMeans <- kmeans(roleKMeans_prep, centers = stu_cluster, nstart = 50)
# creating factor levels for role
role_levels <- colnames(role)

# creates a dataset of each variable and the standardized center and graphs it
as_tibble(roleKMeans$centers, rownames = "cluster") %>%
  pivot_longer(cols = c(Height:FTAPerMin), names_to = "variable") %>%
  mutate(variable = factor(variable, role_levels)) %>%
  ggplot(aes(x = variable, y = value, fill = cluster)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  geom_hline(yintercept = 0) +
  facet_grid(cols = vars(cluster), switch = "both") +
  labs(title = "Influence on the Cluster Assignment", x = "", y = "Cluster") +
  theme(axis.text.x = element_blank(),
        legend.position = "none")

Sift through the variables to see if any are unused throughout the clusters. If so, this indicates that the variable does not help differentiate the data into clusters. You can remove it here:

# if you want to remove a variable, enter its name here, e.g. c("GP");
# character(0) removes nothing
role_var_rm <- character(0)

# reproducing roleKMeans without the removed variables;
# any_of() avoids the tidyselect warning about external vectors
set.seed(100)
roleKMeans <- roleKMeans_prep %>%
  select(-any_of(role_var_rm)) %>%
  kmeans(centers = stu_cluster, nstart = 50)

role <- role %>% select(-any_of(role_var_rm))
role_rm <- role_rm %>% select(-any_of(role_var_rm))

If you chose a large number of clusters, it may be difficult to use this visualization to remove unimportant variables. Instead, you should be able to see some of the important attributes of each of the clusters. Be thinking of identifiers for each cluster. Which variables are important throughout?

Let’s begin to analyze the numeric values of the centers. Look through each cluster’s characteristics. What sticks out to you?

role_summary <- role %>% summarise(
  across(where(is.numeric), mean)) %>%
  mutate(
    Clusters = "Data Average"
  ) %>% relocate(Clusters)

roleKcenters <- as_tibble(roleKMeans$cluster) %>%
  mutate(Name = role$Name) %>%
  rename(Clusters = value) %>% left_join(role, by = "Name") %>%
  group_by(Clusters) %>%
  summarise(
    across(where(is.numeric), mean)
  ) %>%
  mutate(
    Clusters = as.character(Clusters)
  ) %>% bind_rows(role_summary) %>%
  mutate(across(where(is.numeric), round, digits = 3),
         Height = round(Height, digits = 1),
         Weight = round(Weight, digits = 1))

roleKcenters %>%
  reactable(
    defaultColDef = colDef(
      cell = color_tiles(.)))

Which clusters are scorers? Which are rebounders? Which have higher assist numbers? Higher 3-point shooting? Are any two clusters similar? What differentiates them?

At this point, give a short descriptor of each cluster. Each cluster should be uniquely described.

Let’s look at the size of each cluster.

roleKMeans$size %>% as_tibble() %>%
  rename(Size = value) %>% 
  mutate(Cluster = 1:n()) %>%
  relocate(Cluster, .before = Size) %>%
  flextable() %>%
  align(align = "center", part = "all")

| Cluster | Size |
|---|---|
| 1 | 51 |
| 2 | 45 |
| 3 | 96 |
| 4 | 26 |
| 5 | 20 |
| 6 | 38 |
| 7 | 98 |

Does this surprise you? Which clusters are large and small? Does this fit with your perception of the makeup of NBA teams?

Let’s look at the distribution of the players.

rolefviz <- fviz_cluster(roleKMeans, roleKMeans_prep,
                         geom = "point",
                         show.clust.cent = TRUE, stand = FALSE,
                         pointsize = 1,
                         main = "Role K Clusters")
rolefviz

What do you notice from the visualization? Remember, two dimensions cannot represent all of the data, so clusters may appear to overlap. Imagine a third dimension “Z” that explains another 30%-40% of the variance.

Where are the cluster centers and outliers? Which clusters seem to be the closest together? Furthest away? Are any clusters more isolated than others? Is this supported by your previous analysis?

If you had to add another cluster where would it be? If you had to remove a cluster, where would it be?

Let’s look at our prototype and outlier analysis.

First, we need to verify that our prototypes and outliers really behave like prototypes and outliers. Now that we can change the number of clusters, it’s possible that you have some pretty small clusters. With a smaller sample size, we want to ensure that all our prototypes are indeed close to the cluster center and that all our outliers are indeed far away. In our K = 2 usage analysis, our prototypes were about 1 to 2.3 units away from the center, and our outliers were about 6 to 8.5 units away. However, as K increases, the outlier distances should fall. Let’s look at the distances from the center for the top 3 prototypes and outliers in each cluster to see how they compare.

# extracting the cluster centers
roleKMeans_scale <- as_tibble(roleKMeans$centers) %>%
  mutate(cluster = 1:n())

# creating appropriate tibble for distance formula
role_fittedKMeans <- roleKMeans$cluster %>%
  as_tibble() %>%
  rename(cluster = value) %>% left_join(roleKMeans_scale, by = "cluster") %>% select(-cluster)
# distance from cluster center
distances <- sqrt(rowSums((role_rm - role_fittedKMeans)^2)) %>%
  as_tibble() %>%
  rename(distance = value) %>% 
  mutate(
    Name = role$Name,
    Cluster = roleKMeans$cluster)

master_distances <- distances %>%
  group_by(Cluster) %>%
  mutate(
    outlier_rank = order(order(distance, decreasing=TRUE)),
    proto_rank = order(order(distance, decreasing = FALSE))) %>%
  filter(outlier_rank < 4 | proto_rank < 4) %>%
  mutate(
    Category = if_else(proto_rank < 4, "Prototype", "Outlier")
  ) %>%
  arrange(Cluster, distance) %>%
  select(-outlier_rank, -proto_rank) %>%
  relocate(distance, .after = Category) %>%
  relocate(Name, .after = Category)

master_distances %>%
  mutate(distance = round(distance, digits = 4)) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 3, width = 1.3)

| Cluster | Category | Name | distance |
|---------|----------|------|----------|
| 1 | Prototype | Trendon Watford | 1.8523 |
| 1 | Prototype | Isaiah Roby | 1.8794 |
| 1 | Prototype | John Collins | 2.0773 |
| 1 | Outlier | Isaiah Jackson | 5.3644 |
| 1 | Outlier | Tristan Thompson | 6.8757 |
| 1 | Outlier | Jakob Poeltl | 7.1035 |
| 2 | Prototype | Eric Bledsoe | 1.6470 |
| 2 | Prototype | Marcus Smart | 1.6596 |
| 2 | Prototype | Raul Neto | 1.7083 |
| 2 | Outlier | Josh Giddey | 3.7481 |
| 2 | Outlier | Jose Alvarado | 3.8303 |
| 2 | Outlier | Draymond Green | 5.0848 |
| 3 | Prototype | Coby White | 1.3340 |
| 3 | Prototype | Saddiq Bey | 1.4581 |
| 3 | Prototype | Lonnie Walker IV | 1.4612 |
| 3 | Outlier | Mike Muscala | 4.0470 |
| 3 | Outlier | Klay Thompson | 4.1396 |
| 3 | Outlier | Kevin Love | 4.4379 |
| 4 | Prototype | Ivica Zubac | 1.9336 |
| 4 | Prototype | Bismack Biyombo | 1.9389 |
| 4 | Prototype | Nic Claxton | 2.3272 |
| 4 | Outlier | Rudy Gobert | 4.6189 |
| 4 | Outlier | JaVale McGee | 4.6530 |
| 4 | Outlier | Thaddeus Young | 5.0444 |
| 5 | Prototype | Karl-Anthony Towns | 2.0831 |
| 5 | Prototype | Pascal Siakam | 2.4719 |
| 5 | Prototype | Jonas Valanciunas | 2.5886 |
| 5 | Outlier | Giannis Antetokounmpo | 5.5937 |
| 5 | Outlier | Joel Embiid | 5.9766 |
| 5 | Outlier | DeMarcus Cousins | 6.0388 |
| 6 | Prototype | Khris Middleton | 1.5916 |
| 6 | Prototype | Bradley Beal | 1.6040 |
| 6 | Prototype | Jaylen Brown | 1.8622 |
| 6 | Outlier | James Harden | 4.0727 |
| 6 | Outlier | Luka Doncic | 4.0863 |
| 6 | Outlier | Trae Young | 4.1980 |
| 7 | Prototype | Torrey Craig | 1.3072 |
| 7 | Prototype | Torrey Craig1 | 1.6834 |
| 7 | Prototype | CJ Elleby | 1.7221 |
| 7 | Outlier | Xavier Tillman | 4.2227 |
| 7 | Outlier | Thaddeus Young1 | 4.2333 |
| 7 | Outlier | Gary Payton II | 5.0684 |

Which prototypes are the strongest prototypes? Which prototypes do you trust the most? Which are the strongest outliers? Would you disqualify any outliers or prototypes from the analysis (i.e., a supposed outlier is not far enough from the center, or a labeled prototype is too far from the center)?


If you wish to disqualify a player from analysis, do it here:

For illustration, assume the student disqualifies Nic Claxton.

disqualify <- c("Nic Claxton")
roleKMeans$size %>% as_tibble() %>%
  rename(Size = value) %>% 
  mutate(Cluster = 1:n()) %>%
  relocate(Cluster, .before = Size) %>%
  flextable() %>%
  align(align = "center", part = "all")

| Cluster | Size |
|---------|------|
| 1       | 51   |
| 2       | 45   |
| 3       | 96   |
| 4       | 26   |
| 5       | 20   |
| 6       | 38   |
| 7       | 98   |

Look again at the size of each cluster. Does this help explain any of your findings?

These outliers can be very different from each other. We’ll need to look into them to see what kind of players they are. Once again, we’ll show you the top 3 of each category first, and afterward a smaller table with only the top player.

# creating a master document with all of the prototypes and all of the outliers.
mast_dist_slice <- distances %>%
  group_by(Cluster) %>%
  mutate(
    outlier_rank = order(order(distance, decreasing = TRUE)),
    proto_rank = order(order(distance, decreasing = FALSE))) %>%
  filter(outlier_rank < 4 | proto_rank < 4) %>%
  mutate(
    Category = if_else(proto_rank < 4, "Prototype", "Outlier")
  ) %>% 
  select(Name, Cluster, Category) %>%
  left_join(role, by = "Name") %>% arrange(Cluster, desc(Category)) %>%
  filter(!(Name %in% disqualify))
mast_dist_slice %>%
  mutate(across(where(is.numeric), ~round(.x, digits = 3))) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(4:7), width = .6) %>%
  width(j = c(8:14), width = .95)

| Name | Cluster | Category | POS | Team | Height | Weight | PTSPerMin | ASTPerMin | TOPerMin | STLPerMin | ORPerMin | DRPerMin | BLKPerMin | PFPerMin | FGP | FGMPerMin | FGAPerMin | 3PP | 3PMPerMin | 3PAPerMin | FTP | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| John Collins | 1 | Prototype | PF | atl | 81 | 235 | 0.526 | 0.058 | 0.036 | 0.019 | 0.055 | 0.198 | 0.032 | 0.097 | 0.526 | 0.205 | 0.386 | 0.364 | 0.039 | 0.107 | 0.793 | 0.081 | 0.101 |
| Isaiah Roby | 1 | Prototype | PF | okc | 80 | 230 | 0.479 | 0.076 | 0.047 | 0.038 | 0.081 | 0.152 | 0.038 | 0.114 | 0.514 | 0.175 | 0.341 | 0.444 | 0.047 | 0.104 | 0.672 | 0.081 | 0.123 |
| Trendon Watford | 1 | Prototype | PF | por | 81 | 240 | 0.420 | 0.094 | 0.050 | 0.028 | 0.066 | 0.166 | 0.033 | 0.133 | 0.532 | 0.166 | 0.309 | 0.237 | 0.011 | 0.044 | 0.755 | 0.083 | 0.110 |
| Isaiah Jackson | 1 | Outlier | F | ind | 82 | 205 | 0.553 | 0.020 | 0.073 | 0.047 | 0.113 | 0.167 | 0.093 | 0.173 | 0.563 | 0.213 | 0.380 | 0.313 | 0.007 | 0.027 | 0.682 | 0.113 | 0.160 |
| Tristan Thompson | 1 | Outlier | C | sac | 81 | 254 | 0.408 | 0.039 | 0.066 | 0.026 | 0.158 | 0.197 | 0.026 | 0.112 | 0.503 | 0.164 | 0.329 | 1.000 | 0.000 | 0.000 | 0.533 | 0.072 | 0.132 |
| Jakob Poeltl | 1 | Outlier | C | sa | 85 | 245 | 0.466 | 0.097 | 0.055 | 0.024 | 0.134 | 0.190 | 0.059 | 0.107 | 0.618 | 0.207 | 0.338 | 1.000 | 0.000 | 0.000 | 0.495 | 0.048 | 0.097 |
| Marcus Smart | 2 | Prototype | PG | bos | 75 | 220 | 0.375 | 0.183 | 0.068 | 0.053 | 0.019 | 0.099 | 0.009 | 0.071 | 0.418 | 0.130 | 0.313 | 0.331 | 0.053 | 0.158 | 0.793 | 0.062 | 0.077 |
| Eric Bledsoe | 2 | Prototype | SG | lac | 73 | 214 | 0.393 | 0.167 | 0.083 | 0.052 | 0.020 | 0.115 | 0.016 | 0.063 | 0.421 | 0.143 | 0.345 | 0.313 | 0.036 | 0.119 | 0.761 | 0.063 | 0.087 |
| Raul Neto | 2 | Prototype | PG | wsh | 73 | 180 | 0.383 | 0.158 | 0.056 | 0.041 | 0.010 | 0.087 | 0.000 | 0.077 | 0.463 | 0.148 | 0.321 | 0.292 | 0.026 | 0.087 | 0.769 | 0.061 | 0.077 |
| Draymond Green | 2 | Outlier | PF | gs | 78 | 230 | 0.260 | 0.242 | 0.104 | 0.045 | 0.035 | 0.218 | 0.038 | 0.104 | 0.525 | 0.100 | 0.194 | 0.296 | 0.010 | 0.042 | 0.659 | 0.045 | 0.069 |
| Jose Alvarado | 2 | Outlier | PG | no | 72 | 179 | 0.396 | 0.182 | 0.045 | 0.084 | 0.032 | 0.091 | 0.006 | 0.091 | 0.446 | 0.156 | 0.351 | 0.291 | 0.039 | 0.130 | 0.679 | 0.045 | 0.065 |
| Josh Giddey | 2 | Outlier | SG | okc | 80 | 205 | 0.397 | 0.203 | 0.102 | 0.029 | 0.057 | 0.190 | 0.013 | 0.051 | 0.419 | 0.165 | 0.394 | 0.263 | 0.032 | 0.124 | 0.709 | 0.032 | 0.048 |
| Coby White | 3 | Prototype | PG | chi | 77 | 195 | 0.462 | 0.105 | 0.040 | 0.018 | 0.011 | 0.098 | 0.007 | 0.080 | 0.433 | 0.167 | 0.385 | 0.385 | 0.080 | 0.211 | 0.857 | 0.047 | 0.055 |
| Saddiq Bey | 3 | Prototype | SF | det | 79 | 215 | 0.488 | 0.085 | 0.036 | 0.027 | 0.039 | 0.124 | 0.006 | 0.048 | 0.396 | 0.167 | 0.421 | 0.346 | 0.079 | 0.224 | 0.827 | 0.079 | 0.094 |
| Lonnie Walker IV | 3 | Prototype | G | sa | 76 | 204 | 0.526 | 0.096 | 0.043 | 0.026 | 0.013 | 0.100 | 0.013 | 0.061 | 0.407 | 0.191 | 0.474 | 0.314 | 0.070 | 0.217 | 0.784 | 0.074 | 0.091 |
| Kevin Love | 3 | Outlier | PF | cle | 80 | 251 | 0.604 | 0.098 | 0.058 | 0.018 | 0.053 | 0.271 | 0.009 | 0.062 | 0.430 | 0.196 | 0.458 | 0.392 | 0.111 | 0.284 | 0.838 | 0.098 | 0.120 |
| Klay Thompson | 3 | Outlier | SG | gs | 78 | 215 | 0.694 | 0.095 | 0.044 | 0.017 | 0.017 | 0.116 | 0.017 | 0.058 | 0.429 | 0.262 | 0.609 | 0.385 | 0.122 | 0.316 | 0.902 | 0.048 | 0.054 |
| Mike Muscala | 3 | Outlier | C | okc | 82 | 240 | 0.580 | 0.036 | 0.022 | 0.029 | 0.036 | 0.181 | 0.043 | 0.094 | 0.456 | 0.188 | 0.420 | 0.429 | 0.116 | 0.275 | 0.842 | 0.080 | 0.094 |
| Ivica Zubac | 4 | Prototype | C | lac | 84 | 240 | 0.422 | 0.066 | 0.061 | 0.020 | 0.119 | 0.230 | 0.041 | 0.111 | 0.626 | 0.168 | 0.266 | 0.000 | 0.000 | 0.000 | 0.727 | 0.090 | 0.123 |
| Bismack Biyombo | 4 | Prototype | C | phx | 80 | 255 | 0.411 | 0.043 | 0.050 | 0.021 | 0.128 | 0.206 | 0.050 | 0.135 | 0.593 | 0.170 | 0.284 | 0.000 | 0.000 | 0.000 | 0.535 | 0.078 | 0.142 |
| JaVale McGee | 4 | Outlier | C | phx | 84 | 270 | 0.582 | 0.038 | 0.082 | 0.019 | 0.139 | 0.285 | 0.070 | 0.152 | 0.629 | 0.247 | 0.392 | 0.222 | 0.000 | 0.006 | 0.699 | 0.089 | 0.127 |
| Thaddeus Young | 4 | Outlier | PF | sa | 80 | 235 | 0.430 | 0.162 | 0.085 | 0.063 | 0.106 | 0.141 | 0.021 | 0.106 | 0.578 | 0.197 | 0.345 | 0.000 | 0.000 | 0.014 | 0.455 | 0.028 | 0.056 |
| Rudy Gobert | 4 | Outlier | C | utah | 85 | 258 | 0.486 | 0.034 | 0.056 | 0.022 | 0.115 | 0.343 | 0.065 | 0.084 | 0.713 | 0.171 | 0.240 | 0.000 | 0.000 | 0.003 | 0.690 | 0.143 | 0.209 |
| Karl-Anthony Towns | 5 | Prototype | C | min | 83 | 248 | 0.737 | 0.108 | 0.093 | 0.030 | 0.078 | 0.216 | 0.033 | 0.108 | 0.529 | 0.260 | 0.491 | 0.410 | 0.060 | 0.147 | 0.822 | 0.156 | 0.189 |
| Jonas Valanciunas | 5 | Prototype | C | no | 83 | 265 | 0.587 | 0.086 | 0.079 | 0.020 | 0.102 | 0.274 | 0.026 | 0.109 | 0.544 | 0.228 | 0.419 | 0.361 | 0.026 | 0.069 | 0.820 | 0.106 | 0.129 |
| Pascal Siakam | 5 | Prototype | PF | tor | 81 | 230 | 0.602 | 0.140 | 0.071 | 0.034 | 0.050 | 0.174 | 0.016 | 0.087 | 0.494 | 0.232 | 0.470 | 0.344 | 0.029 | 0.084 | 0.749 | 0.111 | 0.148 |
| DeMarcus Cousins | 5 | Outlier | C | den | 82 | 270 | 0.640 | 0.122 | 0.158 | 0.043 | 0.115 | 0.281 | 0.029 | 0.216 | 0.456 | 0.216 | 0.475 | 0.324 | 0.058 | 0.173 | 0.736 | 0.151 | 0.201 |
| Giannis Antetokounmpo | 5 | Outlier | PF | mil | 83 | 242 | 0.909 | 0.176 | 0.100 | 0.033 | 0.061 | 0.292 | 0.043 | 0.097 | 0.553 | 0.313 | 0.565 | 0.293 | 0.033 | 0.109 | 0.722 | 0.252 | 0.347 |
| Joel Embiid | 5 | Outlier | C | phi | 84 | 280 | 0.905 | 0.124 | 0.092 | 0.033 | 0.062 | 0.284 | 0.044 | 0.080 | 0.499 | 0.290 | 0.580 | 0.371 | 0.041 | 0.109 | 0.814 | 0.284 | 0.349 |
| Jaylen Brown | 6 | Prototype | SG | bos | 78 | 223 | 0.702 | 0.104 | 0.080 | 0.033 | 0.024 | 0.158 | 0.009 | 0.074 | 0.473 | 0.259 | 0.548 | 0.358 | 0.074 | 0.208 | 0.758 | 0.110 | 0.143 |
| Khris Middleton | 6 | Prototype | SF | mil | 79 | 222 | 0.620 | 0.167 | 0.090 | 0.037 | 0.019 | 0.148 | 0.009 | 0.074 | 0.443 | 0.210 | 0.478 | 0.373 | 0.077 | 0.204 | 0.890 | 0.120 | 0.136 |
| Bradley Beal | 6 | Prototype | SG | wsh | 75 | 207 | 0.644 | 0.183 | 0.094 | 0.025 | 0.028 | 0.106 | 0.011 | 0.067 | 0.451 | 0.242 | 0.536 | 0.300 | 0.044 | 0.147 | 0.833 | 0.117 | 0.142 |
| Trae Young | 6 | Outlier | PG | atl | 73 | 180 | 0.814 | 0.278 | 0.115 | 0.026 | 0.020 | 0.089 | 0.003 | 0.049 | 0.460 | 0.269 | 0.582 | 0.382 | 0.089 | 0.229 | 0.904 | 0.189 | 0.209 |
| James Harden | 6 | Outlier | SG | bkn | 77 | 220 | 0.608 | 0.276 | 0.130 | 0.035 | 0.027 | 0.189 | 0.019 | 0.065 | 0.414 | 0.178 | 0.432 | 0.332 | 0.062 | 0.189 | 0.869 | 0.186 | 0.216 |
| Luka Doncic | 6 | Outlier | PG | dal | 79 | 230 | 0.802 | 0.246 | 0.127 | 0.034 | 0.025 | 0.234 | 0.017 | 0.062 | 0.457 | 0.280 | 0.610 | 0.353 | 0.088 | 0.249 | 0.744 | 0.158 | 0.212 |
| Torrey Craig | 7 | Prototype | SF | ind | 79 | 221 | 0.320 | 0.054 | 0.039 | 0.025 | 0.059 | 0.133 | 0.020 | 0.094 | 0.456 | 0.123 | 0.271 | 0.333 | 0.044 | 0.133 | 0.771 | 0.025 | 0.034 |
| Torrey Craig1 | 7 | Prototype | SF | phx | 79 | 221 | 0.332 | 0.058 | 0.048 | 0.038 | 0.048 | 0.159 | 0.029 | 0.101 | 0.450 | 0.130 | 0.284 | 0.323 | 0.053 | 0.173 | 0.706 | 0.019 | 0.029 |
| CJ Elleby | 7 | Prototype | SG | por | 78 | 200 | 0.287 | 0.074 | 0.050 | 0.030 | 0.054 | 0.139 | 0.015 | 0.099 | 0.393 | 0.104 | 0.262 | 0.294 | 0.030 | 0.109 | 0.714 | 0.050 | 0.069 |
| Gary Payton II | 7 | Outlier | SG | gs | 75 | 195 | 0.403 | 0.051 | 0.034 | 0.080 | 0.057 | 0.142 | 0.017 | 0.102 | 0.616 | 0.170 | 0.273 | 0.358 | 0.034 | 0.097 | 0.603 | 0.028 | 0.045 |
| Xavier Tillman | 7 | Outlier | C | mem | 80 | 245 | 0.364 | 0.091 | 0.045 | 0.068 | 0.091 | 0.136 | 0.023 | 0.091 | 0.454 | 0.136 | 0.311 | 0.204 | 0.015 | 0.068 | 0.648 | 0.068 | 0.098 |
| Thaddeus Young1 | 7 | Outlier | PF | tor | 80 | 235 | 0.344 | 0.093 | 0.044 | 0.066 | 0.082 | 0.158 | 0.022 | 0.093 | 0.465 | 0.142 | 0.301 | 0.395 | 0.038 | 0.093 | 0.481 | 0.027 | 0.055 |

Below is the smaller table.

mast_dist_slice1 <- distances %>%
  group_by(Cluster) %>%
  mutate(
    outlier_rank = order(order(distance, decreasing = TRUE)),
    proto_rank = order(order(distance, decreasing = FALSE))) %>%
  filter(outlier_rank < 2 | proto_rank < 2) %>%
  mutate(
    Category = if_else(proto_rank < 2, "Prototype", "Outlier")
  ) %>% 
  select(Name, Cluster, Category) %>%
  left_join(role, by = "Name") %>% arrange(desc(Category), Cluster) %>%
  filter(!(Name %in% disqualify))
mast_dist_slice1 %>%
  mutate(across(where(is.numeric), ~round(.x, digits = 3))) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(4:7), width = .6) %>%
  width(j = c(8:14), width = .95)

| Name | Cluster | Category | POS | Team | Height | Weight | PTSPerMin | ASTPerMin | TOPerMin | STLPerMin | ORPerMin | DRPerMin | BLKPerMin | PFPerMin | FGP | FGMPerMin | FGAPerMin | 3PP | 3PMPerMin | 3PAPerMin | FTP | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Trendon Watford | 1 | Prototype | PF | por | 81 | 240 | 0.420 | 0.094 | 0.050 | 0.028 | 0.066 | 0.166 | 0.033 | 0.133 | 0.532 | 0.166 | 0.309 | 0.237 | 0.011 | 0.044 | 0.755 | 0.083 | 0.110 |
| Eric Bledsoe | 2 | Prototype | SG | lac | 73 | 214 | 0.393 | 0.167 | 0.083 | 0.052 | 0.020 | 0.115 | 0.016 | 0.063 | 0.421 | 0.143 | 0.345 | 0.313 | 0.036 | 0.119 | 0.761 | 0.063 | 0.087 |
| Coby White | 3 | Prototype | PG | chi | 77 | 195 | 0.462 | 0.105 | 0.040 | 0.018 | 0.011 | 0.098 | 0.007 | 0.080 | 0.433 | 0.167 | 0.385 | 0.385 | 0.080 | 0.211 | 0.857 | 0.047 | 0.055 |
| Ivica Zubac | 4 | Prototype | C | lac | 84 | 240 | 0.422 | 0.066 | 0.061 | 0.020 | 0.119 | 0.230 | 0.041 | 0.111 | 0.626 | 0.168 | 0.266 | 0.000 | 0.000 | 0.000 | 0.727 | 0.090 | 0.123 |
| Karl-Anthony Towns | 5 | Prototype | C | min | 83 | 248 | 0.737 | 0.108 | 0.093 | 0.030 | 0.078 | 0.216 | 0.033 | 0.108 | 0.529 | 0.260 | 0.491 | 0.410 | 0.060 | 0.147 | 0.822 | 0.156 | 0.189 |
| Khris Middleton | 6 | Prototype | SF | mil | 79 | 222 | 0.620 | 0.167 | 0.090 | 0.037 | 0.019 | 0.148 | 0.009 | 0.074 | 0.443 | 0.210 | 0.478 | 0.373 | 0.077 | 0.204 | 0.890 | 0.120 | 0.136 |
| Torrey Craig | 7 | Prototype | SF | ind | 79 | 221 | 0.320 | 0.054 | 0.039 | 0.025 | 0.059 | 0.133 | 0.020 | 0.094 | 0.456 | 0.123 | 0.271 | 0.333 | 0.044 | 0.133 | 0.771 | 0.025 | 0.034 |
| Jakob Poeltl | 1 | Outlier | C | sa | 85 | 245 | 0.466 | 0.097 | 0.055 | 0.024 | 0.134 | 0.190 | 0.059 | 0.107 | 0.618 | 0.207 | 0.338 | 1.000 | 0.000 | 0.000 | 0.495 | 0.048 | 0.097 |
| Draymond Green | 2 | Outlier | PF | gs | 78 | 230 | 0.260 | 0.242 | 0.104 | 0.045 | 0.035 | 0.218 | 0.038 | 0.104 | 0.525 | 0.100 | 0.194 | 0.296 | 0.010 | 0.042 | 0.659 | 0.045 | 0.069 |
| Kevin Love | 3 | Outlier | PF | cle | 80 | 251 | 0.604 | 0.098 | 0.058 | 0.018 | 0.053 | 0.271 | 0.009 | 0.062 | 0.430 | 0.196 | 0.458 | 0.392 | 0.111 | 0.284 | 0.838 | 0.098 | 0.120 |
| Thaddeus Young | 4 | Outlier | PF | sa | 80 | 235 | 0.430 | 0.162 | 0.085 | 0.063 | 0.106 | 0.141 | 0.021 | 0.106 | 0.578 | 0.197 | 0.345 | 0.000 | 0.000 | 0.014 | 0.455 | 0.028 | 0.056 |
| DeMarcus Cousins | 5 | Outlier | C | den | 82 | 270 | 0.640 | 0.122 | 0.158 | 0.043 | 0.115 | 0.281 | 0.029 | 0.216 | 0.456 | 0.216 | 0.475 | 0.324 | 0.058 | 0.173 | 0.736 | 0.151 | 0.201 |
| Trae Young | 6 | Outlier | PG | atl | 73 | 180 | 0.814 | 0.278 | 0.115 | 0.026 | 0.020 | 0.089 | 0.003 | 0.049 | 0.460 | 0.269 | 0.582 | 0.382 | 0.089 | 0.229 | 0.904 | 0.189 | 0.209 |
| Gary Payton II | 7 | Outlier | SG | gs | 75 | 195 | 0.403 | 0.051 | 0.034 | 0.080 | 0.057 | 0.142 | 0.017 | 0.102 | 0.616 | 0.170 | 0.273 | 0.358 | 0.034 | 0.097 | 0.603 | 0.028 | 0.045 |

Look through the prototypes and outliers. Compare their results with your previous findings. Do the prototypes of each cluster match up with your summary of the cluster? How do the outliers fit in? Two outliers can be very different. Pick a few outliers and determine their closest two clusters.
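One way to determine an outlier’s two closest clusters is to compute its distance to every cluster center and keep the two smallest. This is a sketch, assuming `roleKMeans_prep` and the K = 7 `roleKMeans` object from the earlier chunks are still in memory (the player name is just an example):

```r
# distance from one player's standardized stats to each of the 7 cluster
# centers; the two smallest distances mark the player's two closest clusters
player <- "Draymond Green"
center_dist <- apply(roleKMeans$centers, 1,
                     function(ctr) sqrt(sum((roleKMeans_prep[player, ] - ctr)^2)))
sort(center_dist)[1:2]
```

If an outlier’s second-closest center is nearly as close as its assigned one, the player sits between two roles rather than firmly inside either.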

rolefviz

Analyze the K = 7 clusters as a whole. Are the clusters good? Do they have high intra-cluster similarity? What about low inter-cluster similarity? If you were to do the analysis again, would you choose the same number of clusters?

Compare lots of Ks

Select two values of K (between 2 and 10) to compare. This table can become very complex. Remember, the rows are the cluster assignments under the smaller of the two Ks, and the columns are the cluster assignments under the larger K. Isolate and analyze one row or column at a time.

# let's say the student wants to compare K = 3 and K = 7
stu_clus1 <- 7
stu_clus2 <- 3
# ensures that the first chosen cluster is lower.
if(stu_clus1 > stu_clus2) {
  space = stu_clus1
  stu_clus1 = stu_clus2
  stu_clus2 = space
}

set.seed(100)
roleKMeans <- kmeans(roleKMeans_prep, centers = stu_clus1, nstart = 50)
set.seed(100)
roleK2Means <- kmeans(roleKMeans_prep, centers = stu_clus2, nstart = 50)

# creating a tibble of the cluster of each player for each K
clusters <- tibble(
  player = role$Name,
  Cluster = roleKMeans$cluster,
  clusK2 = roleK2Means$cluster
)

compare_table <- with(clusters, table(Cluster, clusK2)) %>%
  as_tibble() %>%
  pivot_wider(names_from = clusK2, values_from = n)

# tabulating clusters
compare_table %>%
  flextable() %>%
  align(align = "center", part = "all")

| Cluster | 1  | 2  | 3  | 4  | 5  | 6  | 7  |
|---------|----|----|----|----|----|----|----|
| 1       | 0  | 9  | 17 | 0  | 8  | 38 | 0  |
| 2       | 1  | 35 | 79 | 0  | 0  | 0  | 90 |
| 3       | 50 | 1  | 0  | 26 | 12 | 0  | 8  |

Part 7: GM of Dallas Mavericks

Returning to the Dallas Mavericks, let’s take a look at how the Mavericks players were clustered in our role dataset. Let’s use K = 7. If you did not analyze K = 7 earlier, it is worth a look.

Below are a few visual reminders of each cluster’s characteristics.

# initializing our datasets a third time in case student decided to remove a variable
role <- nba %>%
  select(Name, POS, Team, Height, Weight, FGP, `3PP`, FTP,  PTSPerMin, ORPerMin, DRPerMin, ASTPerMin, STLPerMin, BLKPerMin, TOPerMin, PFPerMin, FGMPerMin, FGAPerMin, `3PMPerMin`, `3PAPerMin`, FTMPerMin, FTAPerMin) %>%
  mutate(across(where(is.numeric), round, digits = 4))

# standardizing the data for KMeans
roleKMeans_prep <- role %>%
  mutate(across(where(is.numeric), standardize)) %>%
  column_to_rownames(var = "Name") %>%
  select(-Team, -POS)

# creating K = 7 K-Means
set.seed(100)
role7Means <- kmeans(roleKMeans_prep, centers = 7, nstart = 50)

# bar graph of centers
as_tibble(role7Means$centers, rownames = "cluster") %>%
  pivot_longer(cols = c(Height:FTAPerMin), names_to = "variable") %>%
  mutate(variable = factor(variable, role_levels)) %>%
  ggplot(aes(x = variable, y = value, fill = cluster)) +
  geom_bar(stat = "identity") +
  geom_hline(yintercept = 0) +
  coord_flip() +
  facet_grid(cols = vars(cluster), switch = "both") +
  labs(title = "Influence on the Cluster Assignment", x = "", y = "Cluster") +
  theme(axis.text.x = element_blank(),
        legend.position = "none")

# creating tibble of all the centers
role7centers <- as_tibble(role7Means$cluster) %>%
  mutate(Name = role$Name) %>%
  rename(Clusters = value) %>% left_join(role, by = "Name") %>%
  group_by(Clusters) %>%
  summarise(
    across(where(is.numeric), mean)) %>%
  mutate(Clusters = as.character(Clusters)) %>%
  bind_rows(role_summary) %>%
  mutate(across(where(is.numeric), round, digits = 3),
         Height = round(Height, digits = 1),
         Weight = round(Weight, digits = 1))

# printing conditional formatting table
role7centers %>%
  reactable(
    defaultColDef = colDef(
      cell = color_tiles(.)
    ))

Before moving on, fill out this table to describe each cluster. Write a few descriptive words that distinguish each cluster. This will help you to organize your thoughts on each cluster. If you already completed this for K = 7 in the role dataset, then you are free to proceed.

stu_table <- tibble(
  Cluster = 1:7,
  Description = "")

stu_table %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 2, width = 4)

| Cluster | Description |
|---------|-------------|
| 1       |             |
| 2       |             |
| 3       |             |
| 4       |             |
| 5       |             |
| 6       |             |
| 7       |             |

Below is Caleb’s estimation of the 7 clusters.

Caleb_table <- tibble(
  Cluster = 1:7,
  Description = c("big men, mediocre scorers, kinda shoot deep",
                  "small point guards, facilitators",
                  "meh players, 3 point shooters",
                  "big men, can't shoot deep at all",
                  "high-volume players, generally tall",
                  "high-volume players, average height",
                  "low production, very mediocre, likely corner 3 players"))

Caleb_table %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 2, width = 4)

| Cluster | Description |
|---------|-------------|
| 1 | big men, mediocre scorers, kinda shoot deep |
| 2 | small point guards, facilitators |
| 3 | meh players, 3 point shooters |
| 4 | big men, can't shoot deep at all |
| 5 | high-volume players, generally tall |
| 6 | high-volume players, average height |
| 7 | low production, very mediocre, likely corner 3 players |

Mavericks Offseason Analysis

Now, let’s look at the cluster assignments of our ten Dallas Mavericks players.

role7Means_players <- role7Means$cluster %>%
  as_tibble() %>%
  rename(Cluster = value) %>%
  mutate(
    Name = role$Name
  ) %>%
  left_join(role, by = "Name") %>%
  left_join(usage %>% select(Name, MIN), by = "Name") %>%
  relocate(Cluster, .after = Name) %>%
  relocate(MIN, .after = POS) %>%
  arrange(Cluster)

dallas_role2022 <- role7Means_players %>%
  filter(Team == "dal") %>%
  select(-Team)

dallas_role2022 %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:10), width = .6) %>%
  width(j = c(11:14), width = .95)

| Name | Cluster | POS | MIN | Height | Weight | FGP | 3PP | FTP | PTSPerMin | ORPerMin | DRPerMin | ASTPerMin | STLPerMin | BLKPerMin | TOPerMin | PFPerMin | FGMPerMin | FGAPerMin | 3PMPerMin | 3PAPerMin | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dwight Powell | 1 | C | 21.9 | 82 | 240 | 0.671 | 0.351 | 0.783 | 0.3973 | 0.0959 | 0.1279 | 0.0548 | 0.0228 | 0.0228 | 0.0365 | 0.1233 | 0.1507 | 0.2237 | 0.0091 | 0.0228 | 0.0913 | 0.1187 |
| Jalen Brunson | 2 | PG | 31.9 | 73 | 190 | 0.502 | 0.373 | 0.840 | 0.5110 | 0.0157 | 0.1066 | 0.1505 | 0.0251 | 0.0000 | 0.0502 | 0.0596 | 0.2006 | 0.4013 | 0.0376 | 0.1003 | 0.0721 | 0.0846 |
| Tim Hardaway Jr. | 3 | SF | 29.6 | 77 | 205 | 0.394 | 0.336 | 0.757 | 0.4797 | 0.0101 | 0.1149 | 0.0743 | 0.0304 | 0.0034 | 0.0270 | 0.0608 | 0.1689 | 0.4257 | 0.0811 | 0.2432 | 0.0642 | 0.0845 |
| Kristaps Porzingis | 5 | C | 29.5 | 87 | 240 | 0.451 | 0.283 | 0.865 | 0.6508 | 0.0644 | 0.1966 | 0.0678 | 0.0237 | 0.0576 | 0.0542 | 0.0881 | 0.2271 | 0.5051 | 0.0475 | 0.1729 | 0.1458 | 0.1695 |
| Luka Doncic | 6 | PG | 35.4 | 79 | 230 | 0.457 | 0.353 | 0.744 | 0.8023 | 0.0254 | 0.2345 | 0.2458 | 0.0339 | 0.0169 | 0.1271 | 0.0621 | 0.2797 | 0.6102 | 0.0876 | 0.2486 | 0.1582 | 0.2119 |
| Dorian Finney-Smith | 7 | PF | 33.1 | 79 | 220 | 0.471 | 0.395 | 0.675 | 0.3323 | 0.0453 | 0.0967 | 0.0574 | 0.0332 | 0.0151 | 0.0302 | 0.0695 | 0.1239 | 0.2628 | 0.0665 | 0.1631 | 0.0211 | 0.0302 |
| Reggie Bullock | 7 | SF | 28.0 | 78 | 205 | 0.401 | 0.360 | 0.833 | 0.3071 | 0.0179 | 0.1107 | 0.0429 | 0.0214 | 0.0071 | 0.0214 | 0.0571 | 0.1071 | 0.2643 | 0.0750 | 0.2071 | 0.0214 | 0.0250 |
| Maxi Kleber | 7 | PF | 24.6 | 82 | 240 | 0.398 | 0.325 | 0.708 | 0.2846 | 0.0488 | 0.1911 | 0.0488 | 0.0203 | 0.0407 | 0.0325 | 0.0935 | 0.0976 | 0.2439 | 0.0569 | 0.1748 | 0.0325 | 0.0447 |
| Josh Green | 7 | SG | 15.5 | 77 | 200 | 0.508 | 0.359 | 0.689 | 0.3097 | 0.0516 | 0.1032 | 0.0774 | 0.0452 | 0.0129 | 0.0452 | 0.1097 | 0.1226 | 0.2452 | 0.0258 | 0.0774 | 0.0323 | 0.0452 |
| Sterling Brown | 7 | SF | 12.8 | 77 | 219 | 0.381 | 0.304 | 0.933 | 0.2578 | 0.0391 | 0.1953 | 0.0547 | 0.0234 | 0.0078 | 0.0391 | 0.0859 | 0.0937 | 0.2500 | 0.0469 | 0.1484 | 0.0234 | 0.0234 |

What do you notice about the player assignments? How many clusters do the Mavericks have represented? Which cluster is the most common on the Mavericks team?

Why is cluster 7 the most common? What kind of player is in cluster 7?

The Mavericks experienced a bit of turnover in the 2022 offseason. They’d already traded away C Kristaps Porzingis for SG Spencer Dinwiddie at the end of the 2022 season, and they lost productive SG Jalen Brunson to free agency. They then traded away SF Sterling Brown and other assets for C Christian Wood in the summer of 2022.

Let’s assess the offseason moves of the Dallas Mavericks by looking at the opening day roster for 2023 and its cluster distribution. Below are the eleven players on the Dallas Mavericks roster at Game 1 of the 2023 season, a loss against the Phoenix Suns.

dallas_role2023 <- role7Means_players %>%
  filter(Name %in% c("JaVale McGee", "Reggie Bullock", "Dorian Finney-Smith", "Spencer Dinwiddie", "Luka Doncic", "Tim Hardaway Jr.", "Maxi Kleber", "Christian Wood", "Josh Green", "Dwight Powell", "Davis Bertans")) %>%
  select(-Team) %>%
  arrange(Cluster)

dallas_role2023 %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:10), width = .6) %>%
  width(j = c(11:14), width = .95)

| Name | Cluster | POS | MIN | Height | Weight | FGP | 3PP | FTP | PTSPerMin | ORPerMin | DRPerMin | ASTPerMin | STLPerMin | BLKPerMin | TOPerMin | PFPerMin | FGMPerMin | FGAPerMin | 3PMPerMin | 3PAPerMin | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dwight Powell | 1 | C | 21.9 | 82 | 240 | 0.671 | 0.351 | 0.783 | 0.3973 | 0.0959 | 0.1279 | 0.0548 | 0.0228 | 0.0228 | 0.0365 | 0.1233 | 0.1507 | 0.2237 | 0.0091 | 0.0228 | 0.0913 | 0.1187 |
| Tim Hardaway Jr. | 3 | SF | 29.6 | 77 | 205 | 0.394 | 0.336 | 0.757 | 0.4797 | 0.0101 | 0.1149 | 0.0743 | 0.0304 | 0.0034 | 0.0270 | 0.0608 | 0.1689 | 0.4257 | 0.0811 | 0.2432 | 0.0642 | 0.0845 |
| Spencer Dinwiddie | 3 | PG | 30.2 | 77 | 215 | 0.376 | 0.310 | 0.811 | 0.4172 | 0.0265 | 0.1291 | 0.1921 | 0.0199 | 0.0066 | 0.0563 | 0.0795 | 0.1391 | 0.3709 | 0.0530 | 0.1689 | 0.0861 | 0.1093 |
| Davis Bertans | 3 | SF | 14.7 | 82 | 225 | 0.351 | 0.319 | 0.933 | 0.3878 | 0.0136 | 0.1088 | 0.0340 | 0.0204 | 0.0136 | 0.0272 | 0.1088 | 0.1224 | 0.3401 | 0.0952 | 0.2857 | 0.0544 | 0.0612 |
| JaVale McGee | 4 | C | 15.8 | 84 | 270 | 0.629 | 0.222 | 0.699 | 0.5823 | 0.1392 | 0.2848 | 0.0380 | 0.0190 | 0.0696 | 0.0823 | 0.1519 | 0.2468 | 0.3924 | 0.0000 | 0.0063 | 0.0886 | 0.1266 |
| Christian Wood | 5 | C | 30.8 | 82 | 214 | 0.501 | 0.390 | 0.623 | 0.5812 | 0.0519 | 0.2760 | 0.0747 | 0.0260 | 0.0325 | 0.0617 | 0.0812 | 0.2110 | 0.4188 | 0.0617 | 0.1591 | 0.0974 | 0.1591 |
| Luka Doncic | 6 | PG | 35.4 | 79 | 230 | 0.457 | 0.353 | 0.744 | 0.8023 | 0.0254 | 0.2345 | 0.2458 | 0.0339 | 0.0169 | 0.1271 | 0.0621 | 0.2797 | 0.6102 | 0.0876 | 0.2486 | 0.1582 | 0.2119 |
| Dorian Finney-Smith | 7 | PF | 33.1 | 79 | 220 | 0.471 | 0.395 | 0.675 | 0.3323 | 0.0453 | 0.0967 | 0.0574 | 0.0332 | 0.0151 | 0.0302 | 0.0695 | 0.1239 | 0.2628 | 0.0665 | 0.1631 | 0.0211 | 0.0302 |
| Reggie Bullock | 7 | SF | 28.0 | 78 | 205 | 0.401 | 0.360 | 0.833 | 0.3071 | 0.0179 | 0.1107 | 0.0429 | 0.0214 | 0.0071 | 0.0214 | 0.0571 | 0.1071 | 0.2643 | 0.0750 | 0.2071 | 0.0214 | 0.0250 |
| Maxi Kleber | 7 | PF | 24.6 | 82 | 240 | 0.398 | 0.325 | 0.708 | 0.2846 | 0.0488 | 0.1911 | 0.0488 | 0.0203 | 0.0407 | 0.0325 | 0.0935 | 0.0976 | 0.2439 | 0.0569 | 0.1748 | 0.0325 | 0.0447 |
| Josh Green | 7 | SG | 15.5 | 77 | 200 | 0.508 | 0.359 | 0.689 | 0.3097 | 0.0516 | 0.1032 | 0.0774 | 0.0452 | 0.0129 | 0.0452 | 0.1097 | 0.1226 | 0.2452 | 0.0258 | 0.0774 | 0.0323 | 0.0452 |

The roster looks somewhat similar, but what classification of player did the Mavericks lose in the 2022 season and not return in the 2023 season? What classification of player did the Mavericks gain in the 2023 season?

Answer: They lost a cluster 2 player and a cluster 7 player, and they gained two cluster 3 players and a cluster 4 player.

What kind of player is in cluster 2? What would losing this kind of player do to a team?

Dallas Mavericks Trade

Let’s say you’re the GM of the Dallas Mavericks after game 1 of the 2022-2023 season. Which players would you consider trading and what cluster of player would you hope to acquire? Which players are you willing to give up?

Answer: I think the correct answer here is to give up players from cluster 3 or 7 in exchange for a cluster 2 player. Maxi Kleber is the most expendable because he has some features of clusters 1, 4, and 5 along with cluster 7, and the team has an excess of these players.

Select four players you are willing to trade and one cluster that you are looking for.

# let's say the student is smart and chooses
trading <- c("Davis Bertans", "Spencer Dinwiddie", "Maxi Kleber", "Dwight Powell")
# and is looking for a player in cluster...
looking <- 2
looking_clus <- role7Means_players %>%
  filter(Cluster == looking)

looking_clus %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:5), width = .6) %>%
  width(j = c(6:12), width = .95)

| Name | Cluster | POS | MIN | Team | Height | Weight | FGP | 3PP | FTP | PTSPerMin | ORPerMin | DRPerMin | ASTPerMin | STLPerMin | BLKPerMin | TOPerMin | PFPerMin | FGMPerMin | FGAPerMin | 3PMPerMin | 3PAPerMin | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lou Williams | 2 | SG | 14.3 | atl | 73 | 175 | 0.391 | 0.363 | 0.859 | 0.4406 | 0.0210 | 0.0909 | 0.1329 | 0.0350 | 0.0070 | 0.0559 | 0.0629 | 0.1538 | 0.3986 | 0.0490 | 0.1259 | 0.0839 | 0.0979 |
| Dennis Schroder | 2 | PG | 29.2 | bos | 75 | 172 | 0.440 | 0.349 | 0.848 | 0.4932 | 0.0205 | 0.0959 | 0.1438 | 0.0274 | 0.0034 | 0.0719 | 0.0822 | 0.1781 | 0.4075 | 0.0479 | 0.1336 | 0.0856 | 0.1027 |
| Marcus Smart | 2 | PG | 32.3 | bos | 75 | 220 | 0.418 | 0.331 | 0.793 | 0.3746 | 0.0186 | 0.0991 | 0.1827 | 0.0526 | 0.0093 | 0.0681 | 0.0712 | 0.1300 | 0.3127 | 0.0526 | 0.1579 | 0.0619 | 0.0774 |
| Ish Smith | 2 | PG | 13.8 | cha | 72 | 175 | 0.395 | 0.400 | 0.632 | 0.3261 | 0.0217 | 0.0870 | 0.1884 | 0.0362 | 0.0217 | 0.0725 | 0.0652 | 0.1449 | 0.3623 | 0.0217 | 0.0507 | 0.0217 | 0.0362 |
| Lonzo Ball | 2 | PG | 34.6 | chi | 78 | 190 | 0.423 | 0.423 | 0.750 | 0.3757 | 0.0289 | 0.1272 | 0.1474 | 0.0520 | 0.0260 | 0.0665 | 0.0694 | 0.1329 | 0.3150 | 0.0896 | 0.2139 | 0.0173 | 0.0231 |
| Alex Caruso | 2 | SG | 28.0 | chi | 76 | 186 | 0.398 | 0.333 | 0.795 | 0.2643 | 0.0286 | 0.1000 | 0.1429 | 0.0607 | 0.0143 | 0.0500 | 0.0929 | 0.0893 | 0.2214 | 0.0357 | 0.1107 | 0.0500 | 0.0643 |
| Ricky Rubio | 2 | PG | 28.5 | cle | 75 | 190 | 0.363 | 0.339 | 0.854 | 0.4596 | 0.0140 | 0.1298 | 0.2316 | 0.0491 | 0.0070 | 0.0912 | 0.0772 | 0.1544 | 0.4246 | 0.0596 | 0.1789 | 0.0912 | 0.1053 |
| Brandon Goodwin | 2 | G | 13.9 | cle | 72 | 180 | 0.416 | 0.345 | 0.632 | 0.3453 | 0.0288 | 0.1079 | 0.1799 | 0.0504 | 0.0000 | 0.0719 | 0.0791 | 0.1295 | 0.3094 | 0.0360 | 0.1079 | 0.0504 | 0.0791 |
| Jalen Brunson | 2 | PG | 31.9 | dal | 73 | 190 | 0.502 | 0.373 | 0.840 | 0.5110 | 0.0157 | 0.1066 | 0.1505 | 0.0251 | 0.0000 | 0.0502 | 0.0596 | 0.2006 | 0.4013 | 0.0376 | 0.1003 | 0.0721 | 0.0846 |
| Facundo Campazzo | 2 | PG | 18.2 | den | 70 | 195 | 0.361 | 0.301 | 0.769 | 0.2802 | 0.0220 | 0.0769 | 0.1868 | 0.0549 | 0.0220 | 0.0549 | 0.1044 | 0.0879 | 0.2527 | 0.0495 | 0.1648 | 0.0495 | 0.0659 |
| Cory Joseph | 2 | PG | 24.6 | det | 75 | 200 | 0.445 | 0.414 | 0.885 | 0.3252 | 0.0163 | 0.0894 | 0.1463 | 0.0244 | 0.0122 | 0.0528 | 0.0935 | 0.1098 | 0.2520 | 0.0407 | 0.0976 | 0.0610 | 0.0691 |
| Killian Hayes | 2 | PG | 25.0 | det | 77 | 195 | 0.383 | 0.263 | 0.770 | 0.2760 | 0.0200 | 0.1040 | 0.1680 | 0.0480 | 0.0200 | 0.0680 | 0.1120 | 0.1080 | 0.2800 | 0.0280 | 0.1000 | 0.0360 | 0.0440 |
| Saben Lee | 2 | PG | 16.3 | det | 74 | 183 | 0.390 | 0.233 | 0.789 | 0.3436 | 0.0307 | 0.1166 | 0.1779 | 0.0613 | 0.0184 | 0.0613 | 0.0736 | 0.1166 | 0.2945 | 0.0245 | 0.0982 | 0.0920 | 0.1166 |
| Draymond Green | 2 | PF | 28.9 | gs | 78 | 230 | 0.525 | 0.296 | 0.659 | 0.2595 | 0.0346 | 0.2180 | 0.2422 | 0.0450 | 0.0381 | 0.1038 | 0.1038 | 0.1003 | 0.1938 | 0.0104 | 0.0415 | 0.0450 | 0.0692 |
| Kevin Porter Jr. | 2 | SG | 31.3 | hou | 76 | 203 | 0.415 | 0.375 | 0.642 | 0.4984 | 0.0224 | 0.1182 | 0.1981 | 0.0351 | 0.0128 | 0.0990 | 0.0831 | 0.1757 | 0.4217 | 0.0799 | 0.2173 | 0.0639 | 0.1022 |
| Josh Christopher | 2 | SG | 18.0 | hou | 77 | 215 | 0.448 | 0.296 | 0.735 | 0.4389 | 0.0389 | 0.1000 | 0.1111 | 0.0500 | 0.0111 | 0.0833 | 0.0722 | 0.1667 | 0.3778 | 0.0444 | 0.1444 | 0.0611 | 0.0833 |
| D.J. Augustin | 2 | G | 15.0 | hou | 71 | 183 | 0.404 | 0.406 | 0.868 | 0.3600 | 0.0133 | 0.0667 | 0.1467 | 0.0200 | 0.0000 | 0.0867 | 0.0333 | 0.1067 | 0.2667 | 0.0733 | 0.1867 | 0.0667 | 0.0733 |
| Tyrese Haliburton | 2 | PG | 36.1 | ind | 77 | 185 | 0.502 | 0.416 | 0.849 | 0.4848 | 0.0222 | 0.0970 | 0.2659 | 0.0499 | 0.0166 | 0.0886 | 0.0526 | 0.1717 | 0.3435 | 0.0609 | 0.1468 | 0.0776 | 0.0914 |
| T.J. McConnell | 2 | PG | 24.1 | ind | 73 | 190 | 0.481 | 0.303 | 0.826 | 0.3527 | 0.0290 | 0.1079 | 0.2033 | 0.0456 | 0.0166 | 0.0456 | 0.0830 | 0.1535 | 0.3195 | 0.0166 | 0.0498 | 0.0290 | 0.0373 |
| Keifer Sykes | 2 | G | 17.7 | ind | 71 | 167 | 0.363 | 0.300 | 0.882 | 0.3164 | 0.0169 | 0.0678 | 0.1073 | 0.0226 | 0.0056 | 0.0565 | 0.0904 | 0.1243 | 0.3333 | 0.0452 | 0.1582 | 0.0282 | 0.0282 |
| Eric Bledsoe | 2 | SG | 25.2 | lac | 73 | 214 | 0.421 | 0.313 | 0.761 | 0.3929 | 0.0198 | 0.1151 | 0.1667 | 0.0516 | 0.0159 | 0.0833 | 0.0635 | 0.1429 | 0.3452 | 0.0357 | 0.1190 | 0.0635 | 0.0873 |
| De'Anthony Melton | 2 | SG | 22.7 | mem | 74 | 200 | 0.404 | 0.374 | 0.750 | 0.4758 | 0.0396 | 0.1586 | 0.1189 | 0.0617 | 0.0220 | 0.0661 | 0.0793 | 0.1674 | 0.4185 | 0.0837 | 0.2247 | 0.0529 | 0.0705 |
| Tyus Jones | 2 | PG | 21.2 | mem | 72 | 196 | 0.451 | 0.390 | 0.818 | 0.4104 | 0.0094 | 0.1038 | 0.2075 | 0.0425 | 0.0000 | 0.0283 | 0.0189 | 0.1604 | 0.3585 | 0.0519 | 0.1321 | 0.0330 | 0.0425 |
| Kyle Lowry | 2 | PG | 33.9 | mia | 72 | 196 | 0.440 | 0.377 | 0.851 | 0.3953 | 0.0147 | 0.1180 | 0.2212 | 0.0324 | 0.0088 | 0.0796 | 0.0826 | 0.1298 | 0.2950 | 0.0678 | 0.1799 | 0.0678 | 0.0826 |
| Gabe Vincent | 2 | PG | 23.4 | mia | 75 | 200 | 0.417 | 0.368 | 0.815 | 0.3718 | 0.0128 | 0.0641 | 0.1325 | 0.0385 | 0.0085 | 0.0598 | 0.0983 | 0.1325 | 0.3205 | 0.0769 | 0.2051 | 0.0256 | 0.0342 |
| Jrue Holiday | 2 | PG | 32.9 | mil | 75 | 205 | 0.501 | 0.411 | 0.761 | 0.5562 | 0.0304 | 0.1064 | 0.2067 | 0.0486 | 0.0122 | 0.0821 | 0.0608 | 0.2158 | 0.4316 | 0.0608 | 0.1459 | 0.0608 | 0.0821 |
| Patrick Beverley | 2 | PG | 25.4 | min | 73 | 180 | 0.406 | 0.343 | 0.722 | 0.3622 | 0.0433 | 0.1220 | 0.1811 | 0.0472 | 0.0354 | 0.0512 | 0.1181 | 0.1220 | 0.2953 | 0.0551 | 0.1654 | 0.0669 | 0.0906 |
| Jordan McLaughlin | 2 | PG | 14.5 | min | 71 | 185 | 0.440 | 0.318 | 0.750 | 0.2621 | 0.0276 | 0.0828 | 0.2000 | 0.0621 | 0.0138 | 0.0414 | 0.0621 | 0.0966 | 0.2207 | 0.0276 | 0.0966 | 0.0345 | 0.0414 |
| Jose Alvarado | 2 | PG | 15.4 | no | 72 | 179 | 0.446 | 0.291 | 0.679 | 0.3961 | 0.0325 | 0.0909 | 0.1818 | 0.0844 | 0.0065 | 0.0455 | 0.0909 | 0.1558 | 0.3506 | 0.0390 | 0.1299 | 0.0455 | 0.0649 |
| Josh Giddey | 2 | SG | 31.5 | okc | 80 | 205 | 0.419 | 0.263 | 0.709 | 0.3968 | 0.0571 | 0.1905 | 0.2032 | 0.0286 | 0.0127 | 0.1016 | 0.0508 | 0.1651 | | | | | |

0.3937

0.0317

0.1238

0.0317

0.0476

Theo Maledon

2

PG

17.8

okc

76

175

0.375

0.293

0.790

0.3989

0.0225

0.1236

0.1236

0.0337

0.0112

0.0730

0.0730

0.1292

0.3483

0.0506

0.1629

0.0843

0.1124

Jalen Suggs

2

SG

27.2

orl

76

205

0.361

0.214

0.773

0.4338

0.0184

0.1103

0.1618

0.0441

0.0147

0.1103

0.1103

0.1507

0.4191

0.0331

0.1507

0.0956

0.1250

R.J. Hampton

2

PG

21.9

orl

76

175

0.383

0.350

0.641

0.3470

0.0183

0.1233

0.1142

0.0320

0.0091

0.0639

0.0731

0.1233

0.3242

0.0457

0.1324

0.0548

0.0822

Chris Paul

2

PG

32.9

phx

72

175

0.493

0.317

0.837

0.4468

0.0091

0.1216

0.3283

0.0578

0.0091

0.0729

0.0638

0.1702

0.3435

0.0304

0.0942

0.0790

0.0942

Cameron Payne

2

PG

22.0

phx

73

183

0.409

0.336

0.843

0.4909

0.0182

0.1182

0.2227

0.0318

0.0136

0.0818

0.0955

0.1864

0.4591

0.0545

0.1636

0.0591

0.0682

Dennis Smith Jr.

2

PG

17.3

por

74

205

0.418

0.222

0.656

0.3237

0.0289

0.1040

0.2081

0.0694

0.0173

0.0809

0.0809

0.1214

0.2948

0.0116

0.0405

0.0636

0.0983

Tyrese Haliburton1

2

PG

34.5

sac

77

185

0.457

0.413

0.837

0.4145

0.0232

0.0899

0.2145

0.0493

0.0203

0.0667

0.0406

0.1536

0.3333

0.0580

0.1420

0.0493

0.0580

Davion Mitchell

2

PG

27.7

sac

74

205

0.418

0.316

0.659

0.4152

0.0144

0.0650

0.1516

0.0253

0.0108

0.0542

0.0686

0.1697

0.4043

0.0469

0.1552

0.0253

0.0397

Derrick White1

2

PG

30.3

sa

76

190

0.426

0.314

0.869

0.4752

0.0165

0.0990

0.1848

0.0330

0.0297

0.0594

0.0792

0.1650

0.3828

0.0561

0.1749

0.0924

0.1089

Tre Jones

2

PG

16.6

sa

73

185

0.490

0.196

0.780

0.3614

0.0241

0.1084

0.2048

0.0361

0.0060

0.0422

0.0663

0.1446

0.2952

0.0060

0.0422

0.0602

0.0783

Malachi Flynn

2

PG

12.2

tor

73

175

0.393

0.333

0.625

0.3525

0.0164

0.0984

0.1311

0.0410

0.0082

0.0246

0.0820

0.1311

0.3443

0.0574

0.1639

0.0246

0.0410

Mike Conley

2

PG

28.6

utah

73

175

0.435

0.408

0.796

0.4790

0.0245

0.0839

0.1853

0.0455

0.0105

0.0594

0.0699

0.1678

0.3846

0.0804

0.2028

0.0629

0.0804

Ish Smith1

2

PG

22.0

wsh

72

175

0.457

0.357

0.600

0.3909

0.0227

0.1136

0.2364

0.0455

0.0227

0.0682

0.0727

0.1818

0.4000

0.0227

0.0682

0.0045

0.0091

Raul Neto

2

PG

19.6

wsh

73

180

0.463

0.292

0.769

0.3827

0.0102

0.0867

0.1582

0.0408

0.0000

0.0561

0.0765

0.1480

0.3214

0.0255

0.0867

0.0612

0.0765

Aaron Holiday

2

G

16.2

wsh

72

185

0.467

0.343

0.800

0.3765

0.0123

0.0864

0.1173

0.0370

0.0123

0.0617

0.0926

0.1481

0.3210

0.0370

0.0988

0.0432

0.0556

From the list, choose a player you like from a team that has several of these types of players; a team with a surplus at the position is more likely to part ways with one of them. Assess the strengths of the pertinent players and propose a trade. How does it look?

Feel free to make the trade as complex as you wish, but try to propose something the opposing team would plausibly accept.
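To find teams with a surplus of this player type, you can tally cluster members by team with `dplyr`. A minimal sketch, using a few rows transcribed from the table above (the data frame name `cluster_members` is just an illustration; in practice you would count over your full clustered data set):

```r
library(tidyverse)

# A handful of rows transcribed from the cluster 2 table above
cluster_members <- tribble(
  ~player,            ~team, ~cluster,
  "Cory Joseph",      "det", 2,
  "Killian Hayes",    "det", 2,
  "Saben Lee",        "det", 2,
  "Kevin Porter Jr.", "hou", 2,
  "Josh Christopher", "hou", 2,
  "D.J. Augustin",    "hou", 2,
  "Jalen Brunson",    "dal", 2
)

# Count cluster 2 players per team, largest counts first
cluster_members %>%
  filter(cluster == 2) %>%
  count(team, sort = TRUE)
```

In the full table, Detroit, Houston, Indiana, and Washington each place three players in this cluster, so those teams are natural trading partners.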

Defend your proposed trade using the cluster information. You may add in some basketball knowledge if you like.
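One way to build that defense is to put the per-minute profiles of the players involved side by side. A sketch with two cluster 2 guards transcribed from the table above (the object name `candidates` and the derived assist-to-turnover column are illustrative, not part of the module's code):

```r
library(tidyverse)

# Per-minute rates for two candidate guards, taken from the table above
candidates <- tribble(
  ~player,         ~pts_min, ~ast_min, ~tov_min, ~stl_min,
  "Jalen Brunson",   0.5110,   0.1505,   0.0502,   0.0251,
  "Tyus Jones",      0.4104,   0.2075,   0.0283,   0.0425
)

# A derived ratio can sharpen the argument: Jones protects the
# ball better per assist, while Brunson scores more per minute
candidates %>%
  mutate(ast_to_tov = ast_min / tov_min)
```

Comparisons like this let you argue in the cluster's own terms: both players belong to the same group, but within it they trade scoring for playmaking efficiency.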

What do you think of this process? What are the strengths and weaknesses of evaluating a team based on cluster membership?