K-Means Clustering with NBA Data

Author

Caleb Skinner, Tony Munoz, Michael P.B. Gallaugher, Rodney X. Sturdivant

Overview

Cluster analysis is a statistical tool that partitions the observations in a data set into sub-populations with similar characteristics. This process is useful because similar observations often behave and respond to stimuli in similar ways, so identifying clusters allows researchers to predict and draw conclusions about the behavior of particular groups. Cluster analysis appears in many applied fields, including risk analysis, marketing, real estate, insurance, medical research, and earthquake studies.

In this module, we’ll use the clustering of NBA players as an example. Suppose you were an NBA General Manager interested in constructing a high-quality team. The best teams use many different kinds of players to achieve their goals. Golden State Warriors guard Stephen Curry is an incredible shooter and ball-handler, but the Warriors need other kinds of players, too. A team composed entirely of Stephen Curry and his clones would struggle to defend or rebound the ball, and it would struggle to give each Stephen Curry the playing time and shots he has come to expect. Instead, General Managers can separate potential players into groups, which helps them identify their team's needs. This is where cluster analysis proves useful.

For this exercise, imagine that you are the General Manager of the Dallas Mavericks. You are tasked with creating a strong, balanced team. Later in the module, you will have an opportunity to create hypothetical trade scenarios that could benefit the team.

Getting Started

Required Packages

We will be using the following packages in this module. Take the time now to make sure these packages are installed and loaded on your computer.

library("parameters")
library("factoextra")
library("NbClust")
library("cluster")
library("formatR")
library("tidyverse"); theme_set(theme_minimal())
library("ClusterR")
library("mclust")
library("easystats")
library("here")
library("knitr")
library("kableExtra")
library("condformat")
library("formattable")
library("reactablefmtr")
library("scales")
library("plotly")
library("flextable")

The Data

Our data for this exercise comes from the 2021-2022 NBA Season. This season, the Mavericks finished 4th in the Western Conference with 52 wins and 30 losses under coach Jason Kidd. They exceeded expectations and made the Western Conference Finals.

Our data includes 374 players. Each player fulfilled our requirements of appearing in at least 25 games and averaging at least 12 minutes (a complete game is 48) in those games. Because of midseason trades or acquisitions, some players appear in our data twice: they fulfilled the playing-time requirements for two different teams in the same season. The second entry for such a player is marked with a 1 following his name (i.e. Smith becomes Smith1). We’ve divided the variables into two data sets.

The first set of variables are focused on determining the influence a player has on the game. Some of these variables are the players’ minutes per game, total games played and started, points and rebounds per game, and field goal attempts per game. This will be helpful in clustering the players into groups of stars, average starters, and reserves. We’ve termed this data set “usage”. Below is a data dictionary for the first set of variables.

Variable  Explanation                             Example
Name      NBA player's first and last name        Trae Young or Trae Young1
POS       playing position                        PG (point guard), SG (shooting guard), SF (small forward), PF (power forward), C (center)
Team      abbreviation of city of player's team   atl (Atlanta), bos (Boston), etc.
GP        total games played                      46, 70, etc.
GS        total games started                     7, 56, etc.
MIN       minutes per game                        18.2, 30.2, etc.
PTS       points per game                         6.8, 14.9, etc.
AST       assists per game                        1.1, 3.5, etc.
TO        turnovers per game                      0.8, 1.7, etc.
STL       steals per game                         0.5, 1.1, etc.
OR        offensive rebounds per game             0.5, 1.4, etc.
DR        defensive rebounds per game             2.3, 4.1, etc.
BLK       blocks per game                         0.2, 0.6, etc.
PF        personal fouls per game                 1.5, 2.4, etc.
FGM       field goals made per game               2.6, 5.5, etc.
FGA       field goals attempted per game          5.4, 12.2, etc.
3PM       3-point field goals made per game       0.6, 1.9, etc.
3PA       3-point field goals attempted per game  1.9, 5.2, etc.
FTM       free throws made per game               0.8, 2.2, etc.
FTA       free throws attempted per game          1.1, 2.8, etc.
PER       player efficiency rating metric         11.74, 17.27, etc.
SC-EFF    scoring efficiency                      1.162, 1.332, etc.
SH-EFF    shooting efficiency                     0.48, 0.56, etc.

And here is a small slice of the usage data set.

Name               POS  Team  GP  GS  MIN   PTS   AST  TO   STL  OR   DR   BLK  PF   FGM  FGA   3PM  3PA  FTM  FTA  PER    SC-EFF  SH-EFF
Trae Young         PG   atl   76  76  34.9  28.4  9.7  4.0  0.9  0.7  3.1  0.1  1.7  9.4  20.3  3.1  8.0  6.6  7.3  25.48  1.396   0.54
John Collins       PF   atl   54  53  30.8  16.2  1.8  1.1  0.6  1.7  6.1  1.0  3.0  6.3  11.9  1.2  3.3  2.5  3.1  18.75  1.360   0.58
Bogdan Bogdanovic  SG   atl   63  27  29.3  15.1  3.1  1.1  1.1  0.5  3.5  0.2  2.1  5.4  12.6  2.7  7.3  1.5  1.8  15.49  1.196   0.54
De'Andre Hunter    SF   atl   53  52  29.8  13.4  1.3  1.3  0.7  0.5  2.8  0.4  2.9  4.8  10.8  1.4  3.7  2.4  3.1  10.66  1.233   0.51
Kevin Huerter      SG   atl   74  60  29.6  12.1  2.7  1.2  0.7  0.4  3.0  0.4  2.5  4.7  10.3  2.2  5.6  0.6  0.7  11.91  1.174   0.56

The second set of variables is helpful in determining a player’s role or function in the game. These variables include Field Goal Percentage, Height, and Weight. Many of the common counting statistics have been converted to per-minute values in order to isolate their frequency. These players will be divided into sub-groups like scorers, big men, and wings. We’ve termed this data set “role”. Below is a data dictionary for the second set of variables.

Variable   Explanation                               Example
Name       NBA player's first and last name          Trae Young or Trae Young1
POS        playing position                          PG (point guard), SG (shooting guard), SF (small forward), PF (power forward), C (center)
Team       abbreviation of city of player's team     atl (Atlanta), bos (Boston), etc.
Height     height in inches                          76, 81, etc.
Weight     weight in pounds                          200, 234, etc.
PTSPerMin  points per minute                         0.356, 0.515, etc.
ASTPerMin  assists per minute                        0.055, 0.133, etc.
TOPerMin   turnovers per minute                      0.036, 0.065, etc.
STLPerMin  steals per minute                         0.023, 0.038, etc.
ORPerMin   offensive rebounds per minute             0.022, 0.066, etc.
DRPerMin   defensive rebounds per minute             0.101, 0.175, etc.
BLKPerMin  blocks per minute                         0.009, 0.027, etc.
PFPerMin   fouls per minute                          0.064, 0.099, etc.
FGP        field goal percentage                     0.417, 0.496, etc.
FGMPerMin  field goals made per minute               0.131, 0.192, etc.
FGAPerMin  field goals attempted per minute          0.284, 0.419, etc.
3PP        3-point percentage                        0.306, 0.379, etc.
3PMPerMin  3-point field goals made per minute       0.029, 0.072, etc.
3PAPerMin  3-point field goals attempted per minute  0.094, 0.192, etc.
FTP        free throw percentage                     0.709, 0.842, etc.
FTMPerMin  free throws made per minute               0.039, 0.087, etc.
FTAPerMin  free throws attempted per minute          0.053, 0.112, etc.

And here is a small slice of the role data set.

Name               POS  Team  Height  Weight  PTSPerMin  ASTPerMin  TOPerMin  STLPerMin  ORPerMin  DRPerMin  BLKPerMin  PFPerMin  FGP    FGMPerMin  FGAPerMin  3PP    3PMPerMin  3PAPerMin  FTP    FTMPerMin  FTAPerMin
Trae Young         PG   atl   73      180     0.814      0.278      0.115     0.026      0.020     0.089     0.003      0.049     0.460  0.269      0.582      0.382  0.089      0.229      0.904  0.189      0.209
John Collins       PF   atl   81      235     0.526      0.058      0.036     0.019      0.055     0.198     0.032      0.097     0.526  0.205      0.386      0.364  0.039      0.107      0.793  0.081      0.101
Bogdan Bogdanovic  SG   atl   78      220     0.515      0.106      0.038     0.038      0.017     0.119     0.007      0.072     0.431  0.184      0.430      0.368  0.092      0.249      0.843  0.051      0.061
De'Andre Hunter    SF   atl   80      225     0.450      0.044      0.044     0.023      0.017     0.094     0.013      0.097     0.442  0.161      0.362      0.379  0.047      0.124      0.765  0.081      0.104
Kevin Huerter      SG   atl   79      190     0.409      0.091      0.041     0.024      0.014     0.101     0.014      0.084     0.454  0.159      0.348      0.389  0.074      0.189      0.808  0.020      0.024

Part 1: The Idea of Similarity/Distance - Interactive

Below is a set of ten Dallas Mavericks players from 2021-2022 that met our playing-time restrictions. Kristaps Porzingis was traded in the middle of the season, but he still met our playing-time qualifications for the Dallas Mavericks. For this example, we’ve combined a few of the variables from both the usage and role data sets. Consider the players Sterling Brown, Maxi Kleber, Dwight Powell, and Josh Green.

Name                 Height  Weight  MIN   PTS   OR   DR   AST  STL  BLK  TO   2PA   2PP    3PA  3PP    3PAPerMin  ORPerMin
Luka Doncic          79      230     35.4  28.4  0.9  8.3  8.7  1.2  0.6  4.5  12.8  0.528  8.8  0.353  0.249      0.025
Kristaps Porzingis   87      240     29.5  19.2  1.9  5.8  2.0  0.7  1.7  1.6   9.9  0.537  5.1  0.283  0.173      0.064
Jalen Brunson        73      190     31.9  16.3  0.5  3.4  4.8  0.8  0.0  1.6   9.6  0.545  3.2  0.373  0.100      0.016
Tim Hardaway Jr.     77      205     29.6  14.2  0.3  3.4  2.2  0.9  0.1  0.8   5.4  0.473  7.2  0.336  0.243      0.010
Dorian Finney-Smith  79      220     33.1  11.0  1.5  3.2  1.9  1.1  0.5  1.0   3.2  0.599  5.4  0.395  0.163      0.045
Dwight Powell        82      240     21.9   8.7  2.1  2.8  1.2  0.5  0.5  0.8   4.4  0.703  0.5  0.351  0.023      0.096
Reggie Bullock       78      205     28.0   8.6  0.5  3.1  1.2  0.6  0.2  0.6   1.6  0.550  5.8  0.360  0.207      0.018
Maxi Kleber          82      240     24.6   7.0  1.2  4.7  1.2  0.5  1.0  0.8   1.7  0.586  4.3  0.325  0.175      0.049
Josh Green           77      200     15.5   4.8  0.8  1.6  1.2  0.7  0.2  0.7   2.7  0.573  1.2  0.359  0.077      0.052
Sterling Brown       77      219     12.8   3.3  0.5  2.5  0.7  0.3  0.1  0.5   1.3  0.492  1.9  0.304  0.148      0.039

Exercise 1

  1. For these four players, compare their available statistics.

  2. Which of the four players are most similar kinds of players? Which variables make them similar?

  3. Which variables do they most differ? Which of the four players are the most “different”? Which variables differentiate them the most? Are they similar in any of the categories?

One common and effective way to measure the similarity of two points (or, in this case, players) is the Euclidean distance. In two dimensions, the distance between points \((x_1, y_1)\) and \((x_2, y_2)\) is:

\(d = \sqrt{(x_{2} - x_{1})^{2} + (y_{2} - y_{1})^{2}}\)

You can visualize this as drawing the shortest possible line between two points and then measuring it. Right now, our variables are in different units (inches, pounds, points, percentages, etc.), so we’ll standardize each variable (more on this later) to put them on a common scale. This gives each variable equal weight in the distance calculation.

Below is a table of the distances between each of the players. Match up the player in the column with the player in the row and you’ll find the distance between them. The smaller the value, the more similar the players are.

               Dwight Powell Maxi Kleber Josh Green
Maxi Kleber         4.269475                       
Josh Green          4.554980    4.270914           
Sterling Brown      5.846063    3.940473   3.102775
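A distance matrix like the one above can be reproduced with base R's scale() and dist(). The sketch below uses only three of the variables from the Mavericks table (MIN, PTS, and OR per game), so the resulting distances are illustrative and will not match the matrix above, which was computed on the full set of standardized variables.

```r
# Minutes, points, and offensive rebounds per game for four Mavericks
# (values from the table above); only 3 variables, for illustration.
stats <- rbind(
  "Dwight Powell"  = c(MIN = 21.9, PTS = 8.7, OR = 2.1),
  "Maxi Kleber"    = c(MIN = 24.6, PTS = 7.0, OR = 1.2),
  "Josh Green"     = c(MIN = 15.5, PTS = 4.8, OR = 0.8),
  "Sterling Brown" = c(MIN = 12.8, PTS = 3.3, OR = 0.5)
)

# Standardize each column to mean 0 and sd 1 so the units are comparable,
# then compute the pairwise Euclidean distances.
scaled <- scale(stats)
Distance <- dist(scaled)
round(Distance, 3)
```

The dist object returned here is the same kind of object that fviz_dist() visualizes below.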

Below is a visualization of the distances. As the distances increase, the color changes from red to blue. Players matched with themselves will be dark red, because their distance is 0.

fviz_dist(Distance, gradient = list(low = "indianred3",mid = "white", high = "dodgerblue3"))

Exercise 2

  1. Do the tabulated results agree with your previous assessment?

  2. Which is more accurate: your original assessment or the similarity metric?

Part 2: Performing a Cluster Analysis

Calculating the distance between points is the first step in a distance-based cluster analysis. The players with the smallest distance (or with the most similarity) between them are naturally placed in a cluster together.

How does the clustering actually work? As an illustration, we’ll use a basic plot of the Offensive Rebounds and 3-Point Shooting of our Dallas Mavericks players. Both statistics have been adjusted to per-minute values to account for differences in playing time.

Exercise 3

  1. What do you notice about the data? How would you group the players?

  2. How would you describe these groupings?

  3. In a cluster analysis, every point needs to belong to a cluster. Do any points not seem to have a cluster?

Cluster analysis is the process of partitioning the data into sub-populations or clusters. This is done so that observations in the same cluster are more similar to each other than observations in a different group. These clusters then can be analyzed.

One common, distance-based method for dividing the data into clusters is the K-Means algorithm. K-Means is unsupervised: the clusters are found by the algorithm rather than predetermined by the researcher. In the NBA example, we cannot determine our clusters beforehand; the algorithm may confirm our original intuition, but this is not guaranteed.

The K-Means algorithm assigns observations to clusters so that the sum of squared distances between each observation and the center (or mean) of its cluster is minimized. At the end, the variance of all the points within each cluster is as small as possible. One downside of the K-Means algorithm is that users must specify the number of clusters in advance; this is the parameter K. Let’s say we want to separate our data into K = 2 clusters. The K-Means algorithm goes through four basic steps:

  1. Randomly select two initial cluster centers.
  2. Assign each observation to the closest center.
  3. Calculate the mean of all the observations within each cluster. These cluster means become the new center of each cluster.
  4. Repeat steps 2-3 until no further changes are made.

As these steps are followed, the clusters will move closer and closer to their final positions. Since the first step is to randomly assign cluster centers, the K-Means approach can occasionally yield different results. It’s worth trying it a few different times with different starting points.
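The four steps above can be sketched from scratch in a few lines of R. This is a minimal illustration of the algorithm on made-up two-dimensional data; in practice you would use the built-in kmeans() function, as we do below.

```r
# A from-scratch sketch of the four K-Means steps (Lloyd's algorithm).
simple_kmeans <- function(x, k, iter_max = 100) {
  x <- as.matrix(x)
  # Step 1: randomly pick k observations as the initial centers.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assign <- rep(0L, nrow(x))
  for (i in seq_len(iter_max)) {
    # Step 2: assign each observation to its closest center.
    d <- as.matrix(dist(rbind(centers, x)))[-seq_len(k), seq_len(k), drop = FALSE]
    new_assign <- max.col(-d)
    # Step 4: stop when no assignment changes.
    if (all(new_assign == assign)) break
    assign <- new_assign
    # Step 3: each cluster's mean becomes its new center.
    for (j in seq_len(k)) {
      centers[j, ] <- colMeans(x[assign == j, , drop = FALSE])
    }
  }
  list(cluster = assign, centers = centers)
}

# Two made-up blobs of 20 points each, then cluster them with k = 2.
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Because the initial centers are random, the built-in kmeans() takes an nstart argument that reruns the algorithm from several starting points and keeps the best result.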

Before you look below, provide your estimation of the two clusters of our Dallas Mavericks players. Where would you anticipate the cluster centers to be located?

The code below runs the k-means algorithm. In the kmeans function, the first argument is the data, the second is the number of clusters to be fit (i.e. \(k\)) and nstart is the number of random starting points to use for the algorithm.

set.seed(321)
dallasKMeans_prep <- dallas %>%
  select(Name, `3PAPerMin`, ORPerMin) %>%
  column_to_rownames(var = "Name")

dallas2Means <- kmeans(dallasKMeans_prep, centers = 2, nstart = 50)

dallas2fviz <- fviz_cluster(dallas2Means, dallasKMeans_prep,
                            show.clust.cent = TRUE, stand = FALSE,
                            labelsize = 7, pointsize = 1,
                            main = "Mavericks K = 2 Clusters",
                            xlab = "3 Point Attempts Per Minute",
                            ylab = "Offensive Rebounds Per Minute")
dallas2fviz

Exercise 4

  1. Is this how you would have grouped the players?

  2. Notice the large points in the middle of each cluster. These are the cluster centers. Are they where you expected?

  3. How do you think the groupings will change with three clusters?

We can easily tell K-Means to randomly assign three centers instead, and the process of assigning points to cluster means will continue exactly as before.

set.seed(3)
dallas3Means <- kmeans(dallasKMeans_prep, centers = 3, nstart = 50)

dallas3fviz <- fviz_cluster(dallas3Means, dallasKMeans_prep,
                            show.clust.cent = TRUE, stand = FALSE,
                            labelsize = 7, pointsize = 1,
                            main = "Mavericks K = 3 Clusters",
                            xlab = "3 Point Attempts Per Minute",
                            ylab = "Offensive Rebounds Per Minute")
dallas3fviz

Or four clusters?

set.seed(22329)
dallas4Means <- kmeans(dallasKMeans_prep, centers = 4, nstart = 50)

dallas4fviz <- fviz_cluster(dallas4Means, dallasKMeans_prep,
                            show.clust.cent = TRUE, stand = FALSE,
                            labelsize = 7, pointsize = 1,
                            main = "Mavericks K = 4 Clusters",
                            xlab = "3 Point Attempts Per Minute",
                            ylab = "Offensive Rebounds Per Minute")
dallas4fviz

Exercise 5

  1. What happens to Dwight Powell when we increase K to 4?

  2. Would Dwight be considered an outlier? Why? Is this helpful from a clustering perspective?

Now consider five clusters.

set.seed(102)
dallas5Means <- kmeans(dallasKMeans_prep, centers = 5, nstart = 50)

dallas5fviz <- fviz_cluster(dallas5Means, dallasKMeans_prep,
                            show.clust.cent = TRUE, stand = FALSE,
                            labelsize = 7, pointsize = 1,
                            main = "Mavericks K = 5 Clusters",
                            xlab = "3 Point Attempts Per Minute",
                            ylab = "Offensive Rebounds Per Minute")
dallas5fviz

At some point, the power of clustering the points begins to fade. Does Dwight Powell deserve to be in a cluster of his own? Possibly. Does Reggie Bullock? Definitely not.

Exercise 6

  1. Which of the four values of K did you find most useful or accurate?

  2. Were there ever too few or too many clusters?

Part 3: Choosing the Number of Clusters

So, how can we choose the optimal number of clusters?

It’s helpful to evaluate the effectiveness of the clusters for each value of K. There are plenty of ways to test this effectiveness, but we’ll walk through a common one called the Elbow Method. The Elbow Method totals the squared distances between each observation and the center of its cluster; this total is called the Total Within-Cluster Sum of Squares (TWSS). As K increases and more clusters are added to the model, the TWSS decreases, but eventually each additional cluster adds less and less value. The Elbow Method plots TWSS against K, and the user looks for the point where increasing the number of clusters no longer proves useful. Often, this point looks like an elbow.

fviz_nbclust(dallasKMeans_prep, kmeans, method = "wss", k.max = 9) +
  theme_minimal() +
  labs(title = "The Elbow Method")

The graph demonstrates that the value of each additional cluster decreases as more clusters are added, and the bend in the curve suggests that clusters beyond four add little. Despite being common, the Elbow Method is often ambiguous and difficult to interpret: looking for the bend in this plot, K = 2, K = 3, and K = 4 all seem like reasonable choices.
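The elbow curve can also be built by hand: run kmeans() once for each candidate K and record its tot.withinss component, which is exactly the TWSS described above. The data here is made up for illustration.

```r
# Two made-up blobs of 30 points each.
set.seed(42)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 5), ncol = 2))

# TWSS for K = 1..9; nstart = 25 guards against bad random starts.
wss <- sapply(1:9, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

plot(1:9, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares (TWSS)")
```

Since the toy data has two true groups, this curve drops sharply from K = 1 to K = 2 and then flattens, which is the elbow shape described above.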

The Elbow plot is just one test to determine the optimal number of clusters. Two other popular methods are the Average Silhouette Method and the Gap Statistic Method. In all, there are dozens of methods to determine the ideal number of clusters and they often disagree. We’ll take a consensus of 27 methods and proceed from there.

dallasClust <- n_clusters(dallasKMeans_prep,
                          package = c("easystats", "NbClust"),
                          standardize = FALSE, n_max = 5)

plot(dallasClust)

The tests give varied estimates for the optimal number of clusters, but it is ultimately up to the user to decide how many clusters to include in the K-Means algorithm. It’s common practice to choose several values and compare the results of each.

From there, we would conduct our analysis of each cluster and examine the results.

After the clustering is completed, how can we analyze our clustering solution?

We want to reduce the Total Within-Cluster Sum of Squares (TWSS), the distance from each observation to its cluster mean, but we also want to keep the total number of clusters small.

Two helpful measurements to summarize these preferences for our clusters are intra-class similarity and inter-class similarity.

Intra-class similarity tests the relationship between observations of the same cluster. We want this similarity to be high. We want all the observations in a cluster to exhibit similar features.

Inter-class similarity tests the relationship between different clusters. We want this relationship to be low. Ideally, each cluster is distinct and the observations within can be clearly assigned to a cluster.

As we increase the number of clusters K, the intra-class similarity will increase, because observations are assigned to smaller clusters that are more representative. However, the inter-class similarity will also increase, because the cluster centers are now closer together. This is why it is impractical to choose a large value of K.
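One common statistic that balances these two notions is the silhouette width, which compares each observation's average distance to its own cluster (intra-class) against its average distance to the nearest other cluster (inter-class). The cluster package's silhouette() function computes this; the from-scratch sketch below, on made-up data, shows the idea.

```r
# Average silhouette width computed by hand: for each point, a = mean
# distance to its own cluster, b = mean distance to the nearest other
# cluster, and the silhouette is (b - a) / max(a, b).
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  s <- sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    a <- mean(d[i, own & seq_len(nrow(d)) != i])       # within own cluster
    b <- min(tapply(d[i, !own], cluster[!own], mean))  # nearest other cluster
    (b - a) / max(a, b)
  })
  mean(s)
}

# Two tight, well-separated made-up blobs: silhouette should be near 1.
set.seed(7)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.5), ncol = 2))
fit <- kmeans(toy, centers = 2, nstart = 25)
avg_silhouette(toy, fit$cluster)
```

Values near 1 mean high intra-class and low inter-class similarity; values near 0 mean the clusters overlap.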

Recall our clustering for the Dallas Mavericks players.

dallas2fviz
dallas3fviz +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank())
dallas4fviz +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank())

Exercise 7

  1. Which value of K has the highest intra-class similarity?

  2. Which cluster specifically?

  3. Which value of K has the highest inter-class similarity?

Part 4: A Larger Dataset

Let’s focus now on our larger data set with many more variables and observations. It seems like this would be more complicated, but the process is almost exactly the same. One important distinction to remember is that the large number of dimensions makes the data difficult to visualize; there are methods that aid in this visualization. We’ll walk you through the usage data set and demonstrate appropriate analysis, and then allow you to work through the role data set.

Remember the usage data set? It contains variables aimed at categorizing the workload and skill of the players. We hope to divide players into sub-groups like stars and bench players.

It is very important that we standardize the data first. Many of our variables have different units: Games Played and Blocks per game are hard to compare without scaling, and without standardizing, variables with large values, like Games Started or Games Played, would exert too much influence on the clustering. After standardizing, each value is described relative to the other observations. Trae Young’s standardized assist value is 3.656, so we know he records far more assists than the average player in our data set. Standardized data is often difficult to contextualize, so we’ll convert it back to the original scale for analysis. Below is a small glimpse into what our standardized data looks like.

Name               POS  Team  GP      GS      MIN    PTS    AST     TO      STL     OR      DR      BLK     PF      FGM    FGA    3PM     3PA     FTM     FTA     PER     SC-EFF  SH-EFF
Trae Young         PG   atl    1.212   1.690  1.478  2.819   3.656   3.186   0.346  -0.422  -0.191  -0.961  -0.433  2.418  2.446   2.049   1.883   3.457   2.948   2.481   0.910   0.085
John Collins       PF   atl   -0.221   0.820  0.903  0.802  -0.386  -0.291  -0.479   0.857   1.514   1.296   1.704  0.984  0.623  -0.071  -0.116   0.540   0.502   0.924   0.676   0.797
Bogdan Bogdanovic  SG   atl    0.365  -0.163  0.692  0.620   0.279  -0.291   0.896  -0.678   0.036  -0.710   0.224  0.568  0.775   1.603   1.585  -0.172  -0.255   0.170  -0.389   0.085
De'Andre Hunter    SF   atl   -0.286   0.782  0.762  0.339  -0.642  -0.052  -0.204  -0.678  -0.362  -0.209   1.539  0.290  0.385   0.152   0.054   0.469   0.502  -0.948  -0.149  -0.449
Kevin Huerter      SG   atl    1.082   1.085  0.734  0.124   0.074  -0.172  -0.204  -0.806  -0.248  -0.209   0.882  0.244  0.276   1.045   0.862  -0.812  -0.895  -0.658  -0.532   0.441
Let’s begin by taking a look at the Elbow plot of the usage dataset.

usage_rm <- usage %>%
  select(-Name, -POS, -Team) %>%
  mutate(across(where(is.numeric), standardize))

# the same standardized data, with player names kept as rownames for kmeans
usageKMeans_prep <- usage %>%
  select(-POS, -Team) %>%
  mutate(across(where(is.numeric), standardize)) %>%
  column_to_rownames(var = "Name")

fviz_nbclust(usage_rm, kmeans, method = "wss", k.max = 24) +
  theme_minimal() +
  labs(title = "The Elbow Method")

The Elbow plot shows that the algorithm experiences diminishing returns after K = 2 or K = 3. From the Elbow plot, we would expect the consensus to lie somewhere between 2 and 5 clusters. Now consider the multiple methods for selecting K.

The tests favor three clusters. Some tests also prefer two and four clusters, so those models are worth a look.

set.seed(121)
usage2Means <- kmeans(usageKMeans_prep, centers = 2, nstart = 50)
set.seed(4)
usage3Means <- kmeans(usageKMeans_prep, centers = 3, nstart = 50)
set.seed(1210)
usage4Means <- kmeans(usageKMeans_prep, centers = 4, nstart = 50)

K = 2 Clusters

Let’s start simple and begin with K = 2 clusters.

But before we begin, let’s first look through the variables in our analysis and see which ones have the most influence on the clustering. If some have little or no influence, we can simplify our analysis by removing them.

The visualization below demonstrates the differences between our two clusters. The variables that have large differences are important in the clustering assignment. They greatly influence the assignment of an observation.

as_tibble(usage2Means$centers, rownames = "cluster") %>%
  pivot_longer(cols = c(GP:`SH-EFF`), names_to = "variable") %>%
  group_by(variable) %>%
  summarise(Influence = abs(mean(value))) %>%
  mutate(
    variable = factor(variable, levels = usage_levels)
  ) %>% 
  ggplot(aes(x = variable, y = Influence)) +
  geom_bar(stat = "identity", fill = "cadetblue3") +
  labs(title = "Influence on Cluster Assignment", x = "", y = "") +
  theme(axis.text.y = element_blank(),
        legend.position = "none",
        axis.text.x = element_text(angle = -45, size = 9))

This type of exercise is essential in cluster analysis, because it shows which variables matter most when classifying an observation.

This visualization scales the centers of the variables for each cluster and contrasts them. Variables with large positive or negative values have a large influence on the clustering. These variables help differentiate the cluster. Variables with an influence close to 0 have less importance.

We see a great diversity in the variables that possess significant influence on the clustering.

Exercise 8

  1. Which variables seem to contribute the most to the clustering result?

  2. Which variables contribute the least to the clustering result?

Scoring Efficiency and Shooting Efficiency both lack influence, and Games Played, Offensive Rebounds, and Blocks also contribute little to the clustering. We chose to remove only Shooting Efficiency and keep the other four, though we could just as easily have removed all five.


set.seed(121)
usage2Means <- usageKMeans_prep %>%
  select(-`SH-EFF`) %>%
  kmeans(centers = 2, nstart = 50)

usage2 <- usage %>% select(-`SH-EFF`)
usage_rm2 <- usage_rm %>% select(-`SH-EFF`)

Now that we’ve removed a variable, let’s see how many observations are within each cluster.

Cluster  Size
1         119
2         255

The clusters differ noticeably in size, enough that we should keep an eye on it. It’s important to verify that each cluster contains a meaningful number of observations, although, as we saw with Dwight Powell earlier, small clusters can sometimes tell us valuable information about the observations they contain.

The K-Means algorithm assigns each observation to a cluster, and we can then compute descriptive statistics for each cluster. This gives us a good idea of what makes up each cluster. Here we have converted the values back to the original, unstandardized units.

usage2centers <- as_tibble(usage2Means$cluster) %>%
  mutate(Name = usage$Name) %>%
  rename(Clusters = value) %>%
  left_join(usage2, by = "Name") %>%
  group_by(Clusters) %>%
  summarise(across(where(is.numeric), mean)) %>%
  mutate(across(where(is.numeric), \(x) round(x, digits = 3)))
usage2centers %>% flextable() %>% align(align = "center", part = "all") %>%
  width(j = c(2:15), width = .5)

Clusters  GP      GS      MIN     PTS     AST    TO     STL    OR     DR     BLK    PF     FGM    FGA     3PM    3PA    FTM    FTA    PER     SC-EFF
1         60.689  56.706  32.572  18.656  4.417  2.266  1.035  1.141  4.845  0.582  2.337  6.773  14.592  1.926  5.386  3.178  3.969  17.824  1.281
2         55.847  19.455  20.540   7.937  1.685  0.913  0.652  0.978  2.779  0.437  1.789  2.958   6.429  0.954  2.728  1.071  1.429  13.324  1.244

Generally, it looks like cluster 1 contains starter-caliber players and cluster 2 contains the bench players. This helps to explain why cluster 1 is a bit smaller than cluster 2.

Now, let’s look at the clusters graphically. This can help us see how different the clusters really are from each other. The graph is created by projecting the values of all the variables onto two dimensions in a visually understandable way, through a process called Principal Component Analysis (PCA).
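fviz_cluster() obtains its two plotting axes by projecting the data onto the first two principal components. A base R sketch of that projection, using prcomp() on made-up data, looks like this:

```r
# 40 made-up observations with 5 variables, standing in for the real data.
set.seed(11)
toy <- matrix(rnorm(200), ncol = 5)

# PCA: rotate the data so the first axes capture the most variance.
pca <- prcomp(toy, scale. = TRUE)

# The first two principal component scores are the 2-D plot coordinates.
scores <- pca$x[, 1:2]

# Proportion of total variance explained by each of the first two components.
summary(pca)$importance[2, 1:2]
```

The first two components capture the largest share of the variance, which is why a two-dimensional PCA plot is a reasonable summary of a clustering performed on many variables.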

usage2fviz <- fviz_cluster(usage2Means, select(usageKMeans_prep, -`SH-EFF`),
                           geom = "point",
                           show.clust.cent = TRUE, stand = FALSE,
                           pointsize = 1,
                           main = "Usage K = 2 Clusters")
usage2fviz

Many of the observations in both clusters lie close to the border. This indicates that the division between the clusters was close, and some observations could plausibly have been placed in either cluster. The centers are fairly close together, located at roughly (-3, 0) and (2, 0).

There are several large outliers, especially in the lower portion of the visualization in both clusters and in the left portion of cluster 1.

Prototypes

To help us understand the clusters better, let’s look at some players that fall very close to the cluster center. We’ll call the players that represent the cluster well prototype players.

usage2Means_scale <- as_tibble(usage2Means$centers) %>%
  mutate(cluster = 1:2)


usage_fitted2Means <- usage2Means$cluster %>%
  as_tibble() %>%
  rename(cluster = value) %>%
  left_join(usage2Means_scale, by = "cluster") %>%
  select(-cluster)
distances <- sqrt(rowSums((usage_rm2 - usage_fitted2Means)^ 2)) %>%
  as_tibble() %>%
  rename(distance = value) %>% 
  mutate(
    Name = usage$Name,
    Cluster = usage2Means$cluster
  )
dist_slice1 <- distances %>%
  arrange(distance) %>%
  select(Name, Cluster, distance) %>%
  filter(Cluster == 1) %>% slice(1:3)

dist_slice1 %>%
  mutate(distance = round(distance, digits = 4)) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3)

Name             Cluster  distance
Khris Middleton  1        1.8408
Miles Bridges    1        1.9950
Gordon Hayward   1        2.2011

Exercise 9

  1. Which player is closest to the center for Cluster 1?

  2. Are there other players who are close to the center for Cluster 1 that could also be considered prototypes?

  3. Look at the prototype players’ statistics to see how we might characterize Cluster 1.


prototype_k2c1 <- dist_slice1 %>%
  select(Name) %>%
  left_join(usage2, by = "Name")
prototype_k2c1 %>% flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:15), width = .5)

| Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Khris Middleton | SF | mil | 66 | 66 | 32.4 | 20.1 | 5.4 | 2.9 | 1.2 | 0.6 | 4.8 | 0.3 | 2.4 | 6.8 | 15.5 | 2.5 | 6.6 | 3.9 | 4.4 | 18.19 | 1.298 |
| Miles Bridges | SF | cha | 80 | 80 | 35.5 | 20.2 | 3.8 | 1.9 | 0.9 | 1.1 | 5.9 | 0.8 | 2.4 | 7.5 | 15.2 | 1.9 | 5.8 | 3.3 | 4.2 | 17.97 | 1.329 |
| Gordon Hayward | SF | cha | 49 | 48 | 31.9 | 15.9 | 3.6 | 1.7 | 1.0 | 0.8 | 3.8 | 0.4 | 1.7 | 5.8 | 12.6 | 1.8 | 4.5 | 2.6 | 3.0 | 15.11 | 1.261 |

Consider Khris Middleton, Miles Bridges, and Gordon Hayward. All three play a similar position, one that allows them to contribute in all areas of the game. Their Games Played totals vary, but they started nearly every game and received a lot of playing time: each averaged over 30 Minutes and roughly 16-20 Points per game. Their Rebound, Assist, Block, and Turnover numbers vary a little, but all are fairly high. They also attempted and made a similar number of shots per game (12.6-15.5 FGA and 5.8-7.5 FGM).

Let’s move on to cluster 2. First, notice how much smaller the distances are from the cluster 2 center. More observations lie close to cluster 2’s center than to cluster 1’s. This is not entirely surprising, as cluster 2 contains almost 100 more players than cluster 1.

Again consider potential prototypes for the second cluster.

dist_slice2 <- distances %>% arrange(distance) %>% select(Name, Cluster, distance) %>% filter(Cluster == 2) %>% slice(1:3)

dist_slice2 %>%
  mutate(distance = round(distance, digits = 4)) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3)

| Name | Cluster | distance |
|---|---|---|
| Blake Griffin | 2 | 1.1953 |
| Torrey Craig | 2 | 1.2661 |
| Rudy Gay | 2 | 1.2779 |

Blake Griffin is our prototype player for cluster 2. Torrey Craig and Rudy Gay are also strong representatives of cluster 2.

prototype_k2c2 <- dist_slice2 %>% select(Name) %>% left_join(usage2)
Joining with `by = join_by(Name)`
prototype_k2c2 %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:15), width = .5)

| Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Blake Griffin | PF | bkn | 56 | 24 | 17.1 | 6.4 | 1.9 | 0.6 | 0.5 | 1.1 | 3.0 | 0.3 | 1.7 | 2.4 | 5.6 | 0.7 | 2.6 | 1.0 | 1.4 | 13.77 | 1.147 |
| Torrey Craig | SF | ind | 51 | 14 | 20.3 | 6.5 | 1.1 | 0.8 | 0.5 | 1.2 | 2.7 | 0.4 | 1.9 | 2.5 | 5.5 | 0.9 | 2.7 | 0.5 | 0.7 | 10.82 | 1.171 |
| Rudy Gay | SF | utah | 55 | 1 | 18.9 | 8.1 | 1.0 | 0.9 | 0.5 | 1.0 | 3.4 | 0.3 | 1.7 | 2.9 | 6.9 | 1.3 | 3.7 | 1.1 | 1.4 | 13.06 | 1.177 |

Once again, the prototypes look like average NBA players. They each played around 55 Games but started comparatively few of them. They averaged about 17-20 Minutes and 6.4-8.1 Points per game. Their Rebound, Assist, Steal, Block, Turnover, and Foul values are fairly low and close together. They also don’t shoot as much as cluster 1: only about 6 Field Goal Attempts per game.

Outliers

Now, let’s look through some of the players that fall farthest from the center of their cluster. These players are cluster outliers. In these cases, the clustering least represents the observation. These players are very different from the center. It can be helpful to identify and explain outliers by comparing them to our prototype players. How do they differ? What attributes led to their classification?


dist_slice3 <- distances %>% arrange(desc(distance)) %>% select(Name, Cluster, distance) %>% filter(Cluster == 1) %>% slice(1:2,5)

dist_slice3 %>%
  mutate(distance = round(distance, digits = 4)) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3)

| Name | Cluster | distance |
|---|---|---|
| Rudy Gobert | 1 | 9.1597 |
| Joel Embiid | 1 | 8.8545 |
| Myles Turner | 1 | 6.6678 |

outlier_k2c1 <- dist_slice3 %>% select(Name) %>%
  add_row(Name = "Khris Middleton") %>% add_row(Name = "Blake Griffin") %>% # want to add centers of clusters for reference
  left_join(usage2) %>% arrange(desc(DR))
Joining with `by = join_by(Name)`
outlier_k2c1 %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:15), width = .5)

| Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rudy Gobert | C | utah | 66 | 66 | 32.1 | 15.6 | 1.1 | 1.8 | 0.7 | 3.7 | 11.0 | 2.1 | 2.7 | 5.5 | 7.7 | 0.0 | 0.1 | 4.6 | 6.7 | 24.76 | 2.022 |
| Joel Embiid | C | phi | 68 | 68 | 33.8 | 30.6 | 4.2 | 3.1 | 1.1 | 2.1 | 9.6 | 1.5 | 2.7 | 9.8 | 19.6 | 1.4 | 3.7 | 9.6 | 11.8 | 31.24 | 1.558 |
| Myles Turner | C | ind | 42 | 42 | 29.4 | 12.9 | 1.0 | 1.3 | 0.7 | 1.5 | 5.5 | 2.8 | 2.8 | 4.8 | 9.4 | 1.5 | 4.4 | 1.9 | 2.5 | 17.45 | 1.374 |
| Khris Middleton | SF | mil | 66 | 66 | 32.4 | 20.1 | 5.4 | 2.9 | 1.2 | 0.6 | 4.8 | 0.3 | 2.4 | 6.8 | 15.5 | 2.5 | 6.6 | 3.9 | 4.4 | 18.19 | 1.298 |
| Blake Griffin | PF | bkn | 56 | 24 | 17.1 | 6.4 | 1.9 | 0.6 | 0.5 | 1.1 | 3.0 | 0.3 | 1.7 | 2.4 | 5.6 | 0.7 | 2.6 | 1.0 | 1.4 | 13.77 | 1.147 |

Sometimes, you’ll need to do some digging on the outliers. We chose to show you Khris Middleton and Blake Griffin’s characteristics again for comparison. Rudy Gobert, Joel Embiid, and Myles Turner represent two very different kinds of outliers. Embiid is a superstar who finished second in MVP voting in the 2021-2022 season, and Gobert is a three-time Defensive Player of the Year who dominates the glass. Both are very far from the prototype of cluster 1, but they are even further from the prototype of cluster 2. These are the points near (-10, -5) in the visualization.

Myles Turner, however, possesses attributes of both cluster 1 and cluster 2. He played lots of Minutes, started most games, and posted strong Rebounding values. However, his shooting numbers fall right between the clusters, and he doesn’t tally many Points, Assists, Steals, or Turnovers. His point is likely the (-5, -9) outlier in the visualization. He is a borderline case.

dist_slice4 <- distances %>% arrange(desc(distance)) %>% select(Name, Cluster, distance) %>% filter(Cluster == 2) %>% slice(1:3)

dist_slice4 %>%
  mutate(distance = round(distance, digits = 4)) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3)

| Name | Cluster | distance |
|---|---|---|
| Robert Williams III | 2 | 7.4501 |
| Mitchell Robinson | 2 | 7.3269 |
| Clint Capela | 2 | 6.4898 |

outlier_k2c2 <- dist_slice4 %>% select(Name) %>% left_join(usage2)
Joining with `by = join_by(Name)`
outlier_k2c2 %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:15), width = .5)

| Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Robert Williams III | C | bos | 61 | 61 | 29.6 | 10.0 | 2.0 | 1.0 | 0.9 | 3.9 | 5.7 | 2.2 | 2.2 | 4.4 | 6.0 | 0 | 0 | 1.1 | 1.5 | 22.10 | 1.649 |
| Mitchell Robinson | C | ny | 72 | 62 | 25.7 | 8.5 | 0.5 | 0.8 | 0.8 | 4.1 | 4.5 | 1.8 | 2.7 | 3.6 | 4.8 | 0 | 0 | 1.2 | 2.5 | 20.78 | 1.778 |
| Clint Capela | C | atl | 74 | 73 | 27.6 | 11.1 | 1.2 | 0.6 | 0.7 | 3.8 | 8.1 | 1.3 | 2.2 | 5.0 | 8.2 | 0 | 0 | 1.1 | 2.3 | 21.43 | 1.358 |

These cluster 2 outliers are all similar players. Robert Williams III, Mitchell Robinson, and Clint Capela are all big men. Like Myles Turner, they play a lot of Games and Minutes and grab lots of Rebounds and Blocks, but they don’t shoot very much. Our data emphasizes shooting, and perhaps this leaves players like these without an appropriate cluster. They are borderline candidates who might benefit from an additional cluster.


Now, let’s analyze the strength of K = 2 clusters. For reference, we’ve repeated the visualization below.

usage2fviz

The two clusters possess strong inter-class differences. For only two clusters, cluster 1 and cluster 2 are fairly distinct. The centers are far apart and demonstrate two different classifications of players. Cluster 1 is clearly a sub-population of starting, high-volume players and cluster 2 is a sub-population of bench players. Still, we’ve analyzed the outliers and found some players that could fall in either cluster. There could be some confusion for players like Robert Williams and Myles Turner. These players seem more similar to each other than most of the players in their own cluster. These outliers fall around (-2, -7). Check the visualizations again to see the cluster of players near there.

The intra-class similarity is fairly low. The clusters are large and contain outliers in every direction. Players like Giannis Antetokounmpo, Khris Middleton, and Myles Turner have little in common, yet they are all grouped into cluster 1. Still, most players in cluster 1 post larger values, and most players in cluster 2 post smaller ones.
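One standard way to put a single number on this intra-class/inter-class trade-off is the average silhouette width from the cluster package (one of the packages loaded at the start of the module). The sketch below uses simulated data rather than the usage data set:

```r
library(cluster)  # provides silhouette()

# simulated data: two clearly separated groups of 20 points
set.seed(2)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 25)

# for each point: (b - a) / max(a, b), where a = mean distance to its own
# cluster and b = mean distance to the nearest other cluster;
# values near 1 indicate tight, well-separated clusters
sil <- silhouette(km$cluster, dist(x))
mean(sil[, "sil_width"])
```

A low average silhouette width on the real usage data would confirm the visual impression of weak intra-class similarity.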

K = 3 Clusters - Interactive

# keep in case of reset
set.seed(4)
usage3Means <- kmeans(usageKMeans_prep, centers = 3, nstart = 50)

Now, let’s look at the consensus tests’ most popular number of clusters: K = 3. Here, we’d like you to produce your own analysis of the results. If you need help, look back at the K = 2 example.

As you progress, fill out this table with descriptors of the three clusters. This will be helpful for you as you try to identify their distinctions.

stu_table <- tibble(
  Cluster = 1:3,
  Description = "")

stu_table %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 2, width = 4)

| Cluster | Description |
|---|---|
| 1 | |
| 2 | |
| 3 | |

Once again, let’s first look through the variables in our analysis and see which ones have the most influence on the clustering.

This visualization plots the centers for each variable in a cluster. At a glance, this helps us to understand the characteristics of each cluster. We can see that cluster 2, for example, has high offensive rebounds and blocks per game, but low 3 point attempts and 3 point makes.

It can also tell us which variables are unimportant. If a variable has a similar mean across all three clusters, it does not help us distinguish between the clusters. If a variable has a large positive value in one cluster and a large negative value in another, then that variable is very useful for classifying our data.
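That rule of thumb can also be checked numerically. This is a sketch on simulated data (on the real data, `usage3Means$centers` would play the same role): the range of a variable’s standardized centers across clusters measures how strongly it separates them.

```r
# simulated standardized data: 'a' separates the groups, 'b' is pure noise
set.seed(3)
x <- data.frame(a = c(rnorm(15, -2), rnorm(15, 2)),
                b = rnorm(30))
km <- kmeans(scale(x), centers = 2, nstart = 25)

# spread of each variable's center values across clusters;
# a range near 0 marks a variable that barely distinguishes the clusters
influence <- apply(km$centers, 2, function(v) diff(range(v)))
sort(influence)
```

Variables at the bottom of this ranking are candidates for removal, just as Games Played is removed below.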

# creates a dataset of each variable and the standardized center and graphs it
as_tibble(usage3Means$centers, rownames = "cluster") %>%
  pivot_longer(cols = c(GP:`SH-EFF`), names_to = "variable") %>%
  mutate(variable = factor(variable, usage_levels)) %>%
  ggplot(aes(x = variable, y = value, fill = cluster)) +
  geom_bar(stat = "identity") +
  facet_grid(rows = vars(cluster)) +
  theme(axis.text.x=element_text(angle = -45, hjust = 0, size = 10)) +
  scale_y_continuous(position = "right") +
  labs(title = "Influence on the Cluster Assignment", x = "", y = "Cluster") +
  theme(axis.text.y = element_blank(),
        legend.position = "none")

Before you analyze, remember that variables with a strong negative value still have a large influence; the cluster is simply associated with low values of that variable rather than high ones.

What do you notice about the variables? Which kinds of variables possess significant influence? Some variables have a strong influence in one cluster, but a weak influence in another cluster. Why is this?

After analyzing, would you choose to remove any variables from the data?


We chose to remove the Games Played variable, because its influence was close to 0 in all three clusters. All of the other variables had a large effect in some category.

# reproducing K = 3 means without insignificant variables
set.seed(4)
usage3Means <- usageKMeans_prep %>%
  select(-GP) %>%
  kmeans(centers = 3, nstart = 50)

# creating a second usage dataset without those variables so it does not have to be recomputed each time
usage3 <- usage %>% select(-GP)
usage_rm3 <- usage_rm %>% select(-GP)

Now that we’ve removed some variables, let’s see how many observations are within each cluster.

usage3Means$size %>% as_tibble() %>%
  rename(Size = value) %>% 
  mutate(Cluster = 1:n()) %>%
  relocate(Cluster, .before = Size) %>%
  flextable() %>%
  align(align = "center", part = "all")

| Cluster | Size |
|---|---|
| 1 | 102 |
| 2 | 61 |
| 3 | 211 |

What do you notice about the cluster size? What could this tell us about the clusters?

The clusters are not identical in size, but each is large enough that there is no reason for concern.

# un-standardizing and calculating the mean
usage3centers <- as_tibble(usage3Means$cluster) %>%
  mutate(Name = usage$Name) %>%
  rename(Clusters = value) %>% left_join(usage3, by = "Name") %>%
  group_by(Clusters) %>%
  summarise(
    across(where(is.numeric), mean)
  ) %>%
  mutate(across(where(is.numeric), round, digits = 3))

usage3centers %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = c(2:15), width = .5)

| Clusters | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF | SH-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 57.539 | 33.089 | 19.346 | 4.773 | 2.355 | 1.071 | 0.983 | 4.652 | 0.513 | 2.289 | 6.966 | 15.224 | 2.066 | 5.755 | 3.342 | 4.123 | 17.816 | 1.267 | 0.525 |
| 2 | 36.295 | 22.354 | 9.580 | 1.551 | 1.175 | 0.649 | 2.220 | 4.575 | 0.954 | 2.438 | 3.818 | 6.684 | 0.351 | 1.034 | 1.611 | 2.315 | 18.374 | 1.455 | 0.604 |
| 3 | 17.185 | 20.735 | 7.992 | 1.773 | 0.902 | 0.667 | 0.709 | 2.519 | 0.333 | 1.669 | 2.924 | 6.708 | 1.139 | 3.254 | 1.005 | 1.304 | 12.231 | 1.193 | 0.520 |

What do you notice about the cluster means? Without looking any further, how would you describe the three clusters? Jot down some notes in your table.

Now, let’s look at the clusters graphically.

usage3fviz <- fviz_cluster(usage3Means, usageKMeans_prep,
                           geom = "point",
                           show.clust.cent = TRUE, stand = FALSE,
                           pointsize = 1,
                           main = "Usage K = 3 Clusters")
usage3fviz

What do you notice about the visualization? Are there a lot of observations that reside on the border? Where are the centers and outliers of each cluster?

Compare the new visualization with the K = 2 visualization. Where did the third cluster come from? What kinds of players?

If you were to create a fourth cluster, what points would you group together?

Let’s look at our prototype and outlier players. We’ve compiled them all into a table for you to compare and contrast.

# standardizing the distances between the players
usage3Means_scale <- as_tibble(usage3Means$centers) %>%
  mutate(cluster = 1:3)

# creating appropriate tibble for distance formula
usage_fitted3Means <- usage3Means$cluster %>%
  as_tibble() %>%
  rename(cluster = value) %>% left_join(usage3Means_scale) %>% select(-cluster)
Joining with `by = join_by(cluster)`
# distance from cluster center
distances <- sqrt(rowSums((usage_rm3 - usage_fitted3Means)^ 2)) %>%
  as_tibble() %>%
  rename(distance = value) %>% 
  mutate(
    Name = usage$Name,
    Cluster = usage3Means$cluster)

# creating a master document with all of the prototypes and all of the outliers.
master_distances <- distances %>%
  group_by(Cluster) %>%
  mutate(
    outlier_rank = order(order(distance, decreasing=TRUE)),
    proto_rank = order(order(distance, decreasing = FALSE))) %>%
  filter(outlier_rank < 4 | proto_rank < 4) %>%
  mutate(
    Category = if_else(proto_rank < 4, "Prototype", "Outlier")
  ) %>% 
  select(Name, Cluster, Category) %>%
  left_join(usage3) %>% arrange(Cluster, desc(Category))
Joining with `by = join_by(Name)`
master_distances %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = 2, width = .8) %>%
  width(j = c(4:15), width = .5)

| Name | Cluster | Category | POS | Team | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF | SH-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Miles Bridges | 1 | Prototype | SF | cha | 80 | 35.5 | 20.2 | 3.8 | 1.9 | 0.9 | 1.1 | 5.9 | 0.8 | 2.4 | 7.5 | 15.2 | 1.9 | 5.8 | 3.3 | 4.2 | 17.97 | 1.329 | 0.55 |
| Malcolm Brogdon | 1 | Prototype | PG | ind | 36 | 33.5 | 19.1 | 5.9 | 2.1 | 0.8 | 0.9 | 4.2 | 0.4 | 2.0 | 6.8 | 15.1 | 1.6 | 5.2 | 4.0 | 4.6 | 18.10 | 1.265 | 0.50 |
| Khris Middleton | 1 | Prototype | SF | mil | 66 | 32.4 | 20.1 | 5.4 | 2.9 | 1.2 | 0.6 | 4.8 | 0.3 | 2.4 | 6.8 | 15.5 | 2.5 | 6.6 | 3.9 | 4.4 | 18.19 | 1.298 | 0.52 |
| Nikola Jokic | 1 | Outlier | C | den | 74 | 33.5 | 27.1 | 7.9 | 3.8 | 1.5 | 2.8 | 11.0 | 0.9 | 2.6 | 10.3 | 17.7 | 1.3 | 3.9 | 5.1 | 6.3 | 32.94 | 1.529 | 0.62 |
| Giannis Antetokounmpo | 1 | Outlier | PF | mil | 67 | 32.9 | 29.9 | 5.8 | 3.3 | 1.1 | 2.0 | 9.6 | 1.4 | 3.2 | 10.3 | 18.6 | 1.1 | 3.6 | 8.3 | 11.4 | 32.12 | 1.608 | 0.58 |
| Joel Embiid | 1 | Outlier | C | phi | 68 | 33.8 | 30.6 | 4.2 | 3.1 | 1.1 | 2.1 | 9.6 | 1.5 | 2.7 | 9.8 | 19.6 | 1.4 | 3.7 | 9.6 | 11.8 | 31.24 | 1.558 | 0.53 |
| Nic Claxton | 2 | Prototype | PF | bkn | 19 | 20.7 | 8.7 | 0.9 | 0.8 | 0.5 | 1.9 | 3.7 | 1.1 | 2.3 | 3.8 | 5.6 | 0.0 | 0.0 | 1.1 | 2.0 | 18.66 | 1.553 | 0.67 |
| Isaiah Roby | 2 | Prototype | PF | okc | 28 | 21.1 | 10.1 | 1.6 | 1.0 | 0.8 | 1.7 | 3.2 | 0.8 | 2.4 | 3.7 | 7.2 | 1.0 | 2.2 | 1.7 | 2.6 | 18.35 | 1.406 | 0.58 |
| Richaun Holmes | 2 | Prototype | C | sac | 37 | 23.9 | 10.4 | 1.1 | 1.2 | 0.4 | 2.1 | 5.0 | 0.9 | 2.8 | 4.4 | 6.7 | 0.0 | 0.1 | 1.6 | 2.0 | 17.80 | 1.560 | 0.66 |
| Robert Williams III | 2 | Outlier | C | bos | 61 | 29.6 | 10.0 | 2.0 | 1.0 | 0.9 | 3.9 | 5.7 | 2.2 | 2.2 | 4.4 | 6.0 | 0.0 | 0.0 | 1.1 | 1.5 | 22.10 | 1.649 | 0.74 |
| Myles Turner | 2 | Outlier | C | ind | 42 | 29.4 | 12.9 | 1.0 | 1.3 | 0.7 | 1.5 | 5.5 | 2.8 | 2.8 | 4.8 | 9.4 | 1.5 | 4.4 | 1.9 | 2.5 | 17.45 | 1.374 | 0.59 |
| Rudy Gobert | 2 | Outlier | C | utah | 66 | 32.1 | 15.6 | 1.1 | 1.8 | 0.7 | 3.7 | 11.0 | 2.1 | 2.7 | 5.5 | 7.7 | 0.0 | 0.1 | 4.6 | 6.7 | 24.76 | 2.022 | 0.71 |
| Damion Lee | 3 | Prototype | SG | gs | 5 | 20.0 | 7.4 | 1.0 | 0.6 | 0.6 | 0.4 | 2.8 | 0.1 | 1.5 | 2.7 | 6.1 | 1.0 | 3.0 | 1.0 | 1.2 | 10.90 | 1.219 | 0.52 |
| Ziaire Williams | 3 | Prototype | SG | mem | 31 | 21.7 | 8.1 | 1.0 | 0.7 | 0.6 | 0.4 | 1.7 | 0.2 | 1.8 | 3.1 | 6.8 | 1.2 | 3.9 | 0.7 | 0.9 | 9.70 | 1.182 | 0.54 |
| Rudy Gay | 3 | Prototype | SF | utah | 1 | 18.9 | 8.1 | 1.0 | 0.9 | 0.5 | 1.0 | 3.4 | 0.3 | 1.7 | 2.9 | 6.9 | 1.3 | 3.7 | 1.1 | 1.4 | 13.06 | 1.177 | 0.51 |
| Tomas Satoransky | 3 | Outlier | SG | no | 3 | 15.0 | 2.8 | 2.4 | 0.7 | 0.4 | 0.6 | 1.4 | 0.0 | 1.0 | 1.0 | 3.3 | 0.2 | 1.0 | 0.6 | 0.8 | 6.51 | 0.822 | 0.32 |
| Robert Covington | 3 | Outlier | PF | por | 40 | 29.8 | 7.6 | 1.4 | 1.2 | 1.5 | 0.9 | 4.9 | 1.3 | 2.8 | 2.7 | 7.0 | 1.6 | 4.8 | 0.6 | 0.8 | 9.98 | 1.086 | 0.50 |
| Buddy Hield | 3 | Outlier | SG | sac | 6 | 28.6 | 14.4 | 1.9 | 1.6 | 0.9 | 0.8 | 3.2 | 0.3 | 2.1 | 4.8 | 12.6 | 3.3 | 9.0 | 1.5 | 1.7 | 11.96 | 1.143 | 0.51 |

Here is a smaller table that may help you compare the players more easily.

master_distances1 <- distances %>%
  group_by(Cluster) %>%
  mutate(
    outlier_rank = order(order(distance, decreasing=TRUE)),
    proto_rank = order(order(distance, decreasing = FALSE))) %>%
  filter(outlier_rank < 2 | proto_rank < 2) %>%
  mutate(
    Category = if_else(proto_rank < 2, "Prototype", "Outlier")
  ) %>% 
  select(Name, Cluster, Category) %>%
  left_join(usage3) %>% arrange(desc(Category), Cluster)
Joining with `by = join_by(Name)`
master_distances1 %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = 2, width = .8) %>%
  width(j = c(4:15), width = .5)

| Name | Cluster | Category | POS | Team | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF | SH-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Khris Middleton | 1 | Prototype | SF | mil | 66 | 32.4 | 20.1 | 5.4 | 2.9 | 1.2 | 0.6 | 4.8 | 0.3 | 2.4 | 6.8 | 15.5 | 2.5 | 6.6 | 3.9 | 4.4 | 18.19 | 1.298 | 0.52 |
| Isaiah Roby | 2 | Prototype | PF | okc | 28 | 21.1 | 10.1 | 1.6 | 1.0 | 0.8 | 1.7 | 3.2 | 0.8 | 2.4 | 3.7 | 7.2 | 1.0 | 2.2 | 1.7 | 2.6 | 18.35 | 1.406 | 0.58 |
| Damion Lee | 3 | Prototype | SG | gs | 5 | 20.0 | 7.4 | 1.0 | 0.6 | 0.6 | 0.4 | 2.8 | 0.1 | 1.5 | 2.7 | 6.1 | 1.0 | 3.0 | 1.0 | 1.2 | 10.90 | 1.219 | 0.52 |
| Joel Embiid | 1 | Outlier | C | phi | 68 | 33.8 | 30.6 | 4.2 | 3.1 | 1.1 | 2.1 | 9.6 | 1.5 | 2.7 | 9.8 | 19.6 | 1.4 | 3.7 | 9.6 | 11.8 | 31.24 | 1.558 | 0.53 |
| Rudy Gobert | 2 | Outlier | C | utah | 66 | 32.1 | 15.6 | 1.1 | 1.8 | 0.7 | 3.7 | 11.0 | 2.1 | 2.7 | 5.5 | 7.7 | 0.0 | 0.1 | 4.6 | 6.7 | 24.76 | 2.022 | 0.71 |
| Tomas Satoransky | 3 | Outlier | SG | no | 3 | 15.0 | 2.8 | 2.4 | 0.7 | 0.4 | 0.6 | 1.4 | 0.0 | 1.0 | 1.0 | 3.3 | 0.2 | 1.0 | 0.6 | 0.8 | 6.51 | 0.822 | 0.32 |

Use the above tables to summarize each of the 6 categories. What kind of players belong in each category? Is there a lot of variation within the prototypes? Is there a lot of variation within the outliers? Which of the outliers are closest to a different cluster? Would you reclassify any of the outliers?

After looking through the clusters, why do you think cluster 2 is so much smaller?

Let’s analyze the overall strength of K = 3 clusters. How does the intra-class similarity compare with K = 2? The inter-class similarity?

# usage3fviz

Comparing K = 2 to K = 3 - Mix

Often, it is interesting to compare the cluster results. Here, we tabulated the cluster assignments between K = 2 and K = 3. This can help us to see how the clustering with K = 2 overlaps with K = 3.

# creating a tibble of the cluster of each player for each K
clusters <- tibble(
  player = usage$Name,
  Cluster = usage2Means$cluster,
  clus3 = usage3Means$cluster,
  clus4 = usage4Means$cluster
)

# tabulating K = 2 and K = 3 clusters
compare_K2K3 <- with(clusters, table(Cluster, clus3)) %>%
  as_tibble() %>%
  pivot_wider(names_from = clus3, values_from = n)

# printing table using flextable
compare_K2K3 %>%
  flextable() %>%
  align(align = "center", part = "all")

| K = 2 Cluster | K = 3 Cluster 1 | K = 3 Cluster 2 | K = 3 Cluster 3 |
|---|---|---|---|
| 1 | 102 | 13 | 4 |
| 2 | 0 | 48 | 207 |

What do you notice about the clustering distribution?

We can see that most players in cluster 1 from K = 2 stayed in cluster 1 when K = 3. We identified both of these clusters as the “starters,” so this makes a lot of intuitive sense. Most of cluster 2 from K = 2 moved into cluster 3 when K = 3. The interesting transition comes with the middle cluster of K = 3. This cluster is full of big men that don’t score a lot. They came from both cluster 1 and cluster 2 of K = 2. We saw this in our outlier analysis earlier.

Exercise 10

What are the benefits and costs of both K = 2 and K = 3? Which would you choose?

Part 6: Role Data Set

Now we move on to a second data set and we want to give you a lot more autonomy to test different clusters or outliers yourself. The data set is different, but the process is almost exactly the same. If you have questions, we’ll give you hints or you can look back to the usage data set for a clear example.

Remember the role data set? It contains variables aimed at categorizing the function and specific characteristics of the players. We hope to divide players into sub-groups like scorers, 3-point shooters, and rebounders.

Even though most of our data has already been adjusted to “per minute” quantities, it is still very important that we standardize the data first. Otherwise, common values like points per minute will outweigh the effect of less common characteristics like blocks per minute. After standardizing, each variable is on the same scale. The standardized data is often difficult to contextualize, so we’ll convert the data back for analysis. Below is a small glimpse of what our standardized data looks like.
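To see why standardizing matters, here is a minimal base-R sketch with made-up numbers for three hypothetical players: without standardizing, the points column dominates the Euclidean distance, and a large difference in blocks barely registers.

```r
# three hypothetical players: points and blocks per game (made-up values)
m <- rbind(p1 = c(pts = 25, blk = 0.2),
           p2 = c(pts = 24, blk = 2.0),   # elite shot-blocker
           p3 = c(pts = 20, blk = 0.2))

# raw distances: the blk difference of 1.8 is dwarfed by pts differences
dist(m)

# after standardizing, each column has mean 0 and sd 1,
# so blocks and points contribute on the same scale
dist(scale(m))
```

The `standardize()` step in the chunk below plays the same role as `scale()` here.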


# initializing our datasets a second time in case the student decides to remove a variable.
# Note: rounding these values to 3 digits changes the elbow plot's suggested K,
# so we keep 4 digits for consistency with the K = 7 analysis below.
role <- nba %>%
  select(Name, POS, Team, Height, Weight, PTSPerMin, ASTPerMin, TOPerMin, STLPerMin, ORPerMin, DRPerMin, BLKPerMin, PFPerMin, FGP, FGMPerMin, FGAPerMin, `3PP`, `3PMPerMin`, `3PAPerMin`, FTP, FTMPerMin, FTAPerMin)

# standardizing the data for KMeans
roleKMeans_prep <- role %>%
  mutate(across(where(is.numeric), standardize))

# displaying the standardized data for student
roleKMeans_prep %>%
  slice(1:5) %>%
  mutate(across(where(is.numeric), round, digits = 3)) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:5), width = .6) %>%
  width(j = c(6:12), width = .95)

| Name | POS | Team | Height | Weight | PTSPerMin | ASTPerMin | TOPerMin | STLPerMin | ORPerMin | DRPerMin | BLKPerMin | PFPerMin | FGP | FGMPerMin | FGAPerMin | 3PP | 3PMPerMin | 3PAPerMin | FTP | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Trae Young | PG | atl | -1.655 | -1.496 | 2.824 | 3.171 | 2.751 | -0.499 | -0.698 | -0.944 | -1.083 | -1.288 | -0.085 | 2.237 | 2.300 | 0.525 | 1.354 | 1.164 | 1.376 | 3.112 | 2.499 |
| John Collins | PF | atl | 0.842 | 0.774 | 0.620 | -0.708 | -0.773 | -1.019 | 0.269 | 1.024 | 0.734 | 0.474 | 0.820 | 0.857 | 0.313 | 0.357 | -0.356 | -0.443 | 0.259 | 0.360 | 0.275 |
| Bogdan Bogdanovic | SG | atl | -0.094 | 0.155 | 0.539 | 0.129 | -0.692 | 0.469 | -0.780 | -0.392 | -0.840 | -0.457 | -0.483 | 0.426 | 0.757 | 0.394 | 1.468 | 1.427 | 0.762 | -0.405 | -0.529 |
| De'Andre Hunter | SF | atl | 0.530 | 0.362 | 0.036 | -0.970 | -0.420 | -0.689 | -0.788 | -0.852 | -0.435 | 0.471 | -0.332 | -0.069 | 0.069 | 0.497 | -0.081 | -0.219 | -0.022 | 0.343 | 0.344 |
| Kevin Huerter | SG | atl | 0.218 | -1.083 | -0.277 | -0.129 | -0.558 | -0.676 | -0.877 | -0.719 | -0.429 | 0.005 | -0.168 | -0.118 | -0.078 | 0.590 | 0.857 | 0.637 | 0.410 | -1.193 | -1.304 |

# finishing prepping data for KMeans procedure
roleKMeans_prep <- roleKMeans_prep %>%
  column_to_rownames(var = "Name") %>%
  select(-Team, -POS)

Let’s check our Elbow plot to get an idea of the clustering.

# removing text for visualizations and standardizing
role_rm <- role %>%
  select(-Name, -POS, -Team) %>%
  mutate(across(where(is.numeric), standardize))

fviz_nbclust(role_rm, kmeans, method = "wss", k.max = 24) +
  theme_minimal() +
  labs(title = "The Elbow Method")

Exercise 11

a) What do you see from the Elbow plot? At what point do the returns diminish?

b) How many clusters does the Elbow plot suggest?

# creates consensus clusters
roleClust <- n_clusters(role_rm,
                        package = c("easystats", "NbClust"),
                        standardize = FALSE, n_max = 10)
plot(roleClust) +
  labs(title = "Optimal Number of Clusters", x = "")

There’s a lot of variation in the preferred number of clusters. How many clusters, and how many different values of K, would you like to analyze? This is totally up to you. Feel free to move back and forth through this section and explore the data as much as you like.

Exercise 12

We will be using K = 7 for the trade scenario portion, so we recommend you review through K = 7.


# assume that they want K = 7.
stu_cluster <- 7

Ok, you’ve chosen K = 7. Here is an empty table for you to describe each of the clusters. As you grow in understanding of each of the clusters, fill it out with a few distinguishing words. Make sure you can glance at the table and understand what separates one cluster from another.

stu_role_table <- tibble(
  Cluster = 1:stu_cluster,
  Description = "")

stu_role_table %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 2, width = 4)

| Cluster | Description |
|---|---|
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |

We’ll begin by looking at the mean for each variable of a cluster. Remember, this can help us identify variables that are not useful and get a general understanding of the characteristics of each cluster.

There may be a lot of variables, so we flipped the coordinates of the plot to make it easier to read. A bar to the right indicates a positive association and a bar to the left indicates a negative association.

set.seed(100)
roleKMeans <- kmeans(roleKMeans_prep, centers = stu_cluster, nstart = 50)
# creating factor levels for role
role_levels <- colnames(role)

# creates a dataset of each variable and the standardized center and graphs it
as_tibble(roleKMeans$centers, rownames = "cluster") %>%
  pivot_longer(cols = c(Height:FTAPerMin), names_to = "variable") %>%
  mutate(variable = factor(variable, role_levels)) %>%
  ggplot(aes(x = variable, y = value, fill = cluster)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  geom_hline(yintercept = 0) +
  facet_grid(cols = vars(cluster), switch = "both") +
  labs(title = "Influence on the Cluster Assignment", x = "", y = "Cluster") +
  theme(axis.text.x = element_blank(),
        legend.position = "none")

Sift through the variables to see if any are unused throughout the clusters. If so, this indicates that the variable does not help differentiate the data into clusters. You can remove it here:

# if you want to remove a variable, enter its name here, e.g. c("GP");
# character(0) removes nothing
role_var_rm <- character(0)

# reproducing roleKMeans without the removed variables;
# any_of() avoids the tidyselect warning about external vectors
set.seed(100)
roleKMeans <- roleKMeans_prep %>%
  select(-any_of(role_var_rm)) %>%
  kmeans(centers = stu_cluster, nstart = 50)

role <- role %>% select(-any_of(role_var_rm))
role_rm <- role_rm %>% select(-any_of(role_var_rm))

If you chose a large number of clusters, it may be difficult to use this visualization to remove unimportant variables. Instead, you should be able to see some of the important attributes of each of the clusters. Be thinking of identifiers for each cluster. Which variables are important throughout?

Let’s begin to analyze the numeric values of the centers. Look through each cluster’s characteristics. What sticks out to you?

role_summary <- role %>% summarise(
  across(where(is.numeric), mean)) %>%
  mutate(
    Clusters = "Data Average"
  ) %>% relocate(Clusters)

roleKcenters <- as_tibble(roleKMeans$cluster) %>%
  mutate(Name = role$Name) %>%
  rename(Clusters = value) %>% left_join(role, by = "Name") %>%
  group_by(Clusters) %>%
  summarise(
    across(where(is.numeric), mean)
  ) %>%
  mutate(
    Clusters = as.character(Clusters)
  ) %>% bind_rows(role_summary) %>%
  mutate(across(where(is.numeric), round, digits = 3),
         Height = round(Height, digits = 1),
         Weight = round(Weight, digits = 1))

roleKcenters %>%
  reactable(
    defaultColDef = colDef(
      cell = color_tiles(.)))

Which clusters are scorers? Which are rebounders? Which have higher assist numbers? Higher 3-point shooting? Are any two clusters similar? What differentiates them?

At this point, give a short descriptor of each cluster. Each cluster should be uniquely described.

Let’s look at the size of each cluster.

roleKMeans$size %>% as_tibble() %>%
  rename(Size = value) %>% 
  mutate(Cluster = 1:n()) %>%
  relocate(Cluster, .before = Size) %>%
  flextable() %>%
  align(align = "center", part = "all")

| Cluster | Size |
|---|---|
| 1 | 51 |
| 2 | 45 |
| 3 | 96 |
| 4 | 26 |
| 5 | 20 |
| 6 | 38 |
| 7 | 98 |

Does this surprise you? Which clusters are large and small? Does this fit with your perception of the makeup of NBA teams?

Let’s look at the distribution of the players.

rolefviz <- fviz_cluster(roleKMeans, roleKMeans_prep,
                         geom = "point",
                         show.clust.cent = TRUE, stand = FALSE,
                         pointsize = 1,
                         main = "Role K Clusters")
rolefviz

What do you notice from the visualization? Remember, two dimensions cannot represent all of the data, so clusters may appear to overlap. Imagine a third dimension “Z” that explains another 30%-40% of the variance.

Where are the cluster centers and outliers? Which clusters seem to be the closest together? Furthest away? Are any clusters more isolated than others? Is this supported by your previous analysis?

If you had to add another cluster where would it be? If you had to remove a cluster, where would it be?

Let’s look at our prototype and outlier analysis.

First, we need to verify that our prototypes and outliers really behave like prototypes and outliers. Now that we can change the number of clusters, it’s possible that you have some pretty small clusters. With a smaller sample size, we want to ensure that all our prototypes are indeed close to the cluster center and that all our outliers are indeed far away. In our K = 2 usage analysis, our prototypes were about 1 to 2.3 units away from the center, and our outliers were about 6 to 8.5 units away. However, as K increases, the outlier distances should fall. Let’s look at the distances from the center for the top 3 prototypes and outliers in each cluster to see how they compare.

# extracting the cluster centers
roleKMeans_scale <- as_tibble(roleKMeans$centers) %>%
  mutate(cluster = 1:n())

# creating appropriate tibble for distance formula
role_fittedKMeans <- roleKMeans$cluster %>%
  as_tibble() %>%
  rename(cluster = value) %>% left_join(roleKMeans_scale, by = "cluster") %>% select(-cluster)
# distance from cluster center
distances <- sqrt(rowSums((role_rm - role_fittedKMeans)^2)) %>%
  as_tibble() %>%
  rename(distance = value) %>% 
  mutate(
    Name = role$Name,
    Cluster = roleKMeans$cluster)

master_distances <- distances %>%
  group_by(Cluster) %>%
  mutate(
    outlier_rank = order(order(distance, decreasing=TRUE)),
    proto_rank = order(order(distance, decreasing = FALSE))) %>%
  filter(outlier_rank < 4 | proto_rank < 4) %>%
  mutate(
    Category = if_else(proto_rank < 4, "Prototype", "Outlier")
  ) %>%
  arrange(Cluster, distance) %>%
  select(-outlier_rank, -proto_rank) %>%
  relocate(distance, .after = Category) %>%
  relocate(Name, .after = Category)

master_distances %>%
  mutate(distance = round(distance, digits = 4)) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 3, width = 1.3)

| Cluster | Category | Name | distance |
|---------|----------|------|----------|
| 1 | Prototype | Trendon Watford | 1.8523 |
| 1 | Prototype | Isaiah Roby | 1.8794 |
| 1 | Prototype | John Collins | 2.0773 |
| 1 | Outlier | Isaiah Jackson | 5.3644 |
| 1 | Outlier | Tristan Thompson | 6.8757 |
| 1 | Outlier | Jakob Poeltl | 7.1035 |
| 2 | Prototype | Eric Bledsoe | 1.6470 |
| 2 | Prototype | Marcus Smart | 1.6596 |
| 2 | Prototype | Raul Neto | 1.7083 |
| 2 | Outlier | Josh Giddey | 3.7481 |
| 2 | Outlier | Jose Alvarado | 3.8303 |
| 2 | Outlier | Draymond Green | 5.0848 |
| 3 | Prototype | Coby White | 1.3340 |
| 3 | Prototype | Saddiq Bey | 1.4581 |
| 3 | Prototype | Lonnie Walker IV | 1.4612 |
| 3 | Outlier | Mike Muscala | 4.0470 |
| 3 | Outlier | Klay Thompson | 4.1396 |
| 3 | Outlier | Kevin Love | 4.4379 |
| 4 | Prototype | Ivica Zubac | 1.9336 |
| 4 | Prototype | Bismack Biyombo | 1.9389 |
| 4 | Prototype | Nic Claxton | 2.3272 |
| 4 | Outlier | Rudy Gobert | 4.6189 |
| 4 | Outlier | JaVale McGee | 4.6530 |
| 4 | Outlier | Thaddeus Young | 5.0444 |
| 5 | Prototype | Karl-Anthony Towns | 2.0831 |
| 5 | Prototype | Pascal Siakam | 2.4719 |
| 5 | Prototype | Jonas Valanciunas | 2.5886 |
| 5 | Outlier | Giannis Antetokounmpo | 5.5937 |
| 5 | Outlier | Joel Embiid | 5.9766 |
| 5 | Outlier | DeMarcus Cousins | 6.0388 |
| 6 | Prototype | Khris Middleton | 1.5916 |
| 6 | Prototype | Bradley Beal | 1.6040 |
| 6 | Prototype | Jaylen Brown | 1.8622 |
| 6 | Outlier | James Harden | 4.0727 |
| 6 | Outlier | Luka Doncic | 4.0863 |
| 6 | Outlier | Trae Young | 4.1980 |
| 7 | Prototype | Torrey Craig | 1.3072 |
| 7 | Prototype | Torrey Craig1 | 1.6834 |
| 7 | Prototype | CJ Elleby | 1.7221 |
| 7 | Outlier | Xavier Tillman | 4.2227 |
| 7 | Outlier | Thaddeus Young1 | 4.2333 |
| 7 | Outlier | Gary Payton II | 5.0684 |

Which prototypes are the strongest prototypes? Which prototypes do you trust the most? Which are the strongest outliers? Would you disqualify any outliers or prototypes from the analysis (i.e., a supposed outlier is not far enough from the center, or a labeled prototype is too far from the center)?


If you wish to disqualify a player from analysis, do it here:

For illustration, assume the student disqualifies Nic Claxton.

disqualify <- c("Nic Claxton")
roleKMeans$size %>% as_tibble() %>%
  rename(Size = value) %>% 
  mutate(Cluster = 1:n()) %>%
  relocate(Cluster, .before = Size) %>%
  flextable() %>%
  align(align = "center", part = "all")

| Cluster | Size |
|---------|------|
| 1       | 51   |
| 2       | 45   |
| 3       | 96   |
| 4       | 26   |
| 5       | 20   |
| 6       | 38   |
| 7       | 98   |

Look again at the size of each cluster. Does this help explain any of your findings?

These outliers can be very different from each other. We’ll need to look into them to see what kind of players they are. Once again, we’ll show you the top 3 of each category first, and afterward a smaller table with only the top player.

# creating a master document with all of the prototypes and all of the outliers.
mast_dist_slice <- distances %>%
  group_by(Cluster) %>%
  mutate(
    outlier_rank = order(order(distance, decreasing = TRUE)),
    proto_rank = order(order(distance, decreasing = FALSE))) %>%
  filter(outlier_rank < 4 | proto_rank < 4) %>%
  mutate(
    Category = if_else(proto_rank < 4, "Prototype", "Outlier")
  ) %>% 
  select(Name, Cluster, Category) %>%
  left_join(role, by = "Name") %>% arrange(Cluster, desc(Category)) %>%
  filter(!(Name %in% disqualify))
mast_dist_slice %>%
  mutate(across(where(is.numeric), ~round(.x, digits = 3))) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(4:7), width = .6) %>%
  width(j = c(8:14), width = .95)

| Name | Cluster | Category | POS | Team | Height | Weight | PTSPerMin | ASTPerMin | TOPerMin | STLPerMin | ORPerMin | DRPerMin | BLKPerMin | PFPerMin | FGP | FGMPerMin | FGAPerMin | 3PP | 3PMPerMin | 3PAPerMin | FTP | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| John Collins | 1 | Prototype | PF | atl | 81 | 235 | 0.526 | 0.058 | 0.036 | 0.019 | 0.055 | 0.198 | 0.032 | 0.097 | 0.526 | 0.205 | 0.386 | 0.364 | 0.039 | 0.107 | 0.793 | 0.081 | 0.101 |
| Isaiah Roby | 1 | Prototype | PF | okc | 80 | 230 | 0.479 | 0.076 | 0.047 | 0.038 | 0.081 | 0.152 | 0.038 | 0.114 | 0.514 | 0.175 | 0.341 | 0.444 | 0.047 | 0.104 | 0.672 | 0.081 | 0.123 |
| Trendon Watford | 1 | Prototype | PF | por | 81 | 240 | 0.420 | 0.094 | 0.050 | 0.028 | 0.066 | 0.166 | 0.033 | 0.133 | 0.532 | 0.166 | 0.309 | 0.237 | 0.011 | 0.044 | 0.755 | 0.083 | 0.110 |
| Isaiah Jackson | 1 | Outlier | F | ind | 82 | 205 | 0.553 | 0.020 | 0.073 | 0.047 | 0.113 | 0.167 | 0.093 | 0.173 | 0.563 | 0.213 | 0.380 | 0.313 | 0.007 | 0.027 | 0.682 | 0.113 | 0.160 |
| Tristan Thompson | 1 | Outlier | C | sac | 81 | 254 | 0.408 | 0.039 | 0.066 | 0.026 | 0.158 | 0.197 | 0.026 | 0.112 | 0.503 | 0.164 | 0.329 | 1.000 | 0.000 | 0.000 | 0.533 | 0.072 | 0.132 |
| Jakob Poeltl | 1 | Outlier | C | sa | 85 | 245 | 0.466 | 0.097 | 0.055 | 0.024 | 0.134 | 0.190 | 0.059 | 0.107 | 0.618 | 0.207 | 0.338 | 1.000 | 0.000 | 0.000 | 0.495 | 0.048 | 0.097 |
| Marcus Smart | 2 | Prototype | PG | bos | 75 | 220 | 0.375 | 0.183 | 0.068 | 0.053 | 0.019 | 0.099 | 0.009 | 0.071 | 0.418 | 0.130 | 0.313 | 0.331 | 0.053 | 0.158 | 0.793 | 0.062 | 0.077 |
| Eric Bledsoe | 2 | Prototype | SG | lac | 73 | 214 | 0.393 | 0.167 | 0.083 | 0.052 | 0.020 | 0.115 | 0.016 | 0.063 | 0.421 | 0.143 | 0.345 | 0.313 | 0.036 | 0.119 | 0.761 | 0.063 | 0.087 |
| Raul Neto | 2 | Prototype | PG | wsh | 73 | 180 | 0.383 | 0.158 | 0.056 | 0.041 | 0.010 | 0.087 | 0.000 | 0.077 | 0.463 | 0.148 | 0.321 | 0.292 | 0.026 | 0.087 | 0.769 | 0.061 | 0.077 |
| Draymond Green | 2 | Outlier | PF | gs | 78 | 230 | 0.260 | 0.242 | 0.104 | 0.045 | 0.035 | 0.218 | 0.038 | 0.104 | 0.525 | 0.100 | 0.194 | 0.296 | 0.010 | 0.042 | 0.659 | 0.045 | 0.069 |
| Jose Alvarado | 2 | Outlier | PG | no | 72 | 179 | 0.396 | 0.182 | 0.045 | 0.084 | 0.032 | 0.091 | 0.006 | 0.091 | 0.446 | 0.156 | 0.351 | 0.291 | 0.039 | 0.130 | 0.679 | 0.045 | 0.065 |
| Josh Giddey | 2 | Outlier | SG | okc | 80 | 205 | 0.397 | 0.203 | 0.102 | 0.029 | 0.057 | 0.190 | 0.013 | 0.051 | 0.419 | 0.165 | 0.394 | 0.263 | 0.032 | 0.124 | 0.709 | 0.032 | 0.048 |
| Coby White | 3 | Prototype | PG | chi | 77 | 195 | 0.462 | 0.105 | 0.040 | 0.018 | 0.011 | 0.098 | 0.007 | 0.080 | 0.433 | 0.167 | 0.385 | 0.385 | 0.080 | 0.211 | 0.857 | 0.047 | 0.055 |
| Saddiq Bey | 3 | Prototype | SF | det | 79 | 215 | 0.488 | 0.085 | 0.036 | 0.027 | 0.039 | 0.124 | 0.006 | 0.048 | 0.396 | 0.167 | 0.421 | 0.346 | 0.079 | 0.224 | 0.827 | 0.079 | 0.094 |
| Lonnie Walker IV | 3 | Prototype | G | sa | 76 | 204 | 0.526 | 0.096 | 0.043 | 0.026 | 0.013 | 0.100 | 0.013 | 0.061 | 0.407 | 0.191 | 0.474 | 0.314 | 0.070 | 0.217 | 0.784 | 0.074 | 0.091 |
| Kevin Love | 3 | Outlier | PF | cle | 80 | 251 | 0.604 | 0.098 | 0.058 | 0.018 | 0.053 | 0.271 | 0.009 | 0.062 | 0.430 | 0.196 | 0.458 | 0.392 | 0.111 | 0.284 | 0.838 | 0.098 | 0.120 |
| Klay Thompson | 3 | Outlier | SG | gs | 78 | 215 | 0.694 | 0.095 | 0.044 | 0.017 | 0.017 | 0.116 | 0.017 | 0.058 | 0.429 | 0.262 | 0.609 | 0.385 | 0.122 | 0.316 | 0.902 | 0.048 | 0.054 |
| Mike Muscala | 3 | Outlier | C | okc | 82 | 240 | 0.580 | 0.036 | 0.022 | 0.029 | 0.036 | 0.181 | 0.043 | 0.094 | 0.456 | 0.188 | 0.420 | 0.429 | 0.116 | 0.275 | 0.842 | 0.080 | 0.094 |
| Ivica Zubac | 4 | Prototype | C | lac | 84 | 240 | 0.422 | 0.066 | 0.061 | 0.020 | 0.119 | 0.230 | 0.041 | 0.111 | 0.626 | 0.168 | 0.266 | 0.000 | 0.000 | 0.000 | 0.727 | 0.090 | 0.123 |
| Bismack Biyombo | 4 | Prototype | C | phx | 80 | 255 | 0.411 | 0.043 | 0.050 | 0.021 | 0.128 | 0.206 | 0.050 | 0.135 | 0.593 | 0.170 | 0.284 | 0.000 | 0.000 | 0.000 | 0.535 | 0.078 | 0.142 |
| JaVale McGee | 4 | Outlier | C | phx | 84 | 270 | 0.582 | 0.038 | 0.082 | 0.019 | 0.139 | 0.285 | 0.070 | 0.152 | 0.629 | 0.247 | 0.392 | 0.222 | 0.000 | 0.006 | 0.699 | 0.089 | 0.127 |
| Thaddeus Young | 4 | Outlier | PF | sa | 80 | 235 | 0.430 | 0.162 | 0.085 | 0.063 | 0.106 | 0.141 | 0.021 | 0.106 | 0.578 | 0.197 | 0.345 | 0.000 | 0.000 | 0.014 | 0.455 | 0.028 | 0.056 |
| Rudy Gobert | 4 | Outlier | C | utah | 85 | 258 | 0.486 | 0.034 | 0.056 | 0.022 | 0.115 | 0.343 | 0.065 | 0.084 | 0.713 | 0.171 | 0.240 | 0.000 | 0.000 | 0.003 | 0.690 | 0.143 | 0.209 |
| Karl-Anthony Towns | 5 | Prototype | C | min | 83 | 248 | 0.737 | 0.108 | 0.093 | 0.030 | 0.078 | 0.216 | 0.033 | 0.108 | 0.529 | 0.260 | 0.491 | 0.410 | 0.060 | 0.147 | 0.822 | 0.156 | 0.189 |
| Jonas Valanciunas | 5 | Prototype | C | no | 83 | 265 | 0.587 | 0.086 | 0.079 | 0.020 | 0.102 | 0.274 | 0.026 | 0.109 | 0.544 | 0.228 | 0.419 | 0.361 | 0.026 | 0.069 | 0.820 | 0.106 | 0.129 |
| Pascal Siakam | 5 | Prototype | PF | tor | 81 | 230 | 0.602 | 0.140 | 0.071 | 0.034 | 0.050 | 0.174 | 0.016 | 0.087 | 0.494 | 0.232 | 0.470 | 0.344 | 0.029 | 0.084 | 0.749 | 0.111 | 0.148 |
| DeMarcus Cousins | 5 | Outlier | C | den | 82 | 270 | 0.640 | 0.122 | 0.158 | 0.043 | 0.115 | 0.281 | 0.029 | 0.216 | 0.456 | 0.216 | 0.475 | 0.324 | 0.058 | 0.173 | 0.736 | 0.151 | 0.201 |
| Giannis Antetokounmpo | 5 | Outlier | PF | mil | 83 | 242 | 0.909 | 0.176 | 0.100 | 0.033 | 0.061 | 0.292 | 0.043 | 0.097 | 0.553 | 0.313 | 0.565 | 0.293 | 0.033 | 0.109 | 0.722 | 0.252 | 0.347 |
| Joel Embiid | 5 | Outlier | C | phi | 84 | 280 | 0.905 | 0.124 | 0.092 | 0.033 | 0.062 | 0.284 | 0.044 | 0.080 | 0.499 | 0.290 | 0.580 | 0.371 | 0.041 | 0.109 | 0.814 | 0.284 | 0.349 |
| Jaylen Brown | 6 | Prototype | SG | bos | 78 | 223 | 0.702 | 0.104 | 0.080 | 0.033 | 0.024 | 0.158 | 0.009 | 0.074 | 0.473 | 0.259 | 0.548 | 0.358 | 0.074 | 0.208 | 0.758 | 0.110 | 0.143 |
| Khris Middleton | 6 | Prototype | SF | mil | 79 | 222 | 0.620 | 0.167 | 0.090 | 0.037 | 0.019 | 0.148 | 0.009 | 0.074 | 0.443 | 0.210 | 0.478 | 0.373 | 0.077 | 0.204 | 0.890 | 0.120 | 0.136 |
| Bradley Beal | 6 | Prototype | SG | wsh | 75 | 207 | 0.644 | 0.183 | 0.094 | 0.025 | 0.028 | 0.106 | 0.011 | 0.067 | 0.451 | 0.242 | 0.536 | 0.300 | 0.044 | 0.147 | 0.833 | 0.117 | 0.142 |
| Trae Young | 6 | Outlier | PG | atl | 73 | 180 | 0.814 | 0.278 | 0.115 | 0.026 | 0.020 | 0.089 | 0.003 | 0.049 | 0.460 | 0.269 | 0.582 | 0.382 | 0.089 | 0.229 | 0.904 | 0.189 | 0.209 |
| James Harden | 6 | Outlier | SG | bkn | 77 | 220 | 0.608 | 0.276 | 0.130 | 0.035 | 0.027 | 0.189 | 0.019 | 0.065 | 0.414 | 0.178 | 0.432 | 0.332 | 0.062 | 0.189 | 0.869 | 0.186 | 0.216 |
| Luka Doncic | 6 | Outlier | PG | dal | 79 | 230 | 0.802 | 0.246 | 0.127 | 0.034 | 0.025 | 0.234 | 0.017 | 0.062 | 0.457 | 0.280 | 0.610 | 0.353 | 0.088 | 0.249 | 0.744 | 0.158 | 0.212 |
| Torrey Craig | 7 | Prototype | SF | ind | 79 | 221 | 0.320 | 0.054 | 0.039 | 0.025 | 0.059 | 0.133 | 0.020 | 0.094 | 0.456 | 0.123 | 0.271 | 0.333 | 0.044 | 0.133 | 0.771 | 0.025 | 0.034 |
| Torrey Craig1 | 7 | Prototype | SF | phx | 79 | 221 | 0.332 | 0.058 | 0.048 | 0.038 | 0.048 | 0.159 | 0.029 | 0.101 | 0.450 | 0.130 | 0.284 | 0.323 | 0.053 | 0.173 | 0.706 | 0.019 | 0.029 |
| CJ Elleby | 7 | Prototype | SG | por | 78 | 200 | 0.287 | 0.074 | 0.050 | 0.030 | 0.054 | 0.139 | 0.015 | 0.099 | 0.393 | 0.104 | 0.262 | 0.294 | 0.030 | 0.109 | 0.714 | 0.050 | 0.069 |
| Gary Payton II | 7 | Outlier | SG | gs | 75 | 195 | 0.403 | 0.051 | 0.034 | 0.080 | 0.057 | 0.142 | 0.017 | 0.102 | 0.616 | 0.170 | 0.273 | 0.358 | 0.034 | 0.097 | 0.603 | 0.028 | 0.045 |
| Xavier Tillman | 7 | Outlier | C | mem | 80 | 245 | 0.364 | 0.091 | 0.045 | 0.068 | 0.091 | 0.136 | 0.023 | 0.091 | 0.454 | 0.136 | 0.311 | 0.204 | 0.015 | 0.068 | 0.648 | 0.068 | 0.098 |
| Thaddeus Young1 | 7 | Outlier | PF | tor | 80 | 235 | 0.344 | 0.093 | 0.044 | 0.066 | 0.082 | 0.158 | 0.022 | 0.093 | 0.465 | 0.142 | 0.301 | 0.395 | 0.038 | 0.093 | 0.481 | 0.027 | 0.055 |

Below is the smaller table.

mast_dist_slice1 <- distances %>%
  group_by(Cluster) %>%
  mutate(
    outlier_rank = order(order(distance, decreasing = TRUE)),
    proto_rank = order(order(distance, decreasing = FALSE))) %>%
  filter(outlier_rank < 2 | proto_rank < 2) %>%
  mutate(
    Category = if_else(proto_rank < 2, "Prototype", "Outlier")
  ) %>% 
  select(Name, Cluster, Category) %>%
  left_join(role, by = "Name") %>% arrange(desc(Category), Cluster) %>%
  filter(!(Name %in% disqualify))
mast_dist_slice1 %>%
  mutate(across(where(is.numeric), ~round(.x, digits = 3))) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(4:7), width = .6) %>%
  width(j = c(8:14), width = .95)

| Name | Cluster | Category | POS | Team | Height | Weight | PTSPerMin | ASTPerMin | TOPerMin | STLPerMin | ORPerMin | DRPerMin | BLKPerMin | PFPerMin | FGP | FGMPerMin | FGAPerMin | 3PP | 3PMPerMin | 3PAPerMin | FTP | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Trendon Watford | 1 | Prototype | PF | por | 81 | 240 | 0.420 | 0.094 | 0.050 | 0.028 | 0.066 | 0.166 | 0.033 | 0.133 | 0.532 | 0.166 | 0.309 | 0.237 | 0.011 | 0.044 | 0.755 | 0.083 | 0.110 |
| Eric Bledsoe | 2 | Prototype | SG | lac | 73 | 214 | 0.393 | 0.167 | 0.083 | 0.052 | 0.020 | 0.115 | 0.016 | 0.063 | 0.421 | 0.143 | 0.345 | 0.313 | 0.036 | 0.119 | 0.761 | 0.063 | 0.087 |
| Coby White | 3 | Prototype | PG | chi | 77 | 195 | 0.462 | 0.105 | 0.040 | 0.018 | 0.011 | 0.098 | 0.007 | 0.080 | 0.433 | 0.167 | 0.385 | 0.385 | 0.080 | 0.211 | 0.857 | 0.047 | 0.055 |
| Ivica Zubac | 4 | Prototype | C | lac | 84 | 240 | 0.422 | 0.066 | 0.061 | 0.020 | 0.119 | 0.230 | 0.041 | 0.111 | 0.626 | 0.168 | 0.266 | 0.000 | 0.000 | 0.000 | 0.727 | 0.090 | 0.123 |
| Karl-Anthony Towns | 5 | Prototype | C | min | 83 | 248 | 0.737 | 0.108 | 0.093 | 0.030 | 0.078 | 0.216 | 0.033 | 0.108 | 0.529 | 0.260 | 0.491 | 0.410 | 0.060 | 0.147 | 0.822 | 0.156 | 0.189 |
| Khris Middleton | 6 | Prototype | SF | mil | 79 | 222 | 0.620 | 0.167 | 0.090 | 0.037 | 0.019 | 0.148 | 0.009 | 0.074 | 0.443 | 0.210 | 0.478 | 0.373 | 0.077 | 0.204 | 0.890 | 0.120 | 0.136 |
| Torrey Craig | 7 | Prototype | SF | ind | 79 | 221 | 0.320 | 0.054 | 0.039 | 0.025 | 0.059 | 0.133 | 0.020 | 0.094 | 0.456 | 0.123 | 0.271 | 0.333 | 0.044 | 0.133 | 0.771 | 0.025 | 0.034 |
| Jakob Poeltl | 1 | Outlier | C | sa | 85 | 245 | 0.466 | 0.097 | 0.055 | 0.024 | 0.134 | 0.190 | 0.059 | 0.107 | 0.618 | 0.207 | 0.338 | 1.000 | 0.000 | 0.000 | 0.495 | 0.048 | 0.097 |
| Draymond Green | 2 | Outlier | PF | gs | 78 | 230 | 0.260 | 0.242 | 0.104 | 0.045 | 0.035 | 0.218 | 0.038 | 0.104 | 0.525 | 0.100 | 0.194 | 0.296 | 0.010 | 0.042 | 0.659 | 0.045 | 0.069 |
| Kevin Love | 3 | Outlier | PF | cle | 80 | 251 | 0.604 | 0.098 | 0.058 | 0.018 | 0.053 | 0.271 | 0.009 | 0.062 | 0.430 | 0.196 | 0.458 | 0.392 | 0.111 | 0.284 | 0.838 | 0.098 | 0.120 |
| Thaddeus Young | 4 | Outlier | PF | sa | 80 | 235 | 0.430 | 0.162 | 0.085 | 0.063 | 0.106 | 0.141 | 0.021 | 0.106 | 0.578 | 0.197 | 0.345 | 0.000 | 0.000 | 0.014 | 0.455 | 0.028 | 0.056 |
| DeMarcus Cousins | 5 | Outlier | C | den | 82 | 270 | 0.640 | 0.122 | 0.158 | 0.043 | 0.115 | 0.281 | 0.029 | 0.216 | 0.456 | 0.216 | 0.475 | 0.324 | 0.058 | 0.173 | 0.736 | 0.151 | 0.201 |
| Trae Young | 6 | Outlier | PG | atl | 73 | 180 | 0.814 | 0.278 | 0.115 | 0.026 | 0.020 | 0.089 | 0.003 | 0.049 | 0.460 | 0.269 | 0.582 | 0.382 | 0.089 | 0.229 | 0.904 | 0.189 | 0.209 |
| Gary Payton II | 7 | Outlier | SG | gs | 75 | 195 | 0.403 | 0.051 | 0.034 | 0.080 | 0.057 | 0.142 | 0.017 | 0.102 | 0.616 | 0.170 | 0.273 | 0.358 | 0.034 | 0.097 | 0.603 | 0.028 | 0.045 |

Look through the prototypes and outliers. Compare their results with your previous findings. Do the prototypes of each cluster match up with your summary of the cluster? How do the outliers fit in? Two outliers can be very different. Pick a few outliers and determine their closest two clusters.
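One way to determine an outlier’s two closest clusters is to compute its distance to every cluster center and keep the two smallest. This is a sketch, assuming `roleKMeans_prep` and the K = 7 `roleKMeans` object from the earlier chunks are still in memory (the player name is just an example):

```r
# distance from one player's standardized stats to each of the 7 cluster
# centers; the two smallest distances mark the player's two closest clusters
player <- "Draymond Green"
center_dist <- apply(roleKMeans$centers, 1,
                     function(ctr) sqrt(sum((roleKMeans_prep[player, ] - ctr)^2)))
sort(center_dist)[1:2]
```

If an outlier’s second-closest center is nearly as close as its assigned one, the player sits between two roles rather than firmly inside either.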

rolefviz

Analyze the K = 7 clusters as a whole. Are the clusters good? Do they have high intra-cluster similarity? What about low inter-cluster similarity? If you were to do the analysis again, would you choose the same number of clusters?

Compare lots of Ks

Select two values of K (between 2 and 10) to compare. This table can become very complex. Remember, the rows are the cluster assignments under the smaller of the two Ks, and the columns are the cluster assignments under the larger K. Isolate and analyze one row or column at a time.

# let's say the student wants to compare K = 3 and K = 7
stu_clus1 <- 7
stu_clus2 <- 3
# ensures that the first chosen cluster is lower.
if(stu_clus1 > stu_clus2) {
  space = stu_clus1
  stu_clus1 = stu_clus2
  stu_clus2 = space
}

set.seed(100)
roleKMeans <- kmeans(roleKMeans_prep, centers = stu_clus1, nstart = 50)
set.seed(100)
roleK2Means <- kmeans(roleKMeans_prep, centers = stu_clus2, nstart = 50)

# creating a tibble of the cluster of each player for each K
clusters <- tibble(
  player = role$Name,
  Cluster = roleKMeans$cluster,
  clusK2 = roleK2Means$cluster
)

compare_table <- with(clusters, table(Cluster, clusK2)) %>%
  as_tibble() %>%
  pivot_wider(names_from = clusK2, values_from = n)

# tabulating clusters
compare_table %>%
  flextable() %>%
  align(align = "center", part = "all")

| Cluster | 1  | 2  | 3  | 4  | 5  | 6  | 7  |
|---------|----|----|----|----|----|----|----|
| 1       | 0  | 9  | 17 | 0  | 8  | 38 | 0  |
| 2       | 1  | 35 | 79 | 0  | 0  | 0  | 90 |
| 3       | 50 | 1  | 0  | 26 | 12 | 0  | 8  |

Part 7: GM of Dallas Mavericks

Returning to the Dallas Mavericks, let’s take a look at how the Mavericks players were clustered in our role dataset. Let’s use K = 7. If you did not analyze K = 7 earlier, it is worth a look.

Below are a few visual reminders of each cluster’s characteristics.

# initializing our datasets a third time in case student decided to remove a variable
role <- nba %>%
  select(Name, POS, Team, Height, Weight, FGP, `3PP`, FTP,  PTSPerMin, ORPerMin, DRPerMin, ASTPerMin, STLPerMin, BLKPerMin, TOPerMin, PFPerMin, FGMPerMin, FGAPerMin, `3PMPerMin`, `3PAPerMin`, FTMPerMin, FTAPerMin) %>%
  mutate(across(where(is.numeric), round, digits = 4))

# standardizing the data for KMeans
roleKMeans_prep <- role %>%
  mutate(across(where(is.numeric), standardize)) %>%
  column_to_rownames(var = "Name") %>%
  select(-Team, -POS)

# creating K = 7 K-Means
set.seed(100)
role7Means <- kmeans(roleKMeans_prep, centers = 7, nstart = 50)

# bar graph of centers
as_tibble(role7Means$centers, rownames = "cluster") %>%
  pivot_longer(cols = c(Height:FTAPerMin), names_to = "variable") %>%
  mutate(variable = factor(variable, role_levels)) %>%
  ggplot(aes(x = variable, y = value, fill = cluster)) +
  geom_bar(stat = "identity") +
  geom_hline(yintercept = 0) +
  coord_flip() +
  facet_grid(cols = vars(cluster), switch = "both") +
  labs(title = "Influence on the Cluster Assignment", x = "", y = "Cluster") +
  theme(axis.text.x = element_blank(),
        legend.position = "none")

# creating tibble of all the centers
role7centers <- as_tibble(role7Means$cluster) %>%
  mutate(Name = role$Name) %>%
  rename(Clusters = value) %>% left_join(role, by = "Name") %>%
  group_by(Clusters) %>%
  summarise(
    across(where(is.numeric), mean)) %>%
  mutate(Clusters = as.character(Clusters)) %>%
  bind_rows(role_summary) %>%
  mutate(across(where(is.numeric), round, digits = 3),
         Height = round(Height, digits = 1),
         Weight = round(Weight, digits = 1))

# printing conditional formatting table
role7centers %>%
  reactable(
    defaultColDef = colDef(
      cell = color_tiles(.)
    ))

Before moving on, fill out this table to describe each cluster. Write a few descriptive words that distinguish each cluster. This will help you to organize your thoughts on each cluster. If you already completed this for K = 7 in the role dataset, then you are free to proceed.

stu_table <- tibble(
  Cluster = 1:7,
  Description = "")

stu_table %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 2, width = 4)

| Cluster | Description |
|---------|-------------|
| 1       |             |
| 2       |             |
| 3       |             |
| 4       |             |
| 5       |             |
| 6       |             |
| 7       |             |

Below is Caleb’s estimation of the 7 clusters.

Caleb_table <- tibble(
  Cluster = 1:7,
  Description = c("big men, mediocre scorers, kinda shoot deep",
                  "small point guards, facilitators",
                  "meh players, 3 point shooters",
                  "big men, can't shoot deep at all",
                  "high-volume players, generally tall",
                  "high-volume players, average height",
                  "low production, very mediocre, likely corner 3 players"))

Caleb_table %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 2, width = 4)

| Cluster | Description |
|---------|-------------|
| 1 | big men, mediocre scorers, kinda shoot deep |
| 2 | small point guards, facilitators |
| 3 | meh players, 3 point shooters |
| 4 | big men, can't shoot deep at all |
| 5 | high-volume players, generally tall |
| 6 | high-volume players, average height |
| 7 | low production, very mediocre, likely corner 3 players |

Mavericks Offseason Analysis

Now, let’s look at the cluster assignments of our ten Dallas Mavericks players.

role7Means_players <- role7Means$cluster %>%
  as_tibble() %>%
  rename(Cluster = value) %>%
  mutate(
    Name = role$Name
  ) %>%
  left_join(role, by = "Name") %>%
  left_join(usage %>% select(Name, MIN), by = "Name") %>%
  relocate(Cluster, .after = Name) %>%
  relocate(MIN, .after = POS) %>%
  arrange(Cluster)

dallas_role2022 <- role7Means_players %>%
  filter(Team == "dal") %>%
  select(-Team)

dallas_role2022 %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:10), width = .6) %>%
  width(j = c(11:14), width = .95)

| Name | Cluster | POS | MIN | Height | Weight | FGP | 3PP | FTP | PTSPerMin | ORPerMin | DRPerMin | ASTPerMin | STLPerMin | BLKPerMin | TOPerMin | PFPerMin | FGMPerMin | FGAPerMin | 3PMPerMin | 3PAPerMin | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dwight Powell | 1 | C | 21.9 | 82 | 240 | 0.671 | 0.351 | 0.783 | 0.3973 | 0.0959 | 0.1279 | 0.0548 | 0.0228 | 0.0228 | 0.0365 | 0.1233 | 0.1507 | 0.2237 | 0.0091 | 0.0228 | 0.0913 | 0.1187 |
| Jalen Brunson | 2 | PG | 31.9 | 73 | 190 | 0.502 | 0.373 | 0.840 | 0.5110 | 0.0157 | 0.1066 | 0.1505 | 0.0251 | 0.0000 | 0.0502 | 0.0596 | 0.2006 | 0.4013 | 0.0376 | 0.1003 | 0.0721 | 0.0846 |
| Tim Hardaway Jr. | 3 | SF | 29.6 | 77 | 205 | 0.394 | 0.336 | 0.757 | 0.4797 | 0.0101 | 0.1149 | 0.0743 | 0.0304 | 0.0034 | 0.0270 | 0.0608 | 0.1689 | 0.4257 | 0.0811 | 0.2432 | 0.0642 | 0.0845 |
| Kristaps Porzingis | 5 | C | 29.5 | 87 | 240 | 0.451 | 0.283 | 0.865 | 0.6508 | 0.0644 | 0.1966 | 0.0678 | 0.0237 | 0.0576 | 0.0542 | 0.0881 | 0.2271 | 0.5051 | 0.0475 | 0.1729 | 0.1458 | 0.1695 |
| Luka Doncic | 6 | PG | 35.4 | 79 | 230 | 0.457 | 0.353 | 0.744 | 0.8023 | 0.0254 | 0.2345 | 0.2458 | 0.0339 | 0.0169 | 0.1271 | 0.0621 | 0.2797 | 0.6102 | 0.0876 | 0.2486 | 0.1582 | 0.2119 |
| Dorian Finney-Smith | 7 | PF | 33.1 | 79 | 220 | 0.471 | 0.395 | 0.675 | 0.3323 | 0.0453 | 0.0967 | 0.0574 | 0.0332 | 0.0151 | 0.0302 | 0.0695 | 0.1239 | 0.2628 | 0.0665 | 0.1631 | 0.0211 | 0.0302 |
| Reggie Bullock | 7 | SF | 28.0 | 78 | 205 | 0.401 | 0.360 | 0.833 | 0.3071 | 0.0179 | 0.1107 | 0.0429 | 0.0214 | 0.0071 | 0.0214 | 0.0571 | 0.1071 | 0.2643 | 0.0750 | 0.2071 | 0.0214 | 0.0250 |
| Maxi Kleber | 7 | PF | 24.6 | 82 | 240 | 0.398 | 0.325 | 0.708 | 0.2846 | 0.0488 | 0.1911 | 0.0488 | 0.0203 | 0.0407 | 0.0325 | 0.0935 | 0.0976 | 0.2439 | 0.0569 | 0.1748 | 0.0325 | 0.0447 |
| Josh Green | 7 | SG | 15.5 | 77 | 200 | 0.508 | 0.359 | 0.689 | 0.3097 | 0.0516 | 0.1032 | 0.0774 | 0.0452 | 0.0129 | 0.0452 | 0.1097 | 0.1226 | 0.2452 | 0.0258 | 0.0774 | 0.0323 | 0.0452 |
| Sterling Brown | 7 | SF | 12.8 | 77 | 219 | 0.381 | 0.304 | 0.933 | 0.2578 | 0.0391 | 0.1953 | 0.0547 | 0.0234 | 0.0078 | 0.0391 | 0.0859 | 0.0937 | 0.2500 | 0.0469 | 0.1484 | 0.0234 | 0.0234 |

What do you notice about the player assignments? How many clusters do the Mavericks have represented? Which cluster is the most common on the Mavericks team?

Why is cluster 7 the most common? What kind of player is in cluster 7?

The Mavericks experienced a bit of turnover in the 2022 offseason. They’d already traded away C Kristaps Porzingis for SG Spencer Dinwiddie at the end of the 2022 season, and they lost productive SG Jalen Brunson to free agency. They then traded away SF Sterling Brown and other assets for C Christian Wood in the summer of 2022.

Let’s assess the offseason moves of the Dallas Mavericks by looking at the opening day roster for 2023 and its cluster distribution. Below are the eleven players on the Dallas Mavericks roster at Game 1 of the 2023 season, a loss against the Phoenix Suns.

dallas_role2023 <- role7Means_players %>%
  filter(Name %in% c("JaVale McGee", "Reggie Bullock", "Dorian Finney-Smith", "Spencer Dinwiddie", "Luka Doncic", "Tim Hardaway Jr.", "Maxi Kleber", "Christian Wood", "Josh Green", "Dwight Powell", "Davis Bertans")) %>%
  select(-Team) %>%
  arrange(Cluster)

dallas_role2023 %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:10), width = .6) %>%
  width(j = c(11:14), width = .95)

| Name | Cluster | POS | MIN | Height | Weight | FGP | 3PP | FTP | PTSPerMin | ORPerMin | DRPerMin | ASTPerMin | STLPerMin | BLKPerMin | TOPerMin | PFPerMin | FGMPerMin | FGAPerMin | 3PMPerMin | 3PAPerMin | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dwight Powell | 1 | C | 21.9 | 82 | 240 | 0.671 | 0.351 | 0.783 | 0.3973 | 0.0959 | 0.1279 | 0.0548 | 0.0228 | 0.0228 | 0.0365 | 0.1233 | 0.1507 | 0.2237 | 0.0091 | 0.0228 | 0.0913 | 0.1187 |
| Tim Hardaway Jr. | 3 | SF | 29.6 | 77 | 205 | 0.394 | 0.336 | 0.757 | 0.4797 | 0.0101 | 0.1149 | 0.0743 | 0.0304 | 0.0034 | 0.0270 | 0.0608 | 0.1689 | 0.4257 | 0.0811 | 0.2432 | 0.0642 | 0.0845 |
| Spencer Dinwiddie | 3 | PG | 30.2 | 77 | 215 | 0.376 | 0.310 | 0.811 | 0.4172 | 0.0265 | 0.1291 | 0.1921 | 0.0199 | 0.0066 | 0.0563 | 0.0795 | 0.1391 | 0.3709 | 0.0530 | 0.1689 | 0.0861 | 0.1093 |
| Davis Bertans | 3 | SF | 14.7 | 82 | 225 | 0.351 | 0.319 | 0.933 | 0.3878 | 0.0136 | 0.1088 | 0.0340 | 0.0204 | 0.0136 | 0.0272 | 0.1088 | 0.1224 | 0.3401 | 0.0952 | 0.2857 | 0.0544 | 0.0612 |
| JaVale McGee | 4 | C | 15.8 | 84 | 270 | 0.629 | 0.222 | 0.699 | 0.5823 | 0.1392 | 0.2848 | 0.0380 | 0.0190 | 0.0696 | 0.0823 | 0.1519 | 0.2468 | 0.3924 | 0.0000 | 0.0063 | 0.0886 | 0.1266 |
| Christian Wood | 5 | C | 30.8 | 82 | 214 | 0.501 | 0.390 | 0.623 | 0.5812 | 0.0519 | 0.2760 | 0.0747 | 0.0260 | 0.0325 | 0.0617 | 0.0812 | 0.2110 | 0.4188 | 0.0617 | 0.1591 | 0.0974 | 0.1591 |
| Luka Doncic | 6 | PG | 35.4 | 79 | 230 | 0.457 | 0.353 | 0.744 | 0.8023 | 0.0254 | 0.2345 | 0.2458 | 0.0339 | 0.0169 | 0.1271 | 0.0621 | 0.2797 | 0.6102 | 0.0876 | 0.2486 | 0.1582 | 0.2119 |
| Dorian Finney-Smith | 7 | PF | 33.1 | 79 | 220 | 0.471 | 0.395 | 0.675 | 0.3323 | 0.0453 | 0.0967 | 0.0574 | 0.0332 | 0.0151 | 0.0302 | 0.0695 | 0.1239 | 0.2628 | 0.0665 | 0.1631 | 0.0211 | 0.0302 |
| Reggie Bullock | 7 | SF | 28.0 | 78 | 205 | 0.401 | 0.360 | 0.833 | 0.3071 | 0.0179 | 0.1107 | 0.0429 | 0.0214 | 0.0071 | 0.0214 | 0.0571 | 0.1071 | 0.2643 | 0.0750 | 0.2071 | 0.0214 | 0.0250 |
| Maxi Kleber | 7 | PF | 24.6 | 82 | 240 | 0.398 | 0.325 | 0.708 | 0.2846 | 0.0488 | 0.1911 | 0.0488 | 0.0203 | 0.0407 | 0.0325 | 0.0935 | 0.0976 | 0.2439 | 0.0569 | 0.1748 | 0.0325 | 0.0447 |
| Josh Green | 7 | SG | 15.5 | 77 | 200 | 0.508 | 0.359 | 0.689 | 0.3097 | 0.0516 | 0.1032 | 0.0774 | 0.0452 | 0.0129 | 0.0452 | 0.1097 | 0.1226 | 0.2452 | 0.0258 | 0.0774 | 0.0323 | 0.0452 |

The roster looks somewhat similar, but what classification of player did the Mavericks lose in the 2022 season and not return in the 2023 season? What classification of player did the Mavericks gain in the 2023 season?

Answer: They lost a cluster 2 player and a cluster 7 player, and they gained two cluster 3 players and a cluster 4 player.

What kind of player is in cluster 2? What would losing this kind of player do to a team?

Dallas Mavericks Trade

Let’s say you’re the GM of the Dallas Mavericks after game 1 of the 2022-2023 season. Which players would you consider trading and what cluster of player would you hope to acquire? Which players are you willing to give up?

Answer: I think the correct answer here is to give up players from cluster 3 or 7 in exchange for a cluster 2 player. Maxi Kleber is the most expendable because he has some features of clusters 1, 4, and 5 along with cluster 7, and the team has an excess of these players.

Select four players you are willing to trade and one cluster that you are looking for.

# let's say the student is smart and chooses
trading <- c("Davis Bertans", "Spencer Dinwiddie", "Maxi Kleber", "Dwight Powell")
# and is looking for a player in cluster...
looking <- 2
looking_clus <- role7Means_players %>%
  filter(Cluster == looking)

looking_clus %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:5), width = .6) %>%
  width(j = c(6:12), width = .95)

| Name | Cluster | POS | MIN | Team | Height | Weight | FGP | 3PP | FTP | PTSPerMin | ORPerMin | DRPerMin | ASTPerMin | STLPerMin | BLKPerMin | TOPerMin | PFPerMin | FGMPerMin | FGAPerMin | 3PMPerMin | 3PAPerMin | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lou Williams | 2 | SG | 14.3 | atl | 73 | 175 | 0.391 | 0.363 | 0.859 | 0.4406 | 0.0210 | 0.0909 | 0.1329 | 0.0350 | 0.0070 | 0.0559 | 0.0629 | 0.1538 | 0.3986 | 0.0490 | 0.1259 | 0.0839 | 0.0979 |
| Dennis Schroder | 2 | PG | 29.2 | bos | 75 | 172 | 0.440 | 0.349 | 0.848 | 0.4932 | 0.0205 | 0.0959 | 0.1438 | 0.0274 | 0.0034 | 0.0719 | 0.0822 | 0.1781 | 0.4075 | 0.0479 | 0.1336 | 0.0856 | 0.1027 |
| Marcus Smart | 2 | PG | 32.3 | bos | 75 | 220 | 0.418 | 0.331 | 0.793 | 0.3746 | 0.0186 | 0.0991 | 0.1827 | 0.0526 | 0.0093 | 0.0681 | 0.0712 | 0.1300 | 0.3127 | 0.0526 | 0.1579 | 0.0619 | 0.0774 |
| Ish Smith | 2 | PG | 13.8 | cha | 72 | 175 | 0.395 | 0.400 | 0.632 | 0.3261 | 0.0217 | 0.0870 | 0.1884 | 0.0362 | 0.0217 | 0.0725 | 0.0652 | 0.1449 | 0.3623 | 0.0217 | 0.0507 | 0.0217 | 0.0362 |
| Lonzo Ball | 2 | PG | 34.6 | chi | 78 | 190 | 0.423 | 0.423 | 0.750 | 0.3757 | 0.0289 | 0.1272 | 0.1474 | 0.0520 | 0.0260 | 0.0665 | 0.0694 | 0.1329 | 0.3150 | 0.0896 | 0.2139 | 0.0173 | 0.0231 |
| Alex Caruso | 2 | SG | 28.0 | chi | 76 | 186 | 0.398 | 0.333 | 0.795 | 0.2643 | 0.0286 | 0.1000 | 0.1429 | 0.0607 | 0.0143 | 0.0500 | 0.0929 | 0.0893 | 0.2214 | 0.0357 | 0.1107 | 0.0500 | 0.0643 |
| Ricky Rubio | 2 | PG | 28.5 | cle | 75 | 190 | 0.363 | 0.339 | 0.854 | 0.4596 | 0.0140 | 0.1298 | 0.2316 | 0.0491 | 0.0070 | 0.0912 | 0.0772 | 0.1544 | 0.4246 | 0.0596 | 0.1789 | 0.0912 | 0.1053 |
| Brandon Goodwin | 2 | G | 13.9 | cle | 72 | 180 | 0.416 | 0.345 | 0.632 | 0.3453 | 0.0288 | 0.1079 | 0.1799 | 0.0504 | 0.0000 | 0.0719 | 0.0791 | 0.1295 | 0.3094 | 0.0360 | 0.1079 | 0.0504 | 0.0791 |
| Jalen Brunson | 2 | PG | 31.9 | dal | 73 | 190 | 0.502 | 0.373 | 0.840 | 0.5110 | 0.0157 | 0.1066 | 0.1505 | 0.0251 | 0.0000 | 0.0502 | 0.0596 | 0.2006 | 0.4013 | 0.0376 | 0.1003 | 0.0721 | 0.0846 |
| Facundo Campazzo | 2 | PG | 18.2 | den | 70 | 195 | 0.361 | 0.301 | 0.769 | 0.2802 | 0.0220 | 0.0769 | 0.1868 | 0.0549 | 0.0220 | 0.0549 | 0.1044 | 0.0879 | 0.2527 | 0.0495 | 0.1648 | 0.0495 | 0.0659 |
| Cory Joseph | 2 | PG | 24.6 | det | 75 | 200 | 0.445 | 0.414 | 0.885 | 0.3252 | 0.0163 | 0.0894 | 0.1463 | 0.0244 | 0.0122 | 0.0528 | 0.0935 | 0.1098 | 0.2520 | 0.0407 | 0.0976 | 0.0610 | 0.0691 |
| Killian Hayes | 2 | PG | 25.0 | det | 77 | 195 | 0.383 | 0.263 | 0.770 | 0.2760 | 0.0200 | 0.1040 | 0.1680 | 0.0480 | 0.0200 | 0.0680 | 0.1120 | 0.1080 | 0.2800 | 0.0280 | 0.1000 | 0.0360 | 0.0440 |
| Saben Lee | 2 | PG | 16.3 | det | 74 | 183 | 0.390 | 0.233 | 0.789 | 0.3436 | 0.0307 | 0.1166 | 0.1779 | 0.0613 | 0.0184 | 0.0613 | 0.0736 | 0.1166 | 0.2945 | 0.0245 | 0.0982 | 0.0920 | 0.1166 |
| Draymond Green | 2 | PF | 28.9 | gs | 78 | 230 | 0.525 | 0.296 | 0.659 | 0.2595 | 0.0346 | 0.2180 | 0.2422 | 0.0450 | 0.0381 | 0.1038 | 0.1038 | 0.1003 | 0.1938 | 0.0104 | 0.0415 | 0.0450 | 0.0692 |
| Kevin Porter Jr. | 2 | SG | 31.3 | hou | 76 | 203 | 0.415 | 0.375 | 0.642 | 0.4984 | 0.0224 | 0.1182 | 0.1981 | 0.0351 | 0.0128 | 0.0990 | 0.0831 | 0.1757 | 0.4217 | 0.0799 | 0.2173 | 0.0639 | 0.1022 |
| Josh Christopher | 2 | SG | 18.0 | hou | 77 | 215 | 0.448 | 0.296 | 0.735 | 0.4389 | 0.0389 | 0.1000 | 0.1111 | 0.0500 | 0.0111 | 0.0833 | 0.0722 | 0.1667 | 0.3778 | 0.0444 | 0.1444 | 0.0611 | 0.0833 |
| D.J. Augustin | 2 | G | 15.0 | hou | 71 | 183 | 0.404 | 0.406 | 0.868 | 0.3600 | 0.0133 | 0.0667 | 0.1467 | 0.0200 | 0.0000 | 0.0867 | 0.0333 | 0.1067 | 0.2667 | 0.0733 | 0.1867 | 0.0667 | 0.0733 |
| Tyrese Haliburton | 2 | PG | 36.1 | ind | 77 | 185 | 0.502 | 0.416 | 0.849 | 0.4848 | 0.0222 | 0.0970 | 0.2659 | 0.0499 | 0.0166 | 0.0886 | 0.0526 | 0.1717 | 0.3435 | 0.0609 | 0.1468 | 0.0776 | 0.0914 |
| T.J. McConnell | 2 | PG | 24.1 | ind | 73 | 190 | 0.481 | 0.303 | 0.826 | 0.3527 | 0.0290 | 0.1079 | 0.2033 | 0.0456 | 0.0166 | 0.0456 | 0.0830 | 0.1535 | 0.3195 | 0.0166 | 0.0498 | 0.0290 | 0.0373 |
| Keifer Sykes | 2 | G | 17.7 | ind | 71 | 167 | 0.363 | 0.300 | 0.882 | 0.3164 | 0.0169 | 0.0678 | 0.1073 | 0.0226 | 0.0056 | 0.0565 | 0.0904 | 0.1243 | 0.3333 | 0.0452 | 0.1582 | 0.0282 | 0.0282 |
| Eric Bledsoe | 2 | SG | 25.2 | lac | 73 | 214 | 0.421 | 0.313 | 0.761 | 0.3929 | 0.0198 | 0.1151 | 0.1667 | 0.0516 | 0.0159 | 0.0833 | 0.0635 | 0.1429 | 0.3452 | 0.0357 | 0.1190 | 0.0635 | 0.0873 |
| De'Anthony Melton | 2 | SG | 22.7 | mem | 74 | 200 | 0.404 | 0.374 | 0.750 | 0.4758 | 0.0396 | 0.1586 | 0.1189 | 0.0617 | 0.0220 | 0.0661 | 0.0793 | 0.1674 | 0.4185 | 0.0837 | 0.2247 | 0.0529 | 0.0705 |
| Tyus Jones | 2 | PG | 21.2 | mem | 72 | 196 | 0.451 | 0.390 | 0.818 | 0.4104 | 0.0094 | 0.1038 | 0.2075 | 0.0425 | 0.0000 | 0.0283 | 0.0189 | 0.1604 | 0.3585 | 0.0519 | 0.1321 | 0.0330 | 0.0425 |
| Kyle Lowry | 2 | PG | 33.9 | mia | 72 | 196 | 0.440 | 0.377 | 0.851 | 0.3953 | 0.0147 | 0.1180 | 0.2212 | 0.0324 | 0.0088 | 0.0796 | 0.0826 | 0.1298 | 0.2950 | 0.0678 | 0.1799 | 0.0678 | 0.0826 |
| Gabe Vincent | 2 | PG | 23.4 | mia | 75 | 200 | 0.417 | 0.368 | 0.815 | 0.3718 | 0.0128 | 0.0641 | 0.1325 | 0.0385 | 0.0085 | 0.0598 | 0.0983 | 0.1325 | 0.3205 | 0.0769 | 0.2051 | 0.0256 | 0.0342 |
| Jrue Holiday | 2 | PG | 32.9 | mil | 75 | 205 | 0.501 | 0.411 | 0.761 | 0.5562 | 0.0304 | 0.1064 | 0.2067 | 0.0486 | 0.0122 | 0.0821 | 0.0608 | 0.2158 | 0.4316 | 0.0608 | 0.1459 | 0.0608 | 0.0821 |
| Patrick Beverley | 2 | PG | 25.4 | min | 73 | 180 | 0.406 | 0.343 | 0.722 | 0.3622 | 0.0433 | 0.1220 | 0.1811 | 0.0472 | 0.0354 | 0.0512 | 0.1181 | 0.1220 | 0.2953 | 0.0551 | 0.1654 | 0.0669 | 0.0906 |
| Jordan McLaughlin | 2 | PG | 14.5 | min | 71 | 185 | 0.440 | 0.318 | 0.750 | 0.2621 | 0.0276 | 0.0828 | 0.2000 | 0.0621 | 0.0138 | 0.0414 | 0.0621 | 0.0966 | 0.2207 | 0.0276 | 0.0966 | 0.0345 | 0.0414 |
| Jose Alvarado | 2 | PG | 15.4 | no | 72 | 179 | 0.446 | 0.291 | 0.679 | 0.3961 | 0.0325 | 0.0909 | 0.1818 | 0.0844 | 0.0065 | 0.0455 | 0.0909 | 0.1558 | 0.3506 | 0.0390 | 0.1299 | 0.0455 | 0.0649 |
| Josh Giddey | 2 | SG | 31.5 | okc | 80 | 205 | 0.419 | 0.263 | 0.709 | 0.3968 | 0.0571 | 0.1905 | 0.2032 | 0.0286 | 0.0127 | 0.1016 | 0.0508 | 0.1651 | | | | | |

0.3937

0.0317

0.1238

0.0317

0.0476

Theo Maledon

2

PG

17.8

okc

76

175

0.375

0.293

0.790

0.3989

0.0225

0.1236

0.1236

0.0337

0.0112

0.0730

0.0730

0.1292

0.3483

0.0506

0.1629

0.0843

0.1124

Jalen Suggs

2

SG

27.2

orl

76

205

0.361

0.214

0.773

0.4338

0.0184

0.1103

0.1618

0.0441

0.0147

0.1103

0.1103

0.1507

0.4191

0.0331

0.1507

0.0956

0.1250

R.J. Hampton

2

PG

21.9

orl

76

175

0.383

0.350

0.641

0.3470

0.0183

0.1233

0.1142

0.0320

0.0091

0.0639

0.0731

0.1233

0.3242

0.0457

0.1324

0.0548

0.0822

Chris Paul

2

PG

32.9

phx

72

175

0.493

0.317

0.837

0.4468

0.0091

0.1216

0.3283

0.0578

0.0091

0.0729

0.0638

0.1702

0.3435

0.0304

0.0942

0.0790

0.0942

Cameron Payne

2

PG

22.0

phx

73

183

0.409

0.336

0.843

0.4909

0.0182

0.1182

0.2227

0.0318

0.0136

0.0818

0.0955

0.1864

0.4591

0.0545

0.1636

0.0591

0.0682

Dennis Smith Jr.

2

PG

17.3

por

74

205

0.418

0.222

0.656

0.3237

0.0289

0.1040

0.2081

0.0694

0.0173

0.0809

0.0809

0.1214

0.2948

0.0116

0.0405

0.0636

0.0983

Tyrese Haliburton1

2

PG

34.5

sac

77

185

0.457

0.413

0.837

0.4145

0.0232

0.0899

0.2145

0.0493

0.0203

0.0667

0.0406

0.1536

0.3333

0.0580

0.1420

0.0493

0.0580

Davion Mitchell

2

PG

27.7

sac

74

205

0.418

0.316

0.659

0.4152

0.0144

0.0650

0.1516

0.0253

0.0108

0.0542

0.0686

0.1697

0.4043

0.0469

0.1552

0.0253

0.0397

Derrick White1

2

PG

30.3

sa

76

190

0.426

0.314

0.869

0.4752

0.0165

0.0990

0.1848

0.0330

0.0297

0.0594

0.0792

0.1650

0.3828

0.0561

0.1749

0.0924

0.1089

Tre Jones

2

PG

16.6

sa

73

185

0.490

0.196

0.780

0.3614

0.0241

0.1084

0.2048

0.0361

0.0060

0.0422

0.0663

0.1446

0.2952

0.0060

0.0422

0.0602

0.0783

Malachi Flynn

2

PG

12.2

tor

73

175

0.393

0.333

0.625

0.3525

0.0164

0.0984

0.1311

0.0410

0.0082

0.0246

0.0820

0.1311

0.3443

0.0574

0.1639

0.0246

0.0410

Mike Conley

2

PG

28.6

utah

73

175

0.435

0.408

0.796

0.4790

0.0245

0.0839

0.1853

0.0455

0.0105

0.0594

0.0699

0.1678

0.3846

0.0804

0.2028

0.0629

0.0804

Ish Smith1

2

PG

22.0

wsh

72

175

0.457

0.357

0.600

0.3909

0.0227

0.1136

0.2364

0.0455

0.0227

0.0682

0.0727

0.1818

0.4000

0.0227

0.0682

0.0045

0.0091

Raul Neto

2

PG

19.6

wsh

73

180

0.463

0.292

0.769

0.3827

0.0102

0.0867

0.1582

0.0408

0.0000

0.0561

0.0765

0.1480

0.3214

0.0255

0.0867

0.0612

0.0765

Aaron Holiday

2

G

16.2

wsh

72

185

0.467

0.343

0.800

0.3765

0.0123

0.0864

0.1173

0.0370

0.0123

0.0617

0.0926

0.1481

0.3210

0.0370

0.0988

0.0432

0.0556

From the list, choose a player you like from a team that has several of these types of players; a team with a surplus at the position is more likely to part ways with one of them. Assess the strengths of the pertinent players and propose a trade. How does it look?

Feel free to make the trade as complex as you wish, but try to propose something the opposing team would plausibly accept.
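To find teams with a surplus of this player type, you can tally cluster members by team with `dplyr`. A minimal sketch, using a few rows transcribed from the table above (the data frame name `cluster_members` is just an illustration; in practice you would count over your full clustered data set):

```r
library(tidyverse)

# A handful of rows transcribed from the cluster 2 table above
cluster_members <- tribble(
  ~player,            ~team, ~cluster,
  "Cory Joseph",      "det", 2,
  "Killian Hayes",    "det", 2,
  "Saben Lee",        "det", 2,
  "Kevin Porter Jr.", "hou", 2,
  "Josh Christopher", "hou", 2,
  "D.J. Augustin",    "hou", 2,
  "Jalen Brunson",    "dal", 2
)

# Count cluster 2 players per team, largest counts first
cluster_members %>%
  filter(cluster == 2) %>%
  count(team, sort = TRUE)
```

In the full table, Detroit, Houston, Indiana, and Washington each place three players in this cluster, so those teams are natural trading partners.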

Defend your proposed trade using the cluster information. You may add in some basketball knowledge if you like.
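One way to build that defense is to put the per-minute profiles of the players involved side by side. A sketch with two cluster 2 guards transcribed from the table above (the object name `candidates` and the derived assist-to-turnover column are illustrative, not part of the module's code):

```r
library(tidyverse)

# Per-minute rates for two candidate guards, taken from the table above
candidates <- tribble(
  ~player,         ~pts_min, ~ast_min, ~tov_min, ~stl_min,
  "Jalen Brunson",   0.5110,   0.1505,   0.0502,   0.0251,
  "Tyus Jones",      0.4104,   0.2075,   0.0283,   0.0425
)

# A derived ratio can sharpen the argument: Jones protects the
# ball better per assist, while Brunson scores more per minute
candidates %>%
  mutate(ast_to_tov = ast_min / tov_min)
```

Comparisons like this let you argue in the cluster's own terms: both players belong to the same group, but within it they trade scoring for playmaking efficiency.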

What do you think of this process? What are the strengths and weaknesses of evaluating a team based on cluster membership?