library(tidyverse) #loads package
library(DescTools)
library(effectsize)
library(pwr)
tennis <- read_csv("wimbledon_featured_matches.csv") #loads the tennis data
Tennis Sample Size
Introduction
In this module, you will compute the required sample size for studies involving tennis data, using data from Wimbledon 2023 matches to estimate required parameters.
NOTE: R is the name of the programming language itself and RStudio is a convenient interface. To throw even more lingo in, you may be accessing RStudio through a web-based version called Posit Cloud. But R is the programming language you are learning :)
Getting started: Tennis data
The first step to any analysis in R is to load necessary packages and data.
You can think of packages like apps on your phone; they extend the functionality and give you access to many more features beyond what comes in the “base package”.
Running the following code will load the tidyverse package (along with the DescTools, effectsize, and pwr packages we will use later in the lab) and the tennis data we will be using in this lab.
TIP: As you follow along in the lab, you should run each corresponding code chunk in your .qmd document. To “Run” a code chunk, you can press the green “Play” button in the top right corner of the code chunk in your .qmd. You can also place your cursor anywhere in the line(s) of code you want to run and press “command + return” (Mac) or “Ctrl + Enter” (Windows).
TIP: Using a hashtag in R allows you to add comments to your code (in plain English). Data scientists often use comments to explain what each piece of the code is doing.
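For example, R ignores everything after the hashtag when the code runs:
# this entire line is a comment and does nothing when run
sqrt(16)  # a comment can also sit at the end of a line of code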
We can use the glimpse() function to get a quick look (errr.. glimpse) at our tennis data. The glimpse() output provides the number of observations (Rows) and the number of variables (Columns) in the dataset. The “Rows” and “Columns” are referred to as the dimensions of the dataset. It also shows us the names of the variables (match_id, player1, …, return_depth) and the first few observations for each variable (e.g. the first match in the dataset has id “2023-wimbledon-1301” and was Carlos Alcaraz playing Nicolas Jarry).
glimpse(tennis)
Rows: 7,284
Columns: 46
$ match_id <chr> "2023-wimbledon-1301", "2023-wimbledon-1301", "2023…
$ player1 <chr> "Carlos Alcaraz", "Carlos Alcaraz", "Carlos Alcaraz…
$ player2 <chr> "Nicolas Jarry", "Nicolas Jarry", "Nicolas Jarry", …
$ elapsed_time <time> 00:00:00, 00:00:38, 00:01:01, 00:01:31, 00:02:21, …
$ set_no <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ game_no <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, …
$ point_no <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
$ p1_sets <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_sets <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_games <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, …
$ p2_games <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_score <chr> "0", "0", "15", "15", "30", "40", "40", "AD", "40",…
$ p2_score <chr> "0", "15", "15", "30", "30", "30", "40", "40", "40"…
$ server <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, …
$ serve_no <dbl> 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, …
$ point_victor <dbl> 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, …
$ p1_points_won <dbl> 0, 1, 1, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 8, 8, 9, 9, …
$ p2_points_won <dbl> 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 7, 7, 8, …
$ game_victor <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
$ set_victor <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_ace <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_ace <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
$ p1_winner <dbl> 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_winner <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
$ winner_shot_type <chr> "0", "0", "0", "F", "0", "0", "0", "F", "0", "0", "…
$ p1_double_fault <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_double_fault <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_unf_err <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_unf_err <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
$ p1_net_pt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_net_pt <dbl> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_net_pt_won <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_net_pt_won <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_break_pt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_break_pt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_break_pt_won <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_break_pt_won <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_break_pt_missed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_break_pt_missed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_distance_run <dbl> 6.000, 5.253, 13.800, 51.108, 0.649, 5.291, 6.817, …
$ p2_distance_run <dbl> 7.840, 7.094, 19.808, 75.631, 0.813, 4.249, 17.821,…
$ rally_count <dbl> 2, 1, 4, 13, 1, 2, 1, 6, 7, 5, 1, 4, 4, 3, 1, 2, 1,…
$ speed_mph <dbl> 95, 118, 120, 130, 112, 97, 109, 105, 128, 110, 112…
$ serve_width <chr> "BC", "B", "B", "BW", "W", "BW", "W", "B", "BC", "B…
$ serve_depth <chr> "NCTL", "CTL", "NCTL", "CTL", "NCTL", "NCTL", "CTL"…
$ return_depth <chr> "ND", "ND", "D", "D", NA, "ND", "D", "ND", "D", "ND…
ERROR? Did you get an error message that says could not find function "glimpse"? This means you need to load the tidyverse package. You can do this by running the code library(tidyverse) from the previous code chunk. A shortcut is to hit the “fast-forward” button (next to the “Play” button in your code chunk), which will run all code chunks above your current one.
Tennis Data Overview
Before proceeding with any analysis, let’s make sure we understand the information contained in the key variables (columns) of our dataset.
The dataset is from the men’s singles draw of the 2023 Wimbledon Championships, perhaps the most important tennis tournament each year.
Totally new to Tennis? See this site: INTRODUCTION TO TENNIS SCORING
For more information about Wimbledon: Wimbledon Official Site
Variable descriptions
We will actually only use a few columns for this module, but the full description of the data is provided. Some variables have data for both players, with column labels starting with “p1” for player 1 and “p2” for player 2. We define these for player 1, but the definitions hold for the corresponding player 2 variables.
Variable | Definition |
---|---|
match_id | match identification |
player1 | first and last name of the first player |
player2 | first and last name of the second player |
elapsed_time | time elapsed since start of first point to start of current point (H:MM:SS) |
set_no | set number in match |
game_no | game number in set |
point_no | point number in game |
p1_sets | sets won by player 1 |
p1_games | games won by player 1 in current set |
p1_score | player 1's score within current game |
server | server of the point |
serve_no | first or second serve |
point_victor | winner of the point |
p1_points_won | number of points won by player 1 in match |
game_victor | a player won a game this point |
set_victor | a player won a set this point |
p1_ace | player 1 hit an untouchable winning serve |
p1_winner | player 1 hit an untouchable winning shot |
winner_shot_type | category of untouchable shot |
p1_double_fault | player 1 missed both serves and lost the point |
p1_unf_err | player 1 made an unforced error |
p1_net_pt | player 1 made it to the net |
p1_net_pt_won | player 1 won the point while at the net |
p1_break_pt | player 1 has an opportunity to win a game player 2 is serving |
p1_break_pt_won | player 1 won the game player 2 is serving |
p1_break_pt_missed | player 1 missed an opportunity to win a game player 2 is serving |
p1_distance_run | player 1's distance ran during point (meters) |
rally_count | number of shots during the point |
speed_mph | speed of serve (miles per hour; mph) |
serve_width | direction of serve |
serve_depth | depth of serve |
return_depth | depth of return |
Viewing your data
You saw that glimpse()
is one way to get a quick look at your data. Often, you’ll want to view your whole dataset. There are two ways to do this:
TIP: Recall that RStudio is split into four quadrants: Source (upper left), Environment (upper right), Console (bottom left), and Files/Plots/Packages/Help/Viewer (bottom right)
- Type View(tennis) in your Console and then press return/Enter on your keyboard.
- OR, in your Environment tab, double click the name of the dataset you want to view.
This will open up your data in a new viewer tab so that you can view it like a spreadsheet (like Google Sheets or Excel*). Once open, you can sort the data by clicking on a column.
*Unlike Google Sheets or Excel, however, you won’t be able to edit the data directly in the spreadsheet.
TIP: Type your answers to each exercise in the .qmd document.
TIP: When viewing the data, clicking on a column once will sort the data according to that variable in ascending order; clicking twice will sort in descending order.
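If you prefer to sort with code rather than clicks, the tidyverse function arrange() does the same job; here is a quick sketch using the speed_mph column:
tennis |> arrange(desc(speed_mph))  # fastest serves first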
Sample Size Overview
We will use statistical hypothesis testing to help address research questions using data (samples). The tests involve the basic hypotheses:
\(H_0\): there is no effect/difference, the “status quo” (null hypothesis)
\(H_a\): there is an effect/difference (alternative hypothesis)
We are typically hoping to use the data to “reject” the null hypothesis and provide evidence that there is an effect. As a simple example, suppose we develop a new training method to improve serving accuracy in tennis. We will set up an experiment to compare percentage of first serves that are in (accurate) before and after the new training is applied. The hypotheses would be:
\(H_0\): there is no difference in first serve percentage before and after the training (null hypothesis)
\(H_a\): there is a difference (improvement) in first serve percentage after the training (alternative hypothesis)
The question to answer before running the experiment is: how large a sample should we collect in order to test the hypotheses?
Why does sample size matter?
There are several things that drive the need for sample size calculations. One is resources. Conducting an experiment is often costly (in money, time, and the difficulty of obtaining data), so we wish to do so efficiently, with the smallest sample possible. The second is that we want a sample size that will ensure we can get meaningful results from our study. The last thing anyone wants after spending time and money on research is for the results to be inconclusive.
Types of errors
Two types of errors can occur with a statistical hypothesis test.
Type I error: the null hypothesis is true but we reject it.
Type II error: the null hypothesis is false but we fail to reject it.
Balancing the errors
Our goal is to maintain reasonably small chances of both types of errors. The Type I error is typically handled by specifying the probability we make such an error in our hypothesis testing procedure. This probability is usually referred to as \(\alpha\) (“alpha”) and set to a low value. Most typically we use \(\alpha = 0.05\). The Type II error, on the other hand, is NOT specified in the testing. This is where sample size can play a role.
Returning to our example, suppose we collect a sample and the first serve percentage without the training is 50% and after the training it is 75%. Using an \(\alpha = 0.05\) value, however, we do not reject the null hypothesis. We cannot conclude the sample provides evidence of improvement…even though it seems like a rather positive effect!
Our sample was from two games with 4 first serves each…a very small sample. The problem is that with such a small sample we lack power to detect a difference even if it exists. Power is defined as:
\[Power = 1 - \beta\] where \(\beta\) is the Type II error rate. Thus, we controlled the Type I error, but our Type II error rate may be too high!
Small samples lead to large uncertainty about the estimates. Consider the 50% estimate without training: it was based on 2 of 4 successful serves. If just one of those serves had gone differently (say, one more success), the percentage would change by 25 percentage points!
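We can make this concrete. Here is a sketch using R’s power.prop.test (we will meet it formally in Example 2) that treats the observed 50% and 75% as the true success rates, with n = 4 first serves per condition:
# power to detect 50% vs. 75% with only 4 serves per group
power.prop.test(n = 4, p1 = 0.50, p2 = 0.75,
                sig.level = 0.05, alternative = "one.sided")
The power returned is far below the conventional 0.8 target, so failing to reject the null hypothesis here tells us very little.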
We thus want a sample large enough that it will reject the null hypothesis when there is a true effect. Generally speaking, resource constraints lead to trying to find the smallest sample size that has adequate power to do so, so in practice there is little danger of collecting “too large” a sample. You might wonder, though: if resources permit, why not just get a super large sample? That would give very high power!
The problem with a very large sample is that it has power to detect very, very small effects. For example, suppose we get a sample of millions of serves. The result is 50% success without the training and 50.1% with the training. The gigantic sample could lead to rejecting the null hypothesis, thus concluding there is evidence of a difference due to the training.
The sample size gives us great precision in these estimates, and, after all, technically they are different. However, the effect is clearly not a difference that any tennis player would care about enough to hire you as their trainer.
Factors impacting sample size (power)
In order to determine the sample size that gets us into the “sweet spot” (not in the sense of hitting a tennis ball), there are four factors that are important in some form for all “power” calculations:
1. The sample size (or power). Note that sometimes we specify the power and compute the sample size, and sometimes the reverse. If power is specified, typical choices are \(0.8\) or \(0.9\) (meaning \(\beta = 0.2\) or \(\beta = 0.1\) Type II error rates).
2. Your chosen Type I error rate. Typically we use \(\alpha = 0.05\).
3. The amount of variability in the data. Variability impacts the precision of estimates.
4. How big of a difference (or how strong of an association) you believe exists and is meaningful.
Items 3 and 4 must be estimated in some fashion, which is often a challenge for sample size calculations. They are often combined and referred to as an “effect size”. There are various measures of effect size for different settings, with rules of thumb for what constitutes small, medium, etc. effect sizes that are then used to compute the desired sample size.
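To see how these factors interact, here is a small sketch (using a standardized setup with sd = 1 and delta = 0.5, i.e., a medium effect) that computes the power of a two-sample t-test at several sample sizes while holding the effect size and \(\alpha\) fixed:
# power rises with n when the effect size and alpha are held fixed
sapply(c(20, 50, 100), function(n) {
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = 0.05,
               type = "two.sample", alternative = "one.sided")$power
})
Power climbs toward 1 as n grows; we will use this same power.t.test function in Example 1.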
Example 1: Power for two sample t-test
Remember our first research question: is the distance run per point greater on clay courts than on grass courts?
We have data for distances run in meters. Our sample will involve those distances on clay and grass courts. Distance is a continuous variable, so an appropriate test is the two-sample t-test. The hypotheses for this test are:
\[H_0: \mu_{clay} = \mu_{grass}\]
\[H_a: \mu_{clay} > \mu_{grass}\]
where \(\mu_{clay}\) is the true average distance run (in meters per point) on clay courts (and similar for \(\mu_{grass}\)).
Note we used “greater than” in the alternative hypothesis, implying we believe the amount of running is greater on clay courts. This reflects the general belief that the “slower” court leads to longer points, as players are able to reach the ball more easily even if it is hit farther from them.
Estimating the Parameters
With the statistical method defined we are then ready to determine the parameter values we will use to perform our sample size computations. We will compute sample size for a desired power, and set the first two parameters to typical values:
Power = \(0.8\)
\(\alpha = 0.05\)
The variability is often estimated from pilot data or from information in previous studies. We will use our Wimbledon 2023 data. We can compute the standard deviations (sd) for player one run distances and also for player two run distances:
sd(tennis$p1_distance_run)
[1] 13.49286
sd(tennis$p2_distance_run)
[1] 13.60765
Both are similar and around 13.5 meters/point. So, we will choose this value:
- sd = 13.5 meters/point (estimated variability)
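As an aside, here is a tidyverse-style way to compute both standard deviations in one call (a sketch using summarise() and across()):
tennis |>
  summarise(across(c(p1_distance_run, p2_distance_run), sd))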
The final value we need is the value of the difference in run distance that we would consider meaningful. The mean values for the run distances in the Wimbledon data are:
mean(tennis$p1_distance_run)
[1] 14.00231
mean(tennis$p2_distance_run)
[1] 13.86924
Both are around 14 meters/point. What would represent a meaningful increase for the distance run on clay?
One tool that could help is to consider effect size. Cohen (1988) offers some advice. A metric known as Cohen’s D is one measure, defined as:
\[D = \frac{\mu_2 - \mu_1}{sd}\] where \(sd\) is the pooled standard deviation of the observations. If the average distance on clay increases by one standard deviation, then \(D = 1\). In other words, Cohen’s D expresses the increase in units of the standard deviation: \(D = 0.5\) would be an increase of half of a standard deviation.
An increase of 1 standard deviation (13.5 meters/point) seems large as it would nearly double the average run per point.
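Since Cohen’s D is simply the increase divided by the standard deviation, candidate increases translate directly into D values (a quick sketch, using our sd of 13.5 meters/point):
# increases of 1, 0.5, and 0.25 standard deviations as Cohen's D
c(13.5, 6.75, 3.375) / 13.5  # D = 1, 0.5, 0.25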
The R package “effectsize” contains a function to provide an interpretation of Cohen’s D values. We provide interpretation for the one standard deviation increase (13.5 meters/point) and for smaller increases of 0.5 and 0.25 standard deviations.
interpret_cohens_d(1)
[1] "large"
(Rules: cohen1988)
interpret_cohens_d(0.5)
[1] "medium"
(Rules: cohen1988)
interpret_cohens_d(0.25)
[1] "small"
(Rules: cohen1988)
We will opt for a “medium” effect size using the \(D = 0.5\) value. That would be an increase of one half of a standard deviation: \(0.5 \times 13.5 = 6.75\).
- Difference in means (delta) = 6.75 (entered as 6.76 in the code below; the results are essentially identical).
Computing the Sample Size
The R command “power.t.test” computes the sample size (or power). We compute sample size by leaving the parameter “n” as “NULL” (note that NULL is the default, so we did not need to specify it explicitly when running the command). We must specify the power in this case; alternatively, we could give a value of “n” and make the power NULL to compute power.
Other values are shown below.
Note that the type is “two.sample” because we are comparing two sample means.
The alternative is “one.sided” because we hypothesized an increase (>) in distance run on clay courts. A \(\ne\) alternative hypothesis would be “two.sided”.
power.t.test(n = NULL,
delta = 6.76,
sd = 13.5,
sig.level = 0.05,
power = 0.8,
type = c("two.sample"),
alternative = c("one.sided")
)
Two-sample t test power calculation
n = 50.00462
delta = 6.76
sd = 13.5
sig.level = 0.05
power = 0.8
alternative = one.sided
NOTE: n is number in *each* group
The value returned is n = 50.00462, which is the number per group (so points observed on each of the two court surfaces). We always round up to ensure adequate power, so we will need n = 51 points per court surface to conduct our study.
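In code, you can extract the n component of the result and round up with ceiling(); a small sketch:
# pull out n and always round UP to guarantee adequate power
result <- power.t.test(delta = 6.76, sd = 13.5, sig.level = 0.05,
                       power = 0.8, type = "two.sample",
                       alternative = "one.sided")
ceiling(result$n)  # 51 points per court surface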
TIP: Copy the command in our example and change the parameters as needed to complete each exercise.
Example 2: Power for two sample test of proportions
Now let’s consider our second research question: is the proportion of first-serve points won lower on clay courts than on grass courts?
We are now comparing proportions (of points won when hitting a first serve into the court). Our sample will involve those proportions on clay and grass courts. The appropriate test is thus the two-sample test of proportions. The hypotheses for this test are:
\[H_0: p_{clay} = p_{grass}\] \[H_a: p_{clay} < p_{grass}\]
where p is the true proportion of first serve points won on the given surface.
Notice that we again chose a one-sided alternative, but this time with the proportion on clay courts less than on grass. The slower clay courts are thought to make it easier to return hard first serves. Again, if we did not have prior knowledge about the impact of the surface, we would use a \(\ne\) alternative here.
Estimating the Parameters
We will again compute sample size for a desired power, and set the first two parameters to typical values:
Power = \(0.8\)
\(\alpha = 0.05\)
The third parameter, an estimate of variability, is not needed for the two-sample proportions test. The reason is that the variance of a proportion is actually a function of the proportion itself. This comes from the variance of a binary (0 or 1) variable, which is modeled using the Bernoulli (Binomial) distribution. If the true proportion is p, then the variance is:
\[p \times (1-p)\]
So, once we provide a hypothesized value for p, then the variance can be computed!
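For instance, if the hypothesized proportion were p = 0.75, the variance is a one-line computation:
p <- 0.75    # hypothesized proportion of first-serve points won
p * (1 - p)  # Bernoulli variance: 0.1875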
The fourth parameter is again the difference we would consider meaningful. We can use our data to get an estimate for the proportion of points won on grass for a first serve.
We first get the percentage for player one. The “filter” function allows us to select only first-serve data (“serve_no == 1”) when player one is serving (“server == 1”). We then obtain a table showing the percentage of points won by each player.
p1serve1 <- tennis |> filter(serve_no == 1 & server == 1)
PercTable(p1serve1$point_victor)
freq perc
1 1'728 76.6%
2 529 23.4%
Since player 1 is the server in this reduced data set, we see that the server wins 76.6% of the points.
We can repeat this for player two (below) and find a similar percentage of 74.3%.
p2serve1 <- tennis |> filter(serve_no == 1 & server == 2)
PercTable(p2serve1$point_victor)
freq perc
1 616 25.7%
2 1'784 74.3%
We therefore select 75% as a reasonable winning percentage for Wimbledon (grass). The question is what would be a noteworthy difference in winning percentage on clay.
We can again consider effect size, and the R package “pwr” provides a function “ES.h” that computes an effect size based on two proportions, known as Cohen’s H (Cohen, 1988). The rules of thumb for this value are the same as for Cohen’s D, so we can again use the “interpret_cohens_d” function once we obtain a value.
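For reference, Cohen’s H is computed on arcsine-transformed proportions, \(h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}\), so it can be verified by hand:
# Cohen's H via the arcsine transform; matches pwr::ES.h(0.75, 0.5) below
2 * asin(sqrt(0.75)) - 2 * asin(sqrt(0.5))
This returns the same 0.5235988 you will see from ES.h in the next chunk.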
Let’s see what the effect size is if the percentage won on clay is only 50%:
prop_effect <- ES.h(0.75, 0.5)
prop_effect
[1] 0.5235988
interpret_cohens_d(prop_effect)
[1] "medium"
(Rules: cohen1988)
The result is a “medium” effect, but practically that seems like an unlikely change. Even though clay might reduce the serve advantage, it still probably exists. Let’s consider reducing the advantage to 65%:
prop_effect <- ES.h(0.75, 0.65)
prop_effect
[1] 0.2189061
interpret_cohens_d(prop_effect)
[1] "small"
(Rules: cohen1988)
This is a small effect size, but practically it is certainly meaningful, so we will use it in our calculations:
- Difference in proportions: 0.1 (from 0.75 to 0.65)
Computing the Sample Size
The command for proportions is “power.prop.test”. For the difference in proportions, we actually input the two proportions rather than the delta. As we will see in the exercises, this matters because the estimate of the variance is based on the hypothesized proportions. The rest of the options are similar to those for the two-sample t-test.
power.prop.test(n = NULL,
p1 = 0.75,
p2 = 0.65,
sig.level = 0.05,
power = 0.8,
alternative = "one.sided")
Two-sample comparison of proportions power calculation
n = 258.619
p1 = 0.75
p2 = 0.65
sig.level = 0.05
power = 0.8
alternative = one.sided
NOTE: n is number in *each* group
The value returned is n = 258.619 per group, so we will need n = 259 points per court surface to conduct this second study.
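Because the pwr package is already loaded, we can cross-check with its Cohen’s-H-based calculator (a sketch; since it works on the arcsine scale, the n it reports will differ slightly from power.prop.test):
# same design via pwr: effect size h for 0.75 vs. 0.65, one-sided test
pwr.2p.test(h = ES.h(0.75, 0.65), sig.level = 0.05,
            power = 0.8, alternative = "greater")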
REFERENCES
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd Ed.). New York: Routledge.