library(tidyverse) #loads package
library(DescTools)
library(effectsize)
library(pwr)
tennis <- read_csv("wimbledon_featured_matches.csv") #loads the tennis data
Tennis Sample Size
Introduction
In this module, you will compute the required sample size for studies involving tennis data, using data from Wimbledon 2023 matches to estimate required parameters.
NOTE: R is the name of the programming language itself and RStudio is a convenient interface. To throw even more lingo in, you may be accessing RStudio through a web-based version called Posit Cloud. But R is the programming language you are learning :)
Getting started: Tennis data
The first step to any analysis in R is to load necessary packages and data.
You can think of packages like apps on your phone; they extend the functionality and give you access to many more features beyond what comes in the “base package”.
Running the following code will load the tidyverse package (along with the DescTools, effectsize, and pwr packages we will use later in the lab) and the tennis data we will be using in this lab.
TIP: As you follow along in the lab, you should run each corresponding code chunk in your .qmd document. To “Run” a code chunk, you can press the green “Play” button in the top right corner of the code chunk in your .qmd. You can also place your cursor anywhere in the line(s) of code you want to run and press “command + return” (Mac) or “Ctrl + Enter” (Windows).
TIP: Using a hashtag in R allows you to add comments to your code (in plain English). Data scientists often use comments to explain what each piece of the code is doing.
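For example, R ignores everything after the hashtag when the code runs:
# this entire line is a comment and does nothing when run
sqrt(16)  # a comment can also sit at the end of a line of code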
We can use the glimpse() function to get a quick look (errr.. glimpse) at our tennis data. The glimpse() output provides the number of observations (Rows) and the number of variables (Columns) in the dataset. The “Rows” and “Columns” are referred to as the dimensions of the dataset. It also shows us the names of the variables (match_id, player1, …, return_depth) and the first few observations for each variable (e.g. the first match in the dataset has id “2023-wimbledon-1301” and was Carlos Alcaraz playing Nicolas Jarry).
glimpse(tennis)
Rows: 7,284
Columns: 46
$ match_id <chr> "2023-wimbledon-1301", "2023-wimbledon-1301", "2023…
$ player1 <chr> "Carlos Alcaraz", "Carlos Alcaraz", "Carlos Alcaraz…
$ player2 <chr> "Nicolas Jarry", "Nicolas Jarry", "Nicolas Jarry", …
$ elapsed_time <time> 00:00:00, 00:00:38, 00:01:01, 00:01:31, 00:02:21, …
$ set_no <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ game_no <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, …
$ point_no <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
$ p1_sets <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_sets <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_games <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, …
$ p2_games <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_score <chr> "0", "0", "15", "15", "30", "40", "40", "AD", "40",…
$ p2_score <chr> "0", "15", "15", "30", "30", "30", "40", "40", "40"…
$ server <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, …
$ serve_no <dbl> 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, …
$ point_victor <dbl> 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, …
$ p1_points_won <dbl> 0, 1, 1, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 8, 8, 9, 9, …
$ p2_points_won <dbl> 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 7, 7, 8, …
$ game_victor <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
$ set_victor <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_ace <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_ace <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
$ p1_winner <dbl> 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_winner <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
$ winner_shot_type <chr> "0", "0", "0", "F", "0", "0", "0", "F", "0", "0", "…
$ p1_double_fault <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_double_fault <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_unf_err <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_unf_err <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
$ p1_net_pt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_net_pt <dbl> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_net_pt_won <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_net_pt_won <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_break_pt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_break_pt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_break_pt_won <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_break_pt_won <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_break_pt_missed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p2_break_pt_missed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p1_distance_run <dbl> 6.000, 5.253, 13.800, 51.108, 0.649, 5.291, 6.817, …
$ p2_distance_run <dbl> 7.840, 7.094, 19.808, 75.631, 0.813, 4.249, 17.821,…
$ rally_count <dbl> 2, 1, 4, 13, 1, 2, 1, 6, 7, 5, 1, 4, 4, 3, 1, 2, 1,…
$ speed_mph <dbl> 95, 118, 120, 130, 112, 97, 109, 105, 128, 110, 112…
$ serve_width <chr> "BC", "B", "B", "BW", "W", "BW", "W", "B", "BC", "B…
$ serve_depth <chr> "NCTL", "CTL", "NCTL", "CTL", "NCTL", "NCTL", "CTL"…
$ return_depth <chr> "ND", "ND", "D", "D", NA, "ND", "D", "ND", "D", "ND…
ERROR? Did you get an error message that says could not find function "glimpse"? This means you need to load the tidyverse package. You can do this by running the code library(tidyverse) from the previous code chunk. A shortcut is to hit the “fast-forward” button (next to the “Play” button in your code chunk), which will run all code chunks above your current one.
Tennis Data Overview
Before proceeding with any analysis, let’s make sure we understand the information contained in the key variables (columns) of our dataset.
The dataset is from the men’s singles draw of the 2023 Wimbledon Championships, perhaps the most important tennis tournament each year.
Totally new to Tennis? See this site: INTRODUCTION TO TENNIS SCORING
For more information about Wimbledon: Wimbledon Official Site
Variable descriptions
We will actually only use a few columns for this module, but the full description of the data is provided. Some variables have data for both players, with column labels starting with “p1” for player 1 and “p2” for player 2. We define these for player 1, but the definitions hold for the corresponding player 2 variables.
Variable | Definition |
---|---|
match_id | match identification |
player1 | first and last name of the first player |
player2 | first and last name of the second player |
elapsed_time | time elapsed since start of first point to start of current point (H:MM:SS) |
set_no | set number in match |
game_no | game number in set |
point_no | point number in game |
p1_sets | sets won by player 1 |
p1_games | games won by player 1 in current set |
p1_score | player 1's score within current game |
server | server of the point |
serve_no | first or second serve |
point_victor | winner of the point |
p1_points_won | number of points won by player 1 in match |
game_victor | a player won a game this point |
set_victor | a player won a set this point |
p1_ace | player 1 hit an untouchable winning serve |
p1_winner | player 1 hit an untouchable winning shot |
winner_shot_type | category of untouchable shot |
p1_double_fault | player 1 missed both serves and lost the point |
p1_unf_err | player 1 made an unforced error |
p1_net_pt | player 1 made it to the net |
p1_net_pt_won | player 1 won the point while at the net |
p1_break_pt | player 1 has an opportunity to win a game player 2 is serving |
p1_break_pt_won | player 1 won the game player 2 is serving |
p1_break_pt_missed | player 1 missed an opportunity to win a game player 2 is serving |
p1_distance_run | player 1's distance ran during point (meters) |
rally_count | number of shots during the point |
speed_mph | speed of serve (miles per hour; mph) |
serve_width | direction of serve |
serve_depth | depth of serve |
return_depth | depth of return |
Viewing your data
You saw that glimpse()
is one way to get a quick look at your data. Often, you’ll want to view your whole dataset. There are two ways to do this:
TIP: Recall that RStudio is split into four quadrants: Source (upper left), Environment (upper right), Console (bottom left), and Files/Plots/Packages/Help/Viewer (bottom right)
- Type View(tennis) in your Console and then press return/Enter on your keyboard.
- OR, in your Environment tab, double click the name of the dataset you want to view.
This will open up your data in a new viewer tab so that you can view it like a spreadsheet (like Google Sheets or Excel*). Once open, you can sort the data by clicking on a column.
*Unlike Google Sheets or Excel, however, you won’t be able to edit the data directly in the spreadsheet.
TIP: Type your answers to each exercise in the .qmd document.
TIP: When viewing the data, clicking on a column once will sort the data according to that variable in ascending order; clicking twice will sort in descending order.
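If you prefer to sort with code rather than clicks, the tidyverse function arrange() does the same job; here is a quick sketch using the speed_mph column:
tennis |> arrange(desc(speed_mph))  # fastest serves first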
Sample Size Overview
We will use statistical hypothesis testing to help address research questions using data (samples). The tests involve the basic hypotheses:
\(H_0\): there is no effect/difference, the “status quo” (null hypothesis)
\(H_a\): there is an effect/difference (alternative hypothesis)
We are typically hoping to use the data to “reject” the null hypothesis and provide evidence that there is an effect. As a simple example, suppose we develop a new training method to improve serving accuracy in tennis. We will set up an experiment to compare percentage of first serves that are in (accurate) before and after the new training is applied. The hypotheses would be:
\(H_0\): there is no difference in first serve percentage before and after the training (null hypothesis)
\(H_a\): there is a difference (improvement) in first serve percentage after the training (alternative hypothesis)
The question to answer before running the experiment is: how large a sample should we collect in order to test the hypotheses?
Why does sample size matter?
There are several things that drive the need for sample size calculations. One is resources. Conducting an experiment is often costly (in money, time, and the difficulty of obtaining data), so we wish to do so efficiently, with the smallest sample possible. The second is that we want a sample size that will ensure we can get meaningful results from our study. The last thing anyone wants after spending time and money on research is for the results to be inconclusive.
Types of errors
Two types of errors can occur with a statistical hypothesis test.
Type I error: the null hypothesis is true but we reject it.
Type II error: the null hypothesis is false but we fail to reject it.
Balancing the errors
Our goal is to maintain reasonably small chances of both types of errors. The Type I error is typically handled by specifying the probability we make such an error in our hypothesis testing procedure. This probability is usually referred to as \(\alpha\) (“alpha”) and set to a low value. Most typically we use \(\alpha = 0.05\). The Type II error, on the other hand, is NOT specified in the testing. This is where sample size can play a role.
Returning to our example, suppose we collect a sample and the first serve percentage without the training is 50% and after the training it is 75%. Using an \(\alpha = 0.05\) value, however, we do not reject the null hypothesis. We cannot conclude the sample provides evidence of improvement…even though it seems like a rather positive effect!
Our sample was from two games with 4 first serves each…a very small sample. The problem is that with such a small sample we lack power to detect a difference even if it exists. Power is defined as:
\[Power = 1 - \beta\] where \(\beta\) is the Type II error rate. Thus, we controlled the Type I error, but our Type II error rate may be too high!
Small samples lead to large uncertainty about the estimates. Consider the 50% estimate without training: it was based on 2 of 4 successful serves. If just one of those serves had gone differently (say, one more success), the percentage would change by 25 percentage points!
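We can make this concrete. Here is a sketch using R’s power.prop.test (we will meet it formally in Example 2) that treats the observed 50% and 75% as the true success rates, with n = 4 first serves per condition:
# power to detect 50% vs. 75% with only 4 serves per group
power.prop.test(n = 4, p1 = 0.50, p2 = 0.75,
                sig.level = 0.05, alternative = "one.sided")
The power returned is far below the conventional 0.8 target, so failing to reject the null hypothesis here tells us very little.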
We thus want a sample large enough that it will reject the null hypothesis when there is a true effect. Generally speaking, resource constraints lead to trying to find the smallest sample size that has adequate power to do so, so in practice there is little danger of collecting “too large” a sample. You might wonder, though: if resources permit, why not just get a super large sample? That would give very high power!
The problem with a very large sample is that it has power to detect very, very small effects. For example, suppose we get a sample of millions of serves. The result is 50% success without the training and 50.1% with the training. The gigantic sample could lead to rejecting the null hypothesis, thus concluding there is evidence of a difference due to the training.
The sample size gives us great precision in these estimates, and, after all, technically they are different. However, the effect is clearly not a difference that any tennis player would care about enough to hire you as their trainer.
Factors impacting sample size (power)
In order to determine the sample size that gets us into the “sweet spot” (not in the sense of hitting a tennis ball), there are four factors that are important in some form for all “power” calculations:
1. The sample size (or power). Note that sometimes we specify the power and compute the sample size, and sometimes the reverse. If power is specified, typical choices are \(0.8\) or \(0.9\) (meaning \(\beta = 0.2\) or \(\beta = 0.1\) Type II error rates).
2. Your chosen Type I error rate. Typically we use \(\alpha = 0.05\).
3. The amount of variability in the data. Variability impacts the precision of estimates.
4. How big of a difference (or how strong of an association) you believe exists and is meaningful.
Items 3 and 4 must be estimated in some fashion, which is often a challenge for sample size calculations. They are often combined and referred to as an “effect size”. There are various measures of effect size for different settings, with rules of thumb for what constitutes small, medium, etc. effect sizes that are then used to compute the desired sample size.
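To see how these factors interact, here is a small sketch (using a standardized setup with sd = 1 and delta = 0.5, i.e., a medium effect) that computes the power of a two-sample t-test at several sample sizes while holding the effect size and \(\alpha\) fixed:
# power rises with n when the effect size and alpha are held fixed
sapply(c(20, 50, 100), function(n) {
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = 0.05,
               type = "two.sample", alternative = "one.sided")$power
})
Power climbs toward 1 as n grows; we will use this same power.t.test function in Example 1.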
Example 1: Power for two sample t-test
Remember our first research question: is the distance run per point greater on clay courts than on grass courts?
We have data for distances run in meters. Our sample will involve those distances on clay and grass courts. Distance is a continuous variable, so an appropriate test is the two-sample t-test. The hypotheses for this test are:
\[H_0: \mu_{clay} = \mu_{grass}\]
\[H_a: \mu_{clay} > \mu_{grass}\]
where \(\mu_{clay}\) is the true average distance run (in meters per point) on clay courts (and similar for \(\mu_{grass}\)).
Note we used “greater than” in the alternative hypothesis, implying we believe the amount of running is greater on clay courts. This reflects the general belief that the “slower” court leads to longer points, as players are able to reach the ball more easily even if it is hit farther from them.
Estimating the Parameters
With the statistical method defined we are then ready to determine the parameter values we will use to perform our sample size computations. We will compute sample size for a desired power, and set the first two parameters to typical values:
Power = \(0.8\)
\(\alpha = 0.05\)
The variability is often estimated from pilot data or from information in previous studies. We will use our Wimbledon 2023 data. We can compute the standard deviations (sd) for player one run distances and also for player two run distances:
sd(tennis$p1_distance_run)
[1] 13.49286
sd(tennis$p2_distance_run)
[1] 13.60765
Both are similar and around 13.5 meters/point. So, we will choose this value:
- sd = 13.5 meters/point (estimated variability)
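As an aside, here is a tidyverse-style way to compute both standard deviations in one call (a sketch using summarise() and across()):
tennis |>
  summarise(across(c(p1_distance_run, p2_distance_run), sd))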
The final value we need is the value of the difference in run distance that we would consider meaningful. The mean values for the run distances in the Wimbledon data are:
mean(tennis$p1_distance_run)
[1] 14.00231
mean(tennis$p2_distance_run)
[1] 13.86924
Both are around 14 meters/point. What would represent a meaningful increase for the distance run on clay?
One tool that could help is to consider effect size. Cohen (1988) offers some advice. A metric known as Cohen’s D is one measure, defined as:
\[D = \frac{\mu_2 - \mu_1}{sd}\] where \(sd\) is the pooled standard deviation of the observations. If the average distance on clay increases by one standard deviation, then \(D = 1\). In other words, Cohen’s D expresses the increase in units of the standard deviation: \(D = 0.5\) would be an increase of half of a standard deviation.
An increase of 1 standard deviation (13.5 meters/point) seems large as it would nearly double the average run per point.
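Since Cohen’s D is simply the increase divided by the standard deviation, candidate increases translate directly into D values (a quick sketch, using our sd of 13.5 meters/point):
# increases of 1, 0.5, and 0.25 standard deviations as Cohen's D
c(13.5, 6.75, 3.375) / 13.5  # D = 1, 0.5, 0.25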
The R package “effectsize” contains a function to provide an interpretation of Cohen’s D values. We provide interpretation for the one standard deviation increase (13.5 meters/point) and for smaller increases of 0.5 and 0.25 standard deviations.
interpret_cohens_d(1)
[1] "large"
(Rules: cohen1988)
interpret_cohens_d(0.5)
[1] "medium"
(Rules: cohen1988)
interpret_cohens_d(0.25)
[1] "small"
(Rules: cohen1988)
We will opt for a “medium” effect size using the \(D = 0.5\) value. That would be an increase of one half of a standard deviation: \(0.5 \times 13.5 = 6.75\).
- Difference in means (delta) = 6.75 (entered as 6.76 in the code below; the results are essentially identical).
Computing the Sample Size
The R command “power.t.test” computes the sample size (or power). We compute sample size by leaving the parameter “n” as “NULL” (note that NULL is the default, so we did not need to specify it explicitly when running the command). We must specify the power in this case; alternatively, we could give a value of “n” and make the power NULL to compute power.
Other values are shown below.
Note that the type is “two.sample” because we are comparing two sample means.
The alternative is “one.sided” because we hypothesized an increase (>) in distance run on clay courts. A \(\ne\) alternative hypothesis would be “two.sided”.
power.t.test(n = NULL,
delta = 6.76,
sd = 13.5,
sig.level = 0.05,
power = 0.8,
type = c("two.sample"),
alternative = c("one.sided")
)
Two-sample t test power calculation
n = 50.00462
delta = 6.76
sd = 13.5
sig.level = 0.05
power = 0.8
alternative = one.sided
NOTE: n is number in *each* group
The value returned is n = 50.00462, which is the number per group (so points observed on each of the two court surfaces). We always round up to ensure adequate power, so we will need n = 51 points per court surface to conduct our study.
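In code, you can extract the n component of the result and round up with ceiling(); a small sketch:
# pull out n and always round UP to guarantee adequate power
result <- power.t.test(delta = 6.76, sd = 13.5, sig.level = 0.05,
                       power = 0.8, type = "two.sample",
                       alternative = "one.sided")
ceiling(result$n)  # 51 points per court surface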
TIP: Copy the command in our example and change the parameters as needed to complete each exercise.
Example 2: Power for two sample test of proportions
Now let’s consider our second research question: is the proportion of first-serve points won lower on clay courts than on grass courts?
We are now comparing proportions (of points won when hitting a first serve into the court). Our sample will involve those proportions on clay and grass courts. The appropriate test is thus the two-sample test of proportions. The hypotheses for this test are:
\[H_0: p_{clay} = p_{grass}\] \[H_a: p_{clay} < p_{grass}\]
where p is the true proportion of first serve points won on the given surface.
Notice that we again chose a one-sided alternative, but this time with the proportion on clay courts less than on grass. The slower clay courts are thought to make it easier to return hard first serves. Again, if we did not have prior knowledge about the impact of the surface, we would use a \(\ne\) alternative here.
Estimating the Parameters
We will again compute sample size for a desired power, and set the first two parameters to typical values:
Power = \(0.8\)
\(\alpha = 0.05\)
The third parameter, an estimate of variability, is not needed for the two-sample proportions test. The reason is that the variance of a proportion is actually a function of the proportion itself. This comes from the variance of a binary (0 or 1) variable, which is modeled using the Bernoulli (Binomial) distribution. If the true proportion is p, then the variance is:
\[p \times (1-p)\]
So, once we provide a hypothesized value for p, then the variance can be computed!
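For instance, if the hypothesized proportion were p = 0.75, the variance is a one-line computation:
p <- 0.75    # hypothesized proportion of first-serve points won
p * (1 - p)  # Bernoulli variance: 0.1875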
The fourth parameter is again the difference we would consider meaningful. We can use our data to get an estimate for the proportion of points won on grass for a first serve.
We first get the percentage for player one. The “filter” function allows us to select only first-serve data (“serve_no == 1”) when player one is serving (“server == 1”). We then obtain a table showing the percentage of points won by each player.
p1serve1 <- tennis |> filter(serve_no == 1 & server == 1)
PercTable(p1serve1$point_victor)
freq perc
1 1'728 76.6%
2 529 23.4%
Since player 1 is the server in this reduced data set, we see that the server wins 76.6% of the points.
We can repeat this for player two (below) and find a similar percentage of 74.3%.
p2serve1 <- tennis |> filter(serve_no == 1 & server == 2)
PercTable(p2serve1$point_victor)
freq perc
1 616 25.7%
2 1'784 74.3%
We therefore select 75% as a reasonable winning percentage for Wimbledon (grass). The question is what would be a noteworthy difference in winning percentage on clay.
We can again consider effect size, and the R package “pwr” provides a function “ES.h” that computes an effect size based on two proportions, known as Cohen’s H (Cohen, 1988). The rules of thumb for this value are the same as for Cohen’s D, so we can again use the “interpret_cohens_d” function once we obtain a value.
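For reference, Cohen’s H is computed on arcsine-transformed proportions, \(h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}\), so it can be verified by hand:
# Cohen's H via the arcsine transform; matches pwr::ES.h(0.75, 0.5) below
2 * asin(sqrt(0.75)) - 2 * asin(sqrt(0.5))
This returns the same 0.5235988 you will see from ES.h in the next chunk.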
Let’s see what the effect size is if the percentage won on clay is only 50%:
prop_effect <- ES.h(0.75, 0.5)
prop_effect
[1] 0.5235988
interpret_cohens_d(prop_effect)
[1] "medium"
(Rules: cohen1988)
The result is a “medium” effect, but practically that seems like an unlikely change. Even though clay might reduce the serve advantage, it still probably exists. Let’s consider reducing the advantage to 65%:
prop_effect <- ES.h(0.75, 0.65)
prop_effect
[1] 0.2189061
interpret_cohens_d(prop_effect)
[1] "small"
(Rules: cohen1988)
This is a small effect size, but practically it is certainly meaningful, so we will use it in our calculations:
- Difference in proportions: 0.1 (from 0.75 to 0.65)
Computing the Sample Size
The command for proportions is “power.prop.test”. For the difference in proportions, we actually input the two proportions rather than the delta. As we will see in the exercises, this matters because the estimate of the variance is based on the hypothesized proportions. The rest of the options are similar to those for the two-sample t-test.
power.prop.test(n = NULL,
p1 = 0.75,
p2 = 0.65,
sig.level = 0.05,
power = 0.8,
alternative = "one.sided")
Two-sample comparison of proportions power calculation
n = 258.619
p1 = 0.75
p2 = 0.65
sig.level = 0.05
power = 0.8
alternative = one.sided
NOTE: n is number in *each* group
The value returned is n = 258.619 per group, so we will need n = 259 points per court surface to conduct this second study.
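Because the pwr package is already loaded, we can cross-check with its Cohen’s-H-based calculator (a sketch; since it works on the arcsine scale, the n it reports will differ slightly from power.prop.test):
# same design via pwr: effect size h for 0.75 vs. 0.65, one-sided test
pwr.2p.test(h = ES.h(0.75, 0.65), sig.level = 0.05,
            power = 0.8, alternative = "greater")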
REFERENCES
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd Ed.). New York: Routledge.