4.9. Knowing the effect size#
Some sleight of hand has been at play in this chapter.
I said that, in power analysis, we assume the alternative hypothesis \(\mathcal{H}_a\) is true, and then simulate data with a particular effect size. But this raises an obvious question: where did that effect size come from?
We started from this situation:
I collect data on end-of-year exam scores in Maths and French for 50 high school students, and calculate Pearson’s \(r\) between Maths and French scores.
\(\mathcal{H}_0\): Under the null hypothesis there is no correlation between Maths and French scores.
\(\mathcal{H}_a\): Under the alternative hypothesis there is a correlation.
But then, to perform a power analysis, we jumped to this assumption:
If \(\mathcal{H}_a\) is true, the population correlation is \(r = 0.25\).
How did we decide to use \(r = 0.25\) as the effect size for the simulated “correlated population”, and therefore for the power calculation?
4.9.1. Post hoc power analysis#
In the example above, I took the value of \(r\) observed in my sample (\(r = 0.25\)) and used it as the effect size for the power analysis.
This approach is sometimes referred to as a post hoc power analysis.
When I ran the power analysis after the fact, it suggested that I would have needed a sample size of about 128 participants (rather than 50) to detect a correlation of this size with 80% power.
This is not really the intended purpose of power analysis, although it is a common way that power analysis is used in practice: to evaluate, after the study has been run, whether it was sufficiently well powered.
Ideally, power analysis should be carried out at the study-planning stage, to determine in advance how large a sample needs to be.
Power calculations conducted before data collection are now required by most funders, ethical review boards, pre-registration platforms, and many journals.
This matters because underpowered studies are a poor use of resources and are less likely to produce reliable, reproducible results.
But this raises a key question: if we want to perform a power analysis before running a study, how can we know what effect size to use?
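As a sanity check on the post hoc figure quoted above, the required sample size for detecting a correlation can also be approximated analytically using the Fisher \(z\)-transform of \(r\) (a sketch of that standard approximation, not the method used elsewhere in this chapter; it gives a figure in the same ballpark as ~128):

```python
# Approximate sample size to detect r = 0.25 with 80% power, alpha = 0.05,
# using the Fisher z approximation: n = ((z_crit + z_pow) / arctanh(r))^2 + 3
from math import atanh, ceil
from scipy.stats import norm

r = 0.25
alpha, power = 0.05, 0.80

z_r = atanh(r)                    # Fisher z-transform of r
z_crit = norm.ppf(1 - alpha / 2)  # two-tailed critical value, about 1.96
z_pow = norm.ppf(power)           # about 0.84

n = ceil(((z_crit + z_pow) / z_r) ** 2 + 3)
print(n)  # 124 -- close to the ~128 quoted above
```

The small discrepancy from 128 is expected: the Fisher \(z\) method is an approximation, and different power routines make slightly different assumptions.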
4.9.2. Video#
Here is a video about how we can decide on the effect size for a power analysis:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/c-a1TAX0kBk?si=05b2IWaSpQvjNQHi" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
4.9.3. Estimating the effect size from the literature#
To get an idea of the effect size we expect in a planned study, we can look at other similar studies in the literature. For example, if I want to know whether a new literacy intervention improves reading scores in primary school children, I can look at the effect sizes in previous studies of reading interventions.
4.9.4. Recovering \(d\) from \(t\) and \(n\)#
Although it is not common practice to report effect sizes in journal articles, they can be recovered from the \(t\) score and the sample size \(n\) as follows.
Paired sample \(t\)-test#
Remember that

\[ t = \frac{\bar{x}}{s_x / \sqrt{n}} = \frac{\bar{x}\sqrt{n}}{s_x} \]

where \(\bar{x}\) is the mean pairwise difference (e.g. the mean difference in height between a brother and his own sister), \(s_x\) is the standard deviation of those differences, and \(n\) is the number of pairs.
Now Cohen’s \(d\) is given by a similar formula:

\[ d = \frac{\bar{x}}{s_x} \]

Rearranging, we see that

\[ d = \frac{t}{\sqrt{n}} \]
In Python you can recover \(d\) using the following code (in this example, \(t = 2.8\) and \(n = 30\)):
t = 2.8
n = 30
d = (t)/(n**.5)
print(d)
0.511207720338155
One sample \(t\)-test#
This is very similar to the paired sample t-test.
We have

\[ t = \frac{\bar{x}-\mu}{s_{x-\mu} / \sqrt{n}} \]

where \(\bar{x}-\mu\) is the mean deviation of the data points from the reference value \(\mu\) (where the reference value might be zero, or some fixed number like the population mean height of men), \(s_{x-\mu}\) is the standard deviation of these deviations, and \(n\) is the number of data points.
Again we have

\[ d = \frac{t}{\sqrt{n}} \]
So in python, this would be the same:
t = 2.8
n = 30
d = (t)/(n**.5)
print(d)
0.511207720338155
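As a sanity check on the identity \(d = t/\sqrt{n}\), here is a short sketch (using simulated data, not data from the chapter) showing that \(d\) recovered from \(t\) and \(n\) matches \(d\) computed directly from the sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.5, 1, size=30)  # simulated scores, true mean 0.5

t, p = stats.ttest_1samp(x, 0)       # one-sample t-test against mu = 0
d_direct = x.mean() / x.std(ddof=1)  # Cohen's d computed from the data
d_from_t = t / np.sqrt(len(x))       # d recovered from t and n

print(d_direct, d_from_t)  # the two values agree
```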
Correlation#
Power analysis could be run on the effect size \(r\) directly, but to use statsmodels we convert \(r\) to \(t\) using the formula

\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]

Again we have

\[ d = \frac{t}{\sqrt{n}} \]
So if we had: \(r\) =.4 and \(n\) = 50, the code would look like this:
r = 0.4
n = 50
t = (r*(n-2)**0.5)/((1-r**2)**0.5) # note **2 means 'squared', **0.5 means 'square root'
print(t)
d = (t)/(n**.5)
print(d)
3.023715784073818
0.427617987059879
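The \(r\)-to-\(t\) conversion can be cross-checked against `scipy.stats.pearsonr`: the \(p\)-value implied by the converted \(t\) (on \(n-2\) degrees of freedom) should match the \(p\)-value that `pearsonr` reports. A quick sketch with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.4 * x + rng.normal(size=50)  # build in some correlation

r, p = stats.pearsonr(x, y)
n = len(x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)  # convert r to t

# two-tailed p-value from the t distribution with n-2 degrees of freedom
p_from_t = 2 * stats.t.sf(abs(t), df=n - 2)
print(p, p_from_t)  # the two p-values agree
```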
Independent samples \(t\)-test#
For the independent samples \(t\)-test, we use a similar approach to the paired- and one-sample \(t\)-tests, but need to take into account that there are now two group sizes, \(n_1\) and \(n_2\), and the value of \(s\) in the formula for \(t\) is a combination of the two sample standard deviations \(s_1\) and \(s_2\) into a pooled variance estimate, as follows:

\[ s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} \]
Yikes!
The formula for \(t\) for the independent samples \(t\)-test is:

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \]

where \(\bar{x}_1, \bar{x}_2\) are the group means and \(n_1, n_2\) are the group sizes.
This all means that to recover Cohen’s \(d\) for the independent samples \(t\)-test we need

\[ d = t\sqrt{\frac{1}{n_1}+\frac{1}{n_2}} \]
Phew!
So in code, if \(t = 2.8\), \(n_1 = 12\), \(n_2 = 10\):
t = 2.8
n1 = 12
n2 = 10
d = t*((1/n1 + 1/n2)**0.5)
print(d)
1.1988883740087453
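Again as a sanity check (with simulated data), \(d\) recovered via \(t\sqrt{1/n_1 + 1/n_2}\) matches Cohen’s \(d\) computed directly with the pooled standard deviation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.normal(1.0, 1, size=12)
g2 = rng.normal(0.0, 1, size=10)

t, p = stats.ttest_ind(g1, g2)  # default equal_var=True uses the pooled s

# Cohen's d computed directly with the pooled standard deviation
n1, n2 = len(g1), len(g2)
s_pooled = np.sqrt(((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1))
                   / (n1 + n2 - 2))
d_direct = (g1.mean() - g2.mean()) / s_pooled

# d recovered from t and the two group sizes
d_from_t = t * np.sqrt(1 / n1 + 1 / n2)

print(d_direct, d_from_t)  # the two values agree
```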
4.9.5. Practical effect size#
One context in which a power analysis can definitely be meaningful is when we know how big an effect would be useful, even if we don’t know what the underlying effect size in the population is.
Say for example we are testing a new analgesic drug. We may not know how much the drug will reduce pain scores (the true effect size), but we can certainly define a minimum effect size that would be clinically meaningful. You could say that you would only consider the effect of the drug clinically significant if there is a 10% change in pain scores (otherwise, the drug won’t be worth taking). That is different from statistical significance: if you test enough patients you could detect a statistically significant result even for a very small change in clinical outcome, but it still wouldn’t mean your drug is an effective painkiller.
If we conduct a power analysis assuming that the effect size in the population is the minimum clinically significant effect, this will tell us how many participants we need to detect such a clinically significant effect with (say) 80% power. By definition a smaller effect would need more participants to detect it (but we wouldn’t be interested in such a small effect from a clinical perspective, so that doesn’t matter). Any effect larger than the minimum clinically significant effect would have more than 80% power, as larger effects are easier to detect.
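As a sketch of how this might look in statsmodels, suppose (purely hypothetically) that the minimum clinically meaningful effect for a two-group drug trial is \(d = 0.4\):

```python
# How many participants per group are needed to detect a minimum clinically
# meaningful effect of d = 0.4 with 80% power? (d = 0.4 is a made-up value
# for illustration, not a real clinical threshold.)
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n = analysis.solve_power(effect_size=0.4, alpha=0.05, power=0.8,
                         alternative='two-sided')
print(n)  # roughly 100 participants per group
```

Any true effect larger than \(d = 0.4\) would then be detected with more than 80% power by this sample size, which is exactly the property we want.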