Power

Changed by Stefan Tigges, 3 Dec 2023
Disclosures - updated 10 May 2023: Nothing to disclose

Updates to Article Attributes

Body was changed:

The power of a clinical trial is the probability that the trial will find a difference between groups if there is one. Power can be defined as the probability of a true positive trial result and is often written as:

  • power = (1 - β)

where β is the probability of missing a difference between groups. Most clinical trials aim for a power of 80%, i.e. an 80% chance of finding a difference if there is one. Using the definition above, this means that many clinical trials accept a β of 20%, i.e. a 20% chance of missing a real difference and producing a false negative trial result.

For radiologists, a useful analogy is that power is like sensitivity: both are measures of the likelihood of finding something, a difference between groups in the case of power and disease in the case of a diagnostic test.
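To make the definition concrete, the following is a minimal sketch (not taken from the article) that estimates power by simulation for a hypothetical two-arm trial analysed with a two-sample t-test; the helper name simulated_power and all of the numbers are illustrative assumptions.

```python
# Hedged sketch: estimate power = 1 - beta by simulating many identical trials.
# Assumed scenario: true mean difference 5, SD 10 (Cohen's d = 0.5), 64 per arm.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def simulated_power(true_diff=5.0, sd=10.0, n_per_arm=64, alpha=0.05, n_trials=10_000):
    """Fraction of simulated trials whose p-value falls below alpha."""
    hits = 0
    for _ in range(n_trials):
        control = rng.normal(0.0, sd, n_per_arm)
        treated = rng.normal(true_diff, sd, n_per_arm)
        if ttest_ind(treated, control).pvalue < alpha:
            hits += 1  # a true positive: the real difference was detected
    return hits / n_trials

power = simulated_power()      # roughly 0.8 for these assumed numbers
beta = 1 - power               # probability of missing the real difference
print(f"power ~ {power:.2f}, beta ~ {beta:.2f}")
```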

A variety of factors determine the power of a study including the size and variability of the effect under study, the level of alpha (α), and the sample size.

Effect size

The larger the effect under study, the easier it is to recognise, and the higher the power. For example, the effect of a new chemotherapeutic agent that resulted in a 100% cure rate of a previously incurable cancer would be easy to identify.

Effect variability

The less variable the effect under study, the easier it is to recognise, and the higher the power. For example, a new weight loss agent that resulted in a uniform loss of 10 kilograms among all study participants would be easier to identify than one that resulted in a wide range of weight loss with a high standard deviation. For radiologists, a useful analogy is the signal-to-noise ratio: the noisier the data, the less likely one is to recognise the signal.
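Both ideas can be expressed through a standardised effect size such as Cohen's d (mean difference divided by standard deviation): a bigger difference or a smaller standard deviation gives a larger d and more power. A hedged sketch, assuming the statsmodels library is available and using illustrative numbers only:

```python
# Hedged sketch: analytic power of a two-sample t-test for different
# combinations of mean difference and standard deviation (30 patients per arm).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

scenarios = {
    "small difference, noisy data (diff 2, SD 10)":   2.0 / 10.0,
    "large difference, same noise (diff 10, SD 10)": 10.0 / 10.0,
    "same difference, less noise (diff 2, SD 2)":     2.0 / 2.0,
}

for label, d in scenarios.items():
    power = analysis.power(effect_size=d, nobs1=30, alpha=0.05)
    print(f"{label}: d = {d:.1f}, power ~ {power:.2f}")
```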

Level of α

The effect of the level of α is the least intuitive factor affecting power. Recall that α is the pre-determined threshold for rejecting the null hypothesis (H0) that there is no difference between groups, customarily set at 0.05. If the p-value of a study is below 0.05, then H0 is rejected; if it is above 0.05, the conclusion is that there is insufficient evidence to reject H0. Although not strictly identical, radiologists may find it useful to think of type I errors (rejecting H0 when it is true, probability α) as false positives and type II errors (failing to reject H0 when it is false, probability β) as false negatives. If power is analogous to sensitivity, then α is analogous to 1 - specificity, the false positive rate. By setting α at 0.05, we accept a false positive rate of 5%. If we decrease α to 0.01, we decrease the false positive rate to 1%, but in doing so we increase β, the false negative rate, and so decrease power. A lower α means that we are less likely to reject H0: this protects us from false positive results but increases the number of times we fail to reject H0 when it is incorrect, reducing the number of true positives. Conversely, a higher α makes it easier to reject H0, increasing the number of false positives but also the number of true positives, and thus the power.

Consider the two extremes. If α = 0, H0 is never rejected and all study results are negative: there are no false positives (good) but also no true positives (bad), giving a power of 0. If α = 1, H0 is always rejected and all study results are positive: the only possible outcomes are false positives (bad) and true positives (good), giving a power of 1. Increasing the number of negative results increases both true negatives and false negatives, while increasing the number of positive results increases both true positives and false positives; more false negatives decrease power and more true positives increase power.

A more intuitive analogy compares the level of α to the amount of evidence required to convict a defendant at trial. The lower the α, the more evidence is needed to convict, resulting in fewer false positive convictions but also fewer true positive convictions and lower power. The higher the α, the less evidence is needed to convict, resulting in more false positive convictions but also more true positive convictions and higher power.
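As a rough illustration of this trade-off, the sketch below (an assumption-laden simulation, not from the article) reruns the same hypothetical two-arm trial at α = 0.05 and α = 0.01, once with no real difference to show the false positive rate and once with a real difference to show the power.

```python
# Hedged sketch: lowering alpha lowers the false positive rate under H0,
# but also lowers power when a real difference exists.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

def rejection_rate(true_diff, alpha, sd=10.0, n_per_arm=30, n_trials=5_000):
    """Fraction of simulated trials in which H0 is rejected at the given alpha."""
    hits = 0
    for _ in range(n_trials):
        control = rng.normal(0.0, sd, n_per_arm)
        treated = rng.normal(true_diff, sd, n_per_arm)
        if ttest_ind(treated, control).pvalue < alpha:
            hits += 1
    return hits / n_trials

for alpha in (0.05, 0.01):
    fpr = rejection_rate(true_diff=0.0, alpha=alpha)   # H0 true: rejections are false positives
    pwr = rejection_rate(true_diff=5.0, alpha=alpha)   # H0 false: rejections are true positives
    print(f"alpha = {alpha}: false positive rate ~ {fpr:.3f}, power ~ {pwr:.2f}")
```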

Sample size

Power increases as sample size increases. Increasing the sample size decreases the standard error of the estimated effect, which decreases the variability of the result. Because effect size and variability are determined by the underlying biology and α is almost universally set at 0.05, the sample size is the easiest clinical trial parameter for investigators to control.

Again, an analogy makes the effect of sample size easier to grasp. A coin that landed heads on 3 consecutive tosses would be unlikely to raise suspicion of an unfair coin, but 30 or even 300 heads in a row would be convincing evidence of the coin's unfairness.
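The coin analogy can be put in numbers, and the same machinery can be turned around to ask how many patients are needed for a target power. A hedged sketch, again assuming statsmodels and an illustrative effect size of d = 0.5:

```python
# Hedged sketch: the coin analogy in numbers, then the per-arm sample size
# needed for 80% power at an assumed effect size of d = 0.5 and alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

# Probability that a fair coin lands heads on every one of k tosses
for k in (3, 30, 300):
    print(f"{k} heads in a row from a fair coin: {0.5 ** k:.1e}")

# Leaving nobs1 unspecified tells solve_power to solve for the sample size per arm
n_per_arm = TTestIndPower().solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print(f"patients needed per arm: about {n_per_arm:.0f}")
```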

Caveat

It is important to remember that this discussion of power considers only the influence of random error and excludes other factors that influence clinical trial results, such as bias.

Post hoc power analysis

The use of post hoc power analysis (i.e. calculating power after the study has concluded) is controversial as it is thought to be unreliable 2. 


Tags changed:

  • research
