5 – Effect sizes



Meta-analyses are by definition quantitative, and thus you must have quantitative data with which to test your research question. To do this, you must decide what the effect is that you are investigating. After you clearly define the effect, you can determine how you will estimate effect sizes from the data.

What is an effect size?

Now, what exactly does ‘effect size’ mean in this context? There are two main ways of thinking about effect sizes: 1) in modern meta-analysis, an effect size is essentially an estimate of the magnitude of the effect of the predictor variable (or treatment) of interest; 2) in some of the early, simpler forms of meta-analysis, such as vote counting and combining probabilities, the effect size can instead be interpreted as a statistical measure of the degree to which the null hypothesis is false. Effect sizes can be calculated in a variety of ways depending on characteristics of the data and the variables of interest (e.g., is the predictor variable continuous or categorical? Is the response variable binary or continuous?). We call these different forms of effect sizes “effect size metrics.” In an experimental study with a categorical predictor variable, the effect size is the response of one group relative to the response of another (e.g., treatment vs. control). In studies where the relationship between the response of interest and the predictor variable is measured on continuous scales (such as a study of the response of metabolic rate to temperature across a gradient of temperatures), the effect size could be either the strength (and direction) of the relationship between the predictor and response variables (i.e., the correlation coefficient, \(r\)), or an estimate of a parameter in a biological model, such as the slope of the line describing the relationship between the two variables.

When conducting a meta-analysis, the goal is to calculate the effect size from each of the primary studies being synthesized, and then use those effect sizes to estimate the overall effect size of the treatment of interest across the studies. This estimate of overall effect size and its variability across studies is then used to determine whether the effect of interest is significant. It also allows you to determine how large the effect is, and whether it is heterogeneous across studies. Sometimes the primary studies comprising a meta-analysis have different study designs and data structures, requiring different effect size metrics (more on how to calculate different effect size metrics later). These different effect size metrics can sometimes be converted into a single metric—a common currency so they can be compared on the same scale (apples to apples and not apple to oranges). However, sometimes no accurate conversion exists, and the different effect size metrics must be analyzed separately.

There are two main types of effect size metrics used in meta-analysis, which differ in the scale on which the treatments or groups are compared to one another: non-standardized (i.e., raw) or (error-)standardized. Non-standardized metrics keep the same scale as the original data, and thus have units. Standardized metrics are normalized by an error term so that they don’t have units, and thus individual study effect sizes are easy to compare to one another without having to convert them to a common scale. Within these two types, effect size metrics can then differ in whether the effect size is calculated based on the difference between two groups, or their ratio. These categories can be arranged in the following 2 x 2 matrix, which has example effect size metrics for each category within each cell:



Within these four broad forms, effect size metrics can also differ in whether they incorporate state variables into the \(\bar{X}\) term (e.g., body mass at the end of an experiment), vs. derived processes or rates (e.g., growth rate). We’ll describe these different combinations of forms in more detail below, including their advantages and disadvantages, and when it is appropriate to use each. We’ll also provide some examples of common effect size metrics that fall within each category, and the equations used to calculate them.

Overview of common effect size metrics

Standardized (i.e., statistically-derived) effect sizes

Standardized effect size metrics are the most ubiquitous effect sizes used in ecology. They include p-values and test statistics: \(r\), \(F\), \(Z\), \(t\), \(\chi^2\), as well as Hedges’ \(d\); hence they can also be thought of as statistically-derived effect sizes. These effect size metrics are generally standardized by an error term (typically the standard deviation of a study or a pooled standard deviation), and thus they are dimensionless (units of the denominator cancel out units in the numerator). This standardization by the error permits the direct comparison of all the effect sizes in the meta-analysis, because their individual scales have been removed. It also means this type of effect size is essentially a signal-to-noise ratio. Although these features make this type of effect size convenient to work with, it has several disadvantages.

The first disadvantage of this type of effect size is that they are essentially aggregate tests of the null hypothesis, i.e., that there is no effect (Osenberg et al. 1999), which many researchers have argued is both a relatively uninformative hypothesis and biologically uninteresting. Since the goal of modern meta-analysis is often to quantify the magnitude of the effect of interest, an approach that does not confound the signal of an effect with the noise around it is often more useful. Understanding why these effect size metrics can be viewed this way can be confusing, so let’s go back to the basics, starting with p-values and their interpretation.

P-values

P-value-based effect sizes are the simplest effect size metrics used in meta-analysis. They are used in vote counting and combined probability (e.g., Fisher’s \(p\)) approaches to meta-analysis, both of which are historical methods that are used less regularly today.

What is a p-value?

  • A p-value is the probability of getting the observed results (or more extreme ones) if the null hypothesis (\(H_0\)) is correct, given the observed variation
  • P-values give you the strength of evidence against the null hypothesis (\(H_0\)). In a study where \(p = 0.05\), there is stronger evidence against \(H_0\) (i.e., stronger evidence for rejecting it) than in a study where \(p = 0.50\)

P-values are used as effect size metrics in the following two ways:

Vote counting

Vote counting involves finding studies that test a similar hypothesis/effect, tallying up the results of each study as significant (positive or negative) or non-significant, determining the proportion in each category, and then comparing those proportions to the null expectation (e.g., 5% significant and 95% n.s.). Examples of vote counting include:

  • Connell. 1983. On the prevalence and relative importance of interspecific competition: Evidence from field experiments. Am. Nat. 122: 661-696.
  • Schoener. 1983. Field experiments on interspecific competition. Am. Nat. 122: 240-285.
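
To make the tally-and-compare step concrete, a minimal sketch in Python: the proportion of significant results is compared to the 5% expected under the null with a one-tailed binomial test. The counts are hypothetical, made up for the example.

```python
from math import comb

def vote_count_test(n_significant, n_studies, alpha=0.05):
    """One-tailed binomial test: probability of observing at least
    `n_significant` significant results out of `n_studies` if each
    study had only an `alpha` chance of significance under the null."""
    return sum(
        comb(n_studies, k) * alpha**k * (1 - alpha)**(n_studies - k)
        for k in range(n_significant, n_studies + 1)
    )

# Hypothetical tally: 7 of 20 studies reported a significant effect
p = vote_count_test(7, 20)
print(p)  # far below 0.05: more significant results than chance predicts
```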

[Future: link to page with example calculations]

Combined probabilities – Fisher’s \(p\)

A slightly more complicated effect size metric based on p-values is Fisher’s combined probability, or Fisher’s \(p\). This test incorporates more information than vote counting, and thus is a more powerful test of the null hypothesis. It directly uses the p-values calculated in primary studies as a measure of the amount of support that each study found against the null hypothesis. Thus, Fisher’s \(p\) uses continuous rather than discrete (sig. vs. n.s.) information from individual studies. Fisher’s \(p\) can be calculated using the following equation: \[T = -2\sum_{i=1}^{n}\ln(P_i)\] where \(n\) is the number of studies, \(P_i\) is the p-value for study \(i\), and \(T\) is tested against a \(\chi^2\) distribution with \(df = 2n\).
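
A minimal sketch of Fisher’s method in Python, using only the standard library (since \(df = 2n\) is always even, the \(\chi^2\) survival function has a simple closed form). The three p-values are hypothetical.

```python
from math import log, exp

def fishers_method(p_values):
    """Combine p-values with Fisher's method: T = -2 * sum(ln(p_i)),
    tested against a chi-squared distribution with df = 2n."""
    n = len(p_values)
    T = -2 * sum(log(p) for p in p_values)
    # Chi-squared survival function for even df = 2n:
    # P(X > T) = exp(-T/2) * sum_{k=0}^{n-1} (T/2)^k / k!
    term, total = 1.0, 0.0
    for k in range(n):
        total += term
        term *= (T / 2) / (k + 1)
    return T, exp(-T / 2) * total

# Three hypothetical primary-study p-values
T, combined_p = fishers_method([0.04, 0.10, 0.20])
print(round(T, 2), round(combined_p, 3))  # T ≈ 14.26, combined p ≈ 0.027
```

Note that no individual study is significant at \(p = 0.05\), yet the combined evidence against the null is.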

[Future: link to page with example calculations]

Problems with p-values

P-values and the effect size metrics based on them have the advantage of being flexible and simple. However, they also have some significant disadvantages. First and foremost, they cannot tell you the magnitude of the effect. Instead, p-values only tell you whether there is an effect. For example, in the figure below, the second study has a significant effect, but this effect is very small (i.e., it’s close to zero on the y-axis). In contrast, the fourth study is also significant, but has a much larger effect size. These differences in effect sizes could have important biological consequences (e.g., if the effect sizes were differences in growth rates). However, using only p-values, you could not distinguish the effect sizes of these studies from one another. Thus, p-values can give you information on statistical significance, but not biological significance.



 

Other standardized effect size metrics

As we mentioned before, more informative effect size metrics allow the magnitude of the effect to be determined. Modern meta-analytic effect size metrics try to estimate this magnitude, but do so in a variety of ways. Standardized effect size metrics provide a standardized and scaled estimate of this magnitude, which we discuss below in the context of the two most common standardized effect size metrics, Hedges’ \(d\) and the correlation coefficient (\(r\)).

Standardized difference between means

This type of effect size adjusts for differences in scale among studies, and/or differences in variance among studies and groups. Common formulations for this type of metric include Glass’s \(\Delta\), Hedges’ \(g\), Cohen’s \(d\), and Hedges’ \(d\) (Hedges and Olkin 1985), the latter of which is the most commonly used effect size metric in modern meta-analysis.

Hedges’ d

Hedges’ \(d\) is the difference between the means of the experimental treatment group (\(\bar{X}_E\)) and the control group (\(\bar{X}_C\)), standardized by dividing by the pooled standard deviation (\(s\)) of both groups, all of which is adjusted for sample size by a correction term \(J\):
\[d = \frac{\bar{X}_E - \bar{X}_C}{s}J\]

The pooled standard deviation incorporates the sample size of each group (\(n_E\) and \(n_C\)), as well as their variances (\(s_E^2\) and \(s_C^2\)): \[s = \sqrt{\frac{(n_E - 1)s_E^2 + (n_C - 1)s_C^2}{n_E + n_C - 2}}\]

The sample size correction term incorporates the sample sizes of each group: \[J = 1 - \frac{3}{4(n_C + n_E - 2)}\]

The variance of \(d\) is: \[V_d \approx \frac{n_C + n_E}{n_C n_E} + \frac{d^2}{2(n_C + n_E)}\] Note that this variance equation is more complex than a simple variance calculation done for a single study, and it will differ for different effect size metrics. We’ll discuss how to derive the correct variance for a given effect size metric using the Delta Method in Deriving effect size variance. However, for many common effect size metrics, these equations are already published, so you can simply look them up.
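
Putting the three equations together, a minimal Python sketch computing Hedges’ \(d\) and its variance from summary statistics (the numbers below are hypothetical):

```python
from math import sqrt

def hedges_d(xe, xc, se, sc, ne, nc):
    """Hedges' d and its variance from group means (xe, xc),
    standard deviations (se, sc), and sample sizes (ne, nc)."""
    # Pooled standard deviation
    s = sqrt(((ne - 1) * se**2 + (nc - 1) * sc**2) / (ne + nc - 2))
    # Small-sample correction term J
    J = 1 - 3 / (4 * (ne + nc - 2))
    d = (xe - xc) / s * J
    # Approximate sampling variance of d
    v = (ne + nc) / (ne * nc) + d**2 / (2 * (ne + nc))
    return d, v

d, v = hedges_d(xe=12.0, xc=10.0, se=2.0, sc=2.0, ne=10, nc=10)
print(round(d, 3), round(v, 3))  # d ≈ 0.958, variance ≈ 0.223
```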

Limitations or issues with \(d\)

Although \(d\) has several advantages, such as the ease with which it can be applied to data, it has a few issues that a meta-analyst should be aware of before choosing it as an effect size metric. Looking at the first equation for \(d\), you can see that it provides a signal (\(\bar{X}_E - \bar{X}_C\)) to noise (\(s\)) ratio, meaning differences in \(d\) may reflect either differences in the magnitude of the effect (the signal) or differences in variation among groups (the noise). Thus, it is similar to a null hypothesis test (think back to the figure in the p-value section). Furthermore, dividing by an error term, and thereby removing the scale from the effect size, makes it very hard to interpret any biological meaning from \(d\) (i.e., how do you compare its value to nature?). This is a major issue with its interpretation!

Finally, (d) also runs into some issues with bias depending on the weighting scheme used with it. We will talk about this issue more in the [Statistical model & weights module].

 

Correlation coefficient, \(r\)

The correlation coefficient (\(r\)) is another commonly used effect size metric in ecology. It is typically used for studies with a continuous predictor variable (regression-based analyses), although \(r\) can also be calculated from various test statistics such as \(t\), \(F\), or \(\chi^2\) (see examples below; more equations in Koricheva, Gurevitch, and Mengersen 2013). \(r\) measures the strength and direction of the relationship between two variables. It ranges between -1 and 1, and assumes that the variables have a linear relationship. Negative values indicate a negative relationship, positive values indicate a positive relationship, and \(r = 0\) indicates no relationship. This metric is also useful because it can be converted to Hedges’ \(d\), allowing primary studies using both continuous and categorical predictor variables to be incorporated into a single meta-analysis.

Linear regression: \[r = \beta \frac{SD_x}{SD_y}\]

One-way ANOVA: \[r = \sqrt{\frac{F}{F + df}}\]

Chi-squared: \[r = \sqrt{\frac{\chi^2}{N}}\]

Converting \(r\) to Hedges’ \(d\): \[d = \frac{2r}{\sqrt{1 - r^2}}\]
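
A quick Python sketch of the ANOVA and \(r\)-to-\(d\) conversions above (the \(F\) and \(df\) values are hypothetical):

```python
from math import sqrt

def r_from_anova(F, df):
    """Correlation coefficient from a one-way ANOVA F statistic."""
    return sqrt(F / (F + df))

def r_to_d(r):
    """Convert a correlation coefficient to Hedges' d."""
    return 2 * r / sqrt(1 - r**2)

r = r_from_anova(F=9.0, df=16)
print(round(r, 3), round(r_to_d(r), 3))  # r ≈ 0.6, d ≈ 1.5
```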

Limitations of \(r\)

\(r\) has some limitations that are similar to those of \(d\), as well as some other issues that we’ll deal with in more detail in class, and in the weighting module. [Future: link to weighting module]

 

Summary of standardized (i.e., statistically-derived) effect size metrics

Standardized effect sizes are widely used, flexible metrics for meta-analysis. They allow easy comparisons across primary studies because they remove the scale (i.e., units), and many are easily convertible from one effect size metric to another. However, this also makes them difficult to compare in an absolute sense to real-world rates, states, or other observable and biologically meaningful values. They all incorporate \(1/(an~error~term)\) in their calculation, which means they represent a signal (the observed effect size) to noise (the error) ratio. This can be a useful way to think about some relationships, when variance might be an important component of an effect (e.g., genetics), but it can also mean signal and noise are confounded with one another, such that the contribution of either to the result can’t be isolated. When variation is not considered part of an effect, and is instead thought of as just uncertainty around it due to experimental error, random processes, etc., incorporating it into an effect size calculation is philosophically questionable. Despite these flaws, standardized effect sizes can be useful as more powerful combined tests of the null hypothesis. However, if the goal is to compare effects and link theory to data, other effect sizes may be preferable.

 

Non-standardized (i.e. raw) effect size metrics

The other major category of effect size metrics is the raw, or non-standardized, metrics. These metrics are not based on test statistics, and do not incorporate an error term into the effect size calculation (error is accounted for later through weighting). This can be advantageous when the meta-analyst wants to include studies that don’t report variance data (e.g., large-scale whole-system studies without replication, like whole-lake experiments). Additionally, because there is no error term, these metrics do not depend on the signal-to-noise ratio of a biological process.

This category can be further divided into two sub-types, difference-based vs. ratio-based effect sizes, which differ in whether the responses of the two groups being compared are calculated as a difference (e.g., \(X_E - X_C\)) or as a ratio (e.g., \(X_E / X_C\)). Which effect size metric you use depends on characteristics of the data, such as the distribution of the effect sizes collected across all the primary studies, the type of response variable (is it continuous, a count, or a proportion bounded between 0 and 1?), and whether you think the process you are investigating is additive or multiplicative. See below for details on when you would choose each sub-type, and examples of common effect size metrics in each.

 

Difference-based metrics

There is one main non-standardized effect size metric based on differences: the response difference, also known as the raw mean difference.

Response difference:

\[\bar{X}_E - \bar{X}_C\]

This is one of the simplest effect sizes, which provides a simple difference in the mean response of two different groups to a predictor of interest. It is particularly useful when knowing the absolute effect of a treatment is important. For example, if the goal is to understand how some management activity affects species richness, it may be more useful to understand how many species are lost in one management treatment vs. another, rather than just knowing the proportion of species lost (because a management action that saves 50 species out of 100 might be more meaningful than an action that only saves 4 species out of 8).

Limitations of the response difference

However, despite the ease of calculating this effect size metric, it has some limitations that make other metrics, like the log response ratio, preferable in certain contexts. The response difference is best used when the process of interest is believed to be additive (like a linear growth function), not for multiplicative processes (e.g., exponential growth). Additionally, since this is a non-standardized metric, the units matter, and thus all studies should be converted to the same scale before their effect sizes are summarized. Finally, the range of possible effect sizes is limited by the values of \(X_E\) and \(X_C\).

 

Ratio-based metrics

Ratio-based effect sizes are useful when it is informative to consider the proportional response of one group relative to another, and knowing the absolute difference is less important. There are several effect size metrics in this category, which vary in whether they work with discrete or continuous response data, and whether the process generating the data is additive or multiplicative.

 

Response ratio

The response ratio is the simple ratio of the mean response of one group relative to another: \[\bar{X}_E / \bar{X}_C\] One advantage of this effect size metric is that primary study results are often reported as response ratios, which means such studies can still be included even when they don’t report the raw means of the groups being compared. Additionally, the units conveniently cancel out in this effect size metric, making it fairly easy to compare across studies.

Limitations of the response ratio

However, this metric often results in data with statistically problematic characteristics, such as asymmetric distributions. This issue can often be dealt with using a log transformation, resulting in the next effect size metric, the log response ratio.

 

Log response ratio

The log response ratio, often abbreviated as lnRR, is the natural log of the ratio of the mean response of one group to the mean of another group (equivalently, the difference between the logs of the two means): \[\ln(\bar{X}_E / \bar{X}_C) = \ln(\bar{X}_E) - \ln(\bar{X}_C)\] The variance (\(V_{lnRR}\)) of lnRR is: \[V_{lnRR} = s^2_{pooled} \left( \frac{1}{n_E\bar{X}_E^2} + \frac{1}{n_C\bar{X}_C^2} \right)\] where \(s_{pooled}\) is the pooled standard deviation, \(n\) is the sample size of each group, and \(\bar{X}\) is the mean of each group.

This effect size metric is often more appropriate than the simple response ratio for many biological processes, for two related reasons: 1) many biological processes are multiplicative rather than additive, meaning the data must be log-transformed to make them linear (which enables them to be used in statistical analyses like linear regression); 2) effect sizes calculated using response ratios typically have asymmetric distributions when they are compiled across many studies (a characteristic of proportion data in general). This asymmetry violates assumptions of most parametric statistical tests. To remove this issue, effect sizes can be log-transformed to normalize their distribution. (Means or standard deviations calculated from these log-transformed values can later be back-transformed to the original scale to make the effect sizes more interpretable.)
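
A minimal Python sketch of lnRR and its variance using the equations above (the summary statistics are hypothetical):

```python
from math import log

def lnrr(xe, xc, s_pooled, ne, nc):
    """Log response ratio and its variance from group means (xe, xc),
    a pooled standard deviation, and sample sizes (ne, nc)."""
    effect = log(xe / xc)
    variance = s_pooled**2 * (1 / (ne * xe**2) + 1 / (nc * xc**2))
    return effect, variance

effect, variance = lnrr(xe=15.0, xc=10.0, s_pooled=3.0, ne=10, nc=10)
print(round(effect, 3), round(variance, 3))  # lnRR ≈ 0.405, variance ≈ 0.013
```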

Limitations of the log response ratio

This effect size metric can only be calculated when both the numerator and the denominator are non-zero and their ratio is positive. (Adding a small number like 0.001 to a zero value can eliminate the zero problem, but introduces a bias that sometimes affects the overall effect size estimate.)

 

Odds ratio

Odds ratios (OR) are effect sizes that are useful for contingency table data, i.e., data based on counts of events with binary outcomes (e.g., alive vs. dead) across two treatment groups. Odds are the probability of an event occurring (\(P_1\)) divided by the probability that the event does not occur (\(1 - P_1\)). Odds ratios are the odds of an event occurring in one group relative to the odds of the same event occurring in the second group: \[OR = \frac{P_1/(1-P_1)}{P_2/(1-P_2)}\]

[Future: click here for an example of the odds ratio]

Odds ratios are useful when trying to calculate the magnitude of an effect using bounded proportional data, because it gives you a metric that is independent of where you start on the response curve. [Future: add vignette illustrating this]

Odds ratios typically don’t have data distributions that are amenable to parametric statistical tests. Therefore, like response ratios, they can be log-transformed into log-odds ratios (lnOR), which can ameliorate these problems with the data distributions.
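
A minimal Python sketch computing the odds ratio and log-odds ratio from two hypothetical proportions (e.g., survival in a treatment vs. a control group):

```python
from math import log

def odds_ratio(p1, p2):
    """Odds ratio: the odds of the event in group 1 relative to group 2."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

# Hypothetical survival proportions: 0.8 in treatment, 0.5 in control
OR = odds_ratio(0.8, 0.5)
lnOR = log(OR)  # log-odds ratio, better behaved for parametric analyses
print(round(OR, 2), round(lnOR, 3))  # OR ≈ 4.0, lnOR ≈ 1.386
```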

 

Tying effect sizes to theory, or what goes into \(\bar{X}\)?

Besides these broad categories of effect size metrics, a meta-analyst should also consider what type of data they will extract from the primary studies to plug into \(\bar{X}\) in the effect size calculation. Some options include:

  • state variables (e.g., mass at the end of an experiment)
  • rates (e.g., growth rate)
  • parameter values (e.g., the finite rate of population increase, \(\lambda\))

What the meta-analyst decides to incorporate into their effect size metric for \(\bar{X}\) will influence which studies they can include in the meta-analysis, and what types of data processing and/or conversions they will need to perform on individual studies to get their effect sizes into a form that is consistent among studies.

This choice will depend on a variety of factors, such as common data structures in the primary studies, and the exact research question for the meta-analysis. However, to explicitly link an effect size estimate to theory or biology, the effect size should be defined by the specific process being investigated. For some questions this is easier than others, such as in fields with well-developed models for a process of interest. Let’s take an example from physiological ecology. In this field, scientists often put organisms in containers, like so, and measure the change in oxygen in the container for a set period of time:

Fig.: This is a tuna losing its mind as it continues to swim, but goes nowhere in a flume with a constant flow rate. As it does this, the concentration of oxygen in the water is continuously measured. Photo credit:?


 

The response that they report and compare between treatments, however, is the rate of change in oxygen, not the concentration of oxygen at the end of the incubation period. This rate is essentially the slope of the line of oxygen plotted against time. This seems like a fairly obvious choice, but there are many areas in ecology with less developed theory (or less consistently applied theory) where some studies report the equivalent of the oxygen concentration as their response. When conducting a meta-analysis, choosing the form of the response from primary studies that most explicitly links the effect size to biology or ecology (such as the respiration rate), rather than just a state of the system (e.g., oxygen levels), is preferred. Sometimes this means that studies that don’t report results in the preferred format have to be converted into the preferred form. When this is not feasible given the reported information, the study has to either be omitted from the meta-analysis, or multiple effect size metrics must be used and analyzed separately.
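
For instance, if a study only reports oxygen concentrations over time, the rate can be recovered as the least-squares slope of concentration against time. A minimal Python sketch (the measurements below are made up):

```python
def slope(times, values):
    """Least-squares slope of values regressed against times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den

# Hypothetical oxygen concentrations (mg/L) measured every hour
hours = [0, 1, 2, 3, 4]
oxygen = [10.0, 9.1, 8.0, 7.1, 6.0]
rate = slope(hours, oxygen)
print(round(rate, 2))  # respiration rate ≈ -1.0 mg/L per hour
```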

In relatively well-understood systems that have a mathematical model describing the process of interest, using a parameter from the model as the \(\bar{X}\) value is ideal. This allows the strength of a specific process to be quantified, which aids in the interpretation of results. We’ll cover some examples of this in class.

Limitations to parameter-based effect sizes

One limitation of this approach is that you can only use it on studies of systems that are described by the same model for the process of interest. Sometimes this is not known. In that case, there are two possible approaches. One is to make your best guess at what the model would be, and assume it fits across systems (e.g., populations of a species exhibit logistic growth across its entire range). If you cannot reasonably make this assumption, or in systems where you know that fundamentally different models occur for the same process, then using effect size metrics that compare state variables (i.e. measured outcome variables like the number of species in a plot) may be necessary. However, having at least a qualitative model for the relationship you are interested in is still important so that you can make sure your effect sizes are calculated from comparable ecological dynamics.

[Future: clarify section & add examples]

 

Conclusions

There are many considerations when deciding on an effect size metric. However, because of the important consequences for the results and subsequent conclusions drawn from a meta-analysis, choosing the best metric for the question and the data at hand is very important, and should be explicitly justified in any meta-analysis.

 

Later in the modules we’ll discuss how to make effect size calculations in R. For now, let’s keep covering the conceptual basics, so [click here] to proceed to the next section on incorporating effect sizes into statistical models, and how to weight effect sizes from individual studies.


Last updated: 2019, January 17  

References

  • Connell. 1983. On the prevalence and relative importance of interspecific competition: Evidence from field experiments. Am. Nat. 122: 661-696.
  • Hedges and Olkin. 1985. Statistical Methods for Meta-Analysis. Academic Press.
  • Koricheva, Gurevitch, and Mengersen (eds.). 2013. Handbook of Meta-Analysis in Ecology and Evolution. Princeton University Press. https://muse.jhu.edu/book/41629.
  • Osenberg et al. 1999. Resolving ecological questions through meta-analysis: Goals, metrics, and models. Ecology 80: 1105-1117.
  • Schoener. 1983. Field experiments on interspecific competition. Am. Nat. 122: 240-285.


