Exploring heterogeneity
Introduction to heterogeneity
One major goal of meta-analysis is to quantitatively summarize the effect sizes observed in multiple studies on the same topic. A second major goal is to explore whether these effect sizes (and the true effect sizes that they estimate with some error) are consistent across studies, and if not, to determine what factors lead to variation in the magnitudes of the true effects. In meta-analysis, this variation in the true effect sizes is known as heterogeneity. For example, it may be interesting to determine if the strength of an effect, such as the magnitude of a trophic cascade, is similar across studies conducted in different ecosystem types, or in studies with different experimental durations. To determine whether variation in the observed effect sizes is indicative of real differences in the true effect sizes, and to determine if there are moderating variables that explain the variation in true effect sizes, meta-analysts can conduct formal tests of heterogeneity.
Factors that can contribute to heterogeneity in (true) effect sizes:
- heterogeneity among studies
- heterogeneity among groups (e.g., studies conducted in different ecosystem types)
- covariates (e.g., treatment dose, experimental duration, or some other continuous predictor variable)
The basic concept behind heterogeneity tests is that the effect sizes observed in primary studies are estimates of true effects, and these estimates are measured with some error. Therefore, when we conduct heterogeneity tests we have to try to separate the variation in effect sizes due to measurement error from the variation due to real differences in the true effects.
Evaluating heterogeneity: (Q)
One common way of evaluating heterogeneity in classic meta-analysis is based on the (Q) statistic (also known as Cochran’s (Q)). The total variation of the observed effect sizes ((Q_{total})) is partitioned into two components, (Q_{model}) and (Q_{error}), i.e.: [Q_{total} = Q_{model} + Q_{error}]
(Q_{model}) is the amount of heterogeneity that is explained by the meta-analysis model, and is assumed to be due to real variation in the true effect sizes among studies, groups, and/or whatever moderating variables of interest are included in the model. If there is only one factor in the model (e.g., study), then (Q_{model}) would indicate the amount of heterogeneity in the true effect sizes among levels of that factor. If there are multiple variables in the model, then (Q_{model}) is a cumulative term. (Q_{error}) is due to random, unexplained variation (e.g., measurement error).
For a simple fixed effect model, total heterogeneity (Q_{total}) is calculated using the equation: [Q_{total} = \sum_{i=1}^{k} w_i (e_i - \bar{E})^2] Where:
- (\bar{E}) is the overall mean effect size
- (e_i) is the effect size of the (i)th study
- (w_i) is the weight for the (i)th study
- (k) is the total number of studies in the meta-analysis
- and the degrees of freedom ((df)) associated with this term are (k-1)
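The (Q_{total}) formula above can be sketched in a few lines of Python. This is a minimal illustration, not the metafor implementation: it assumes the usual fixed effect weights (w_i = 1/v_i), where (v_i) is the within-study variance, and the function name and data are hypothetical.

```python
def q_total(effects, variances):
    """Return (Q_total, weighted mean effect, df) for a fixed effect model."""
    # Assumption: inverse-variance weights, w_i = 1 / v_i
    weights = [1.0 / v for v in variances]
    mean_effect = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Weighted sum of squared deviations from the overall mean
    q = sum(w * (e - mean_effect) ** 2 for w, e in zip(weights, effects))
    return q, mean_effect, len(effects) - 1


# Hypothetical effect sizes and within-study variances for k = 5 studies
effects = [0.5, 0.8, 0.2, -0.3, 0.1]
variances = [0.04, 0.09, 0.05, 0.06, 0.03]

q, mean_effect, df = q_total(effects, variances)
print(f"Q_total = {q:.3f}, mean effect = {mean_effect:.3f}, df = {df}")
```

Studies with small variances get large weights, so a precise study that sits far from the overall mean contributes much more to (Q_{total}) than an imprecise one.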
In a random effects model the formula for (Q_{total}) would be the same, except the weights would be the random effects versions that incorporate both within- and among-study variance.
*Note that (Q) requires estimates of within-study variance to be able to parse out “true” variation in the observed effect sizes from variation due to sampling error. Therefore, parametric (i.e., variance-based) weights are required for its calculation. Other methods must be used to evaluate heterogeneity in models incorporating non-parametric weights, such as model comparison (e.g., comparing a model including study ID as a predictor variable with a model that does not include study ID). See the metafor website for some examples.
Heterogeneity among studies
Generally, unless all the primary studies in your meta-analysis are conducted under a very narrow set of conditions and procedures, it is reasonable to expect some degree of variation in the observed effect sizes among studies, such as in Fig. 1:
Fig. 1) Example of potential heterogeneity among studies in a meta-analysis. Blue circles represent the observed effect sizes, with bars representing their 95% CI.
A heterogeneity test is a way to determine whether these differences in the observed effect sizes are caused by differences in the true effect sizes among studies in the meta-analysis, or if they are instead due to sampling error within individual studies. If there are no groups or explanatory variables in your model (e.g., when you are just exploring heterogeneity among the effect sizes, which you assume under the null hypothesis all come from the same population), then (Q_{model} = 0), because you assume the true effect is the same across all studies. When (Q_{model} = 0), all heterogeneity in (Q_{total}) comes from (Q_{error}).
To determine if the amount of variation in the observed effect sizes is likely given the assumption that they all come from the same population, you can compare (Q_{total}) to a (\chi^2) distribution with (df = k - 1). If (Q_{total}) is larger than predicted for a given (\alpha) (e.g., if your p-value is < 0.05), then this indicates that you have more unexplained variation than would be expected by chance if (Q_{model} = 0). In this case, you can infer that the true effect sizes differ among your studies, and thus using a fixed effect model is not appropriate for the data. Instead, you should use a random effects model.
However, as mentioned in the previous section on Statistical models, heterogeneity tests based on (Q) have low power when there are few studies in the meta-analysis, so a non-significant p-value does not necessarily mean that the true effect sizes are the same across studies; the test could simply not have enough power to detect significant differences. Conversely, (Q) tests can have too much power if there are many studies in the meta-analysis, especially if some are very large, meaning the test can flag statistically significant heterogeneity even when the differences among the true effect sizes are negligibly small (Higgins et al. 2003).
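As a rough sketch of this comparison, the snippet below converts a (Q_{total}) value into a p-value from the upper tail of a (\chi^2) distribution with (k - 1) degrees of freedom. The helper `chi2_sf` is a hand-rolled, standard-library stand-in written for this example (in practice a statistics library such as `scipy.stats.chi2.sf` would normally be used), and the (Q) value and number of studies are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square distribution.

    Illustrative helper for positive integer df only. It uses the recurrence
    Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with y = x / 2.
    """
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, prob = 1.0, math.exp(-y)              # Q(1, y)
    else:
        a, prob = 0.5, math.erfc(math.sqrt(y))   # Q(1/2, y)
    while a < df / 2.0:
        prob += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return prob


# Hypothetical example: Q_total = 12.3 from k = 5 studies
q_total_value = 12.3
k = 5
p_value = chi2_sf(q_total_value, k - 1)
print(f"Q_total = {q_total_value}, df = {k - 1}, p = {p_value:.4f}")
```

A small p-value here only says that the observed spread is larger than sampling error alone would be expected to produce; it does not say how large the heterogeneity is.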
Heterogeneity among groups
In addition to testing for among-study differences in effect sizes, sometimes there are groups within the data that could hypothetically explain some of the variation in effect sizes observed across studies. For example, in a meta-analysis on the effects of maternal temperature environment on the response of offspring to elevated temperature, there could be differences in the effect sizes of ectotherms vs. endotherms.
Fig. 2) Example of potential heterogeneity among groups in a meta-analysis. Blue circles represent the observed effect sizes, with bars representing their error. Studies 1-3 were conducted on ectotherms, while studies 4-5 were conducted on endotherms.
You can see in Fig. 2 that the effect sizes for the studies performed on ectotherms cluster above zero, while the effect sizes for the studies on endotherms cluster below zero. This pattern indicates that the groups ectotherm vs. endotherm might drive significant heterogeneity in the effect sizes. To test this, you could use thermo-regulatory type as a moderating variable in the meta-analysis model. Then, you could estimate the amount of variation explained by thermo-regulatory type using (Q_{model}), i.e., the amount of among-group variation, with (Q_{error}) representing leftover unexplained variation. Another way to think about this is to rewrite (Q_{model}) as (Q_{groups}).
[Q_{total} = Q_{groups} + Q_{error}]
Thus, (Q) operates similarly to ANOVA, where variation in the form of sums of squares (SS) is partitioned across groups (i.e., treatment levels) and error terms, e.g., (SS_{total} = SS_{groups} + SS_{error}). These SS terms are then compared to determine if the variance explained by the groups is greater than the unexplained error variance (also known as residual variance). (Q) is simply a weighted SS, and thus, as in ANOVA, you are testing whether heterogeneity is greater among the groups than it is within them. In ANOVA this would be done using the F-ratio (the variation among groups divided by the variation within groups), which you then compare to an F distribution to determine the p-value. In meta-analysis, this is done by comparing (Q_{groups}) to a (\chi^2) distribution.
Thus, another way to think about this for groups is: [Q_{total} = Q_{among} + Q_{within}]
To calculate (Q_{groups}) for a fixed effect model: [Q_{groups} = \sum_{j=1}^{m} \sum_{i=1}^{n_j} w_{ij} (\bar{E}_j - \bar{E})^2] Where:
- (\bar{E}) is the overall mean effect size (specifically its estimate based on observed effect sizes)
- (\bar{E}_j) is the mean effect in group (j) (based on averaging the (e_{ij}) in that group)
- (w_{ij}) is the weight for the (i)th study in group (j)
- (m) is the number of groups
- (n_j) is the number of studies in group (j)
- Total number of studies is: (k = \sum_{j=1}^{m} n_j)
- (df = m - 1) for this term
If (Q_{groups}) is not significant, it indicates that this moderating variable does not explain a significant amount of the heterogeneity in effect sizes in the meta-analysis model. This result can be compared with (Q_{error}). If (Q_{error}) has a significant p-value, then the effect sizes still have some significant unexplained heterogeneity (and therefore the meta-analyst might want to investigate some other moderating variables).
(Q_{error}), i.e., the residual heterogeneity, is calculated this way: [Q_{error} = \sum_{j=1}^{m} \sum_{i=1}^{n_j} w_{ij} (e_{ij} - \bar{E}_j)^2]
- (df = k - m) for this term (i.e., the total number of studies minus the number of groups)
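The partition (Q_{total} = Q_{groups} + Q_{error}) can be checked numerically. The sketch below is illustrative only: it assumes fixed effect, inverse-variance weights and weighted group means, and the group labels, effect sizes, and variances are made up to mimic the ectotherm vs. endotherm example.

```python
def weighted_mean(effects, weights):
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


def partition_q(groups):
    """Partition Q_total into Q_groups and Q_error.

    `groups` maps a group label to a list of (effect size, variance) pairs.
    Assumption: fixed effect weights, w_ij = 1 / v_ij.
    """
    all_effects = [e for studies in groups.values() for e, _ in studies]
    all_weights = [1.0 / v for studies in groups.values() for _, v in studies]
    grand_mean = weighted_mean(all_effects, all_weights)

    q_groups = 0.0
    q_error = 0.0
    for studies in groups.values():
        effects = [e for e, _ in studies]
        weights = [1.0 / v for _, v in studies]
        group_mean = weighted_mean(effects, weights)
        # Among-group term: every study carries its group's deviation
        q_groups += sum(w * (group_mean - grand_mean) ** 2 for w in weights)
        # Within-group term: each study's deviation from its own group mean
        q_error += sum(w * (e - group_mean) ** 2 for w, e in zip(weights, effects))

    q_total = sum(w * (e - grand_mean) ** 2 for w, e in zip(all_weights, all_effects))
    k, m = len(all_effects), len(groups)
    return {"q_total": q_total, "q_groups": q_groups, "q_error": q_error,
            "df_groups": m - 1, "df_error": k - m}


# Hypothetical data: (effect size, within-study variance) per study
groups = {
    "ectotherm": [(0.6, 0.04), (0.4, 0.05), (0.7, 0.06)],
    "endotherm": [(-0.3, 0.05), (-0.5, 0.04)],
}
result = partition_q(groups)
print(result)
```

Each of (Q_{groups}) and (Q_{error}) would then be compared to a (\chi^2) distribution with its own degrees of freedom.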
This approach of using (Q_{groups}) to evaluate the ability of categorical moderating variables to explain heterogeneity in the effect sizes can be expanded to models with multiple moderators, or hierarchical models, where you have one set of groups nested within another (e.g., sites nested in regions). It can also be extended to random effects models. However, although the hypothesis testing and interpretation of the different (Q)’s is still the same, the calculation of each (Q) becomes slightly more complex for these more complicated models (see Koricheva, Gurevitch, and Mengersen 2013 for more details).
Evaluating heterogeneity due to continuous covariates
(Q) can also be applied to models with continuous variables (i.e., covariates), which may drive variation in the effect sizes observed across studies. This is analogous to a weighted regression analysis, and is known in meta-analysis as a meta-regression. For example, experimental duration might be related to the effect size observed in different studies:
Fig. 3 shows a meta-regression using the effect of experimental duration (x) on the log-risk ratio (y). Points represent effect sizes observed in primary studies. The size of each point indicates the relative weight of that effect size (larger = higher weight).
In this type of model, the mean effect size of each study (\bar{E}_i) varies as the value of the covariate (X_i) changes, with a slope (\beta_1), i.e.:
[\bar{E}_i = \beta_0 + \beta_1 X_i + \varepsilon_i] Just like the formulation for a regression model, (\beta_0) is the intercept of the line, and (\varepsilon_i) represents residual error for study (i).
Significance for the different components of heterogeneity is determined the same way as with the categorical moderating variables, by comparing (Q) to a (\chi^2) distribution with the corresponding degrees of freedom ((df)). The formula for (Q_{total}) is the same as before (Eqn. X). However, the calculations for (Q_{model}) and (Q_{error}) change.
(Q_{model}) changes to evaluate heterogeneity associated with the slope (\beta_1): [Q_{model} = \frac{\beta_1^2}{s^2_{\beta_1}}]
Where (s_{\beta_1}) is the standard error of (\beta_1) (so (s^2_{\beta_1}) is its square), and (df = 1). See Koricheva, Gurevitch, and Mengersen (2013, Ch. 9) for the formula for (s_{\beta_1}). Many statistical packages like metafor will calculate it for you, and provide you with the correct (Q_{model}) output.
If there is no heterogeneity explained by the covariate, then (\beta_1) should equal zero. To test the null hypothesis that (\beta_1 = 0), calculate (Q_{model}) and compare it to a (\chi^2) distribution as before.
Alternatively, if you want to test whether (\beta_1) or (\beta_0) is significantly different from (0), then you can divide the estimate by its standard error ((\beta / s_{\beta})), and compare the result to a normal ((z)) distribution. (So there are two ways to test if (\beta_1 = 0), which yield equivalent results, because the square of this (z) value equals (Q_{model}).)
(Q_{error}) has a very complicated formula, so it is easiest to calculate it as: [Q_{error} = Q_{total} - Q_{model}] With (df = k - 2), where (k) is the total number of studies.
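To make the meta-regression quantities concrete, here is a minimal sketch of a fixed effect, single-covariate meta-regression fitted by weighted least squares. It assumes inverse-variance weights with known within-study variances, in which case the squared standard error of the slope is 1 / sum(w_i (X_i - weighted mean of X)^2). The function name and data are hypothetical, and this is not how metafor is implemented internally.

```python
import math


def meta_regression(x, effects, variances):
    """Fixed effect meta-regression with one covariate (weighted least squares)."""
    w = [1.0 / v for v in variances]  # assumption: w_i = 1 / v_i
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    e_bar = sum(wi * ei for wi, ei in zip(w, effects)) / sw

    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (ei - e_bar) for wi, xi, ei in zip(w, x, effects))

    beta1 = sxy / sxx
    beta0 = e_bar - beta1 * x_bar
    var_beta1 = 1.0 / sxx                      # squared standard error of the slope

    q_total = sum(wi * (ei - e_bar) ** 2 for wi, ei in zip(w, effects))
    q_model = beta1 ** 2 / var_beta1           # df = 1
    q_error = q_total - q_model                # df = k - 2
    z = beta1 / math.sqrt(var_beta1)           # z test for the slope
    return {"beta0": beta0, "beta1": beta1, "se_beta1": math.sqrt(var_beta1),
            "q_total": q_total, "q_model": q_model, "q_error": q_error,
            "df_error": len(effects) - 2, "z": z}


# Hypothetical data: experimental duration (days), effect sizes, variances
duration = [10, 20, 30, 40, 50]
effects = [0.1, 0.3, 0.2, 0.5, 0.6]
variances = [0.04, 0.05, 0.03, 0.06, 0.04]

fit = meta_regression(duration, effects, variances)
print(fit)
```

In this setting (Q_{error}) obtained by subtraction matches the weighted sum of squared residuals around the fitted line, and the squared (z) statistic for the slope equals (Q_{model}).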
Heterogeneity tests in metafor
metafor will calculate (Q) and conduct the appropriate significance test based on (df) for you.
We will go over how to do this in class.
Summary of (Q)
(Q) provides a formal way to test for heterogeneity in effect sizes, whether due to differences among studies, or due to other moderating variables (categorical or continuous). It relies on partitioning variance components, and thus it requires parametric (variance-based) weights. Additionally, (Q) is on a standardized scale, meaning it is not affected by the effect size metric. However, this scale-free nature means the amount of heterogeneity that it represents is harder to interpret, as it is not in the same units as the original data. Furthermore, (Q) is a sum, and therefore it cannot be compared across meta-analyses because its value depends on the number of studies in the meta-analysis. Finally, although (Q) allows the evaluation of whether or not there is significant among-study variation in effect sizes, frequent issues with power can lead to the wrong conclusion, so it is not recommended as a tool for deciding whether to use a fixed effect model or a random effects model.
Other measures of heterogeneity
In addition to (Q), other metrics have been developed for quantifying heterogeneity in a meta-analysis. The main ones include (T^2), (T), and (I^2). These metrics all have their benefits and limitations, which we will briefly discuss in the following sections.
(T^2)
(T^2), which we discussed in the Statistical models section, estimates the among-study variance, based on the observed effect sizes and their within-study variance. Thus, it is used to estimate variation in the true effect sizes, and provides an absolute value for this estimate. However, it does not quantify how much of the observed variation is real and how much is just sampling/measurement error. It is not sensitive to the number of studies, but it is sensitive to the effect size metric because it is calculated on the same scale as the observed effect sizes. There are multiple ways to calculate (T^2). The DerSimonian and Laird method is one of the most common and the simplest to calculate; however, it has some drawbacks in certain contexts [Future: add more here]. Increases in computational efficiency have made other methods practical, including restricted maximum likelihood (REML). Due to the benefits of this approach [Future: add more here], many statistical programs, including metafor, are beginning to default to REML.
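For illustration, the DerSimonian and Laird estimate of (T^2) can be sketched as below, using the standard method-of-moments form (T^2 = max(0, (Q - df) / C)) with (C = sum(w_i) - sum(w_i^2) / sum(w_i)) and fixed effect weights (w_i = 1/v_i). The function name and data are hypothetical, and REML (which requires iterative fitting) is not shown.

```python
import math


def dl_t2(effects, variances):
    """DerSimonian and Laird estimate of the among-study variance T^2."""
    w = [1.0 / v for v in variances]  # fixed effect weights
    sw = sum(w)
    mean_effect = sum(wi * ei for wi, ei in zip(w, effects)) / sw
    q = sum(wi * (ei - mean_effect) ** 2 for wi, ei in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Negative estimates are truncated to zero
    return max(0.0, (q - df) / c)


# Hypothetical effect sizes and within-study variances
effects = [0.5, 0.8, 0.2, -0.3, 0.1]
variances = [0.04, 0.09, 0.05, 0.06, 0.03]

t2 = dl_t2(effects, variances)
t = math.sqrt(t2)  # among-study standard deviation
# Random effects weights combine within- and among-study variance
re_weights = [1.0 / (v + t2) for v in variances]
print(f"T^2 = {t2:.4f}, T = {t:.4f}")
```

Because (T^2) is added to every study's variance, the random effects weights are more similar to one another than the fixed effect weights are.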
(T)
(T) is the square root of (T^2), and estimates the among-study standard deviation. Like (T^2), (T) estimates variation in true effect sizes, and is on the same scale as them, but has the added benefit of being in the same units as the original effect size measurements. Thus, it is easier to interpret how much variability there is around those effect sizes. However, due to the dependence of (T) and (T^2) on the scale of original effect size measurements (which depends on the effect size metric), it is difficult to compare (T) and (T^2) among meta-analyses using different effect size metrics.
(I^2)
One metric for estimating heterogeneity that removes the scaling issues associated with (T) and (T^2) is the (I^2) statistic. This statistic gives an estimate of heterogeneity on a relative scale, allowing comparisons of heterogeneity across meta-analyses. (I^2) estimates the percentage of total observed heterogeneity that is attributable to real differences in the true effect sizes, and not due to random error. Therefore, it can be used to indicate the percentage of total observed heterogeneity that is attributable to among-study variance, or to the variance among levels of a moderating variable of interest (Higgins et al. 2003). Thus, it can be thought of as: [I^2 = \left( \frac{V_{among}}{V_{total}} \right) \times 100] If (I^2) is zero, it means that all of the observed variation is due to random (e.g., measurement) error and not due to heterogeneity, i.e., variation in the true effect sizes. If it is close to 100%, it indicates that most of the observed variation is due to heterogeneity in effect sizes.
The formula for calculating (I^2) in a random effects model is: [I^2 = \left( \frac{Q - df}{Q} \right) \times 100] The (Q) in the numerator and denominator is (Q_{total}) if the goal is to explore among-study heterogeneity, meaning that (df = k - 1), with (k) = the number of studies in the meta-analysis.
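A small worked sketch of this formula follows. The function name and the example values are hypothetical; negative results, which occur when (Q) is smaller than its degrees of freedom, are set to zero, as is conventional for (I^2).

```python
def i_squared(q, df):
    """Return I^2 as a percentage, truncated at zero."""
    if q <= 0:
        return 0.0
    return max(0.0, 100.0 * (q - df) / q)


# Hypothetical example: Q_total = 20 from k = 5 studies (df = 4)
print(i_squared(20.0, 4))  # (20 - 4) / 20 * 100
# When Q is smaller than df, I^2 is reported as 0
print(i_squared(3.0, 4))
```

In the first case, 16 of the 20 units of (Q) are in excess of what sampling error alone would be expected to contribute, giving an (I^2) of 80%.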
(I^2) is a useful statistic because it depends less on the number of studies in the meta-analysis than (Q) does; however, it can still have low power when sample size is small (REF). Additionally, it does not give information on the absolute value of the heterogeneity in effect sizes, so if this information is useful for the interpretation of results, (T^2) or (T) would be important to report as well.
Summary
There are a variety of ways to evaluate variation in effect sizes in a meta-analysis. (Q) provides a formal statistical test to assess differences in the true effects among studies, or the importance of moderating variables (which can be continuous or categorical) for explaining variation in the observed effect sizes. (T^2), (T), and (I^2) are meant to describe variation in effect sizes, either on an absolute ((T^2), (T)) or relative ((I^2)) scale. These metrics have different benefits and drawbacks, and often reveal slightly different information which can be helpful in explaining the results of a meta-analysis.
After conducting heterogeneity tests, the only remaining steps in a meta-analysis are to plot and interpret the results, and evaluate publication bias. We will address these topics in the next modules. Go to [module 9] to learn more about how to visualize meta-analysis results, and [module 10] to learn how to test for publication bias.
Last updated: 2019, January 29 Author: Amy Briggs
References
- Higgins et al. 2003. Measuring inconsistency in meta-analyses. BMJ 327: 557-560.
- Koricheva, Julia, Jessica Gurevitch, and Kerrie Mengersen. 2013. Handbook of Meta-Analysis in Ecology and Evolution. Princeton University Press. https://muse.jhu.edu/book/41629.
