Statistical models

1 Introduction

There are three main types of statistical models used for meta-analysis: 1) fixed effect models, 2) random effects models, and 3) hierarchical (aka multilevel) models. The type of statistical model you plan to use should be based on characteristics of the primary studies that you use in your meta-analysis, such as consistency in study design and system, non-independence of effect sizes, etc.

In this module, we will describe the simplest forms of the basic models. Specifically, we will consider models with no moderators (i.e., covariates); and with errors that are independent. Later modules will discuss how these basic model types can be modified to incorporate additional complexity. We’ll start with a quick summary of the three basic meta-analysis model types. We’ll then talk about weighting, mean effect sizes, and variance calculations for each type of model.

  • Fixed effect models

    • assumes all effect sizes share a common true mean (so variation among observed effect sizes is due to sampling error, not variation in the true effect size across systems, locations, species, etc.)
    • requires specifying the within-study variance ((V(e_i)))
  • Random effects models

    • assumes observed effect sizes have different true effects
    • the variation among the true effects is quantified by the among-study variance, ((tau^{2}))
    • the variance among the observed effect sizes has two components: the among-study variance ((tau^{2})) and the within-study variance ((V(e_i))). If sample sizes in the original studies were infinitely large, there would be no within-study variance, and the variation among the observed effect sizes would be the same as the among-study variance.
    • the among-study variance is estimated by the model, whereas the within-study variance ((V(e_i))) is taken from the original study (as in the fixed effect model)

  • Hierarchical (multilevel) models

    • assumes that effect sizes can be categorized into a higher organizational level, and thus there is an additional random effect (among-groups and within-groups; the latter is the classic situation in a random effects model with a single group)
    • assumes observed effects within groups are non-independent because observations within groups (e.g., effect sizes extracted from a single paper) tend to be more similar
    • accounts for this non-independence within groups by incorporating random group effects

2 Fixed effect models

In fixed effect (FE) models, we assume that all observed effect sizes share a single true effect size (i.e., the among-study variance, (tau^{2}), equals 0 after accounting for the influence of explanatory variables, i.e., fixed effects) (Fig. 2.1). This means that the difference between any observed effect size and this true effect size is due only to sampling error. This type of model is appropriate when the goal of the meta-analysis is to make statistical inferences about the studies that are in the meta-analysis, but not necessarily about the population of all comparable studies. Fixed effect models are primarily used for meta-analyses containing studies conducted using very standardized experimental designs, the same study species, and the same response variable. As a result, FE models are not common in ecology, because we a priori expect there to be considerable variation in the true effects among studies (i.e., (tau^{2} > 0)). Formal statistical tests exist to assess the strength of the evidence against (tau^{2} = 0), which we explain in the Exploring heterogeneity section.

A fixed effect model can be visualized as in Figure 2.1:


Figure 2.1: A fixed effect model, with three studies (A, B, C). The studies all share the same true effect (blue diamonds), but the expected variance around this effect arising from sampling error (indicated by the blue curve) varies from study to study (e.g., due to differences in the number of replicates). The observed effect differs from the true effect due to sampling error. Orange circles represent the observed effect size in a study, which is a random draw from the blue distribution. Thus, the deviation between the orange circle and the blue diamond in each study arises only from sampling error. The within-study variance observed from a study should, on average, be identical to the blue distribution. However, any single study’s observed within-study variance is only an estimate of the true within-study variance (blue distributions). Fixed effects models assume that the observed and true within-study variances are equal (which is unlikely to be true when replication is low). The bottom panel indicates that there is a single true mean effect (blue diamond), with no variance, but that the estimated mean (orange circle) will deviate from that true value due to the sampling error in each study.

 

2.1 Study weights for FE models

Fixed effect models use the observed effect sizes from primary studies to estimate the true effect size (the mean effect). The mean effect size can be estimated using an unweighted analysis; however, such an approach will be less precise than one that weights the contribution of each observed effect size by the precision of that estimate. The most efficient weight for a fixed effect model is the inverse of the within-study variance (Sánchez-Meca and Marín-Martínez 1998; Marín-Martínez and Sánchez-Meca 2010):

\[
w_i = \frac{1}{V(e_i)}
\tag{2.1}
\]

Because a weighted analysis is generally more powerful than an unweighted analysis, most meta-analysts will extract within-study variances ((V(e_i))) at the same time they extract estimates of effects. When variances aren’t available (or are not desirable, see Hamman et al. 2020), there are other approaches for calculating weights. Examples include weighting by sample sizes to approximate variance-based weights (Hedges XXX), imputing variances (e.g., Kambach et al. 2020, Nakagawa et al. 2023), and using non-parametric weights (e.g., defined by the meta-analyst to represent study quality) combined with resampling methods (REFS).

 

2.2 The mean effect size and error in FE models

After calculating the observed effect size ((e_i)) and the weight ((w_i)) for each primary study, a mean effect size ((bar{E})) can be calculated. This mean effect size is an estimate of the true effect that is shared among studies.

The mean effect size ((bar{E})) is simply the weighted mean of the individual effects:

\[
\bar{E} = \frac{\sum_{i=1}^{k}w_{i}e_{i}}{\sum_{i=1}^{k}w_{i}}
\tag{2.2}
\]

 

and the variance of the mean effect size ((V(bar{E}))) is:

\[
V(\bar{E}) = \frac{1}{\sum_{i=1}^{k}w_{i}}
\tag{2.3}
\]

 

which can be converted to a standard error (SE) or Confidence Interval (CI):

\[
SE_{\bar{E}} = \sqrt{V(\bar{E})}
\tag{2.4}
\]

\[
\text{upper limit} = \bar{E} + 1.96\,(SE_{\bar{E}})
\tag{2.5}
\]

\[
\text{lower limit} = \bar{E} - 1.96\,(SE_{\bar{E}})
\tag{2.6}
\]

To test whether the mean effect size is significantly different from 0, calculate the z-statistic:

\[
z = \frac{\bar{E}}{SE_{\bar{E}}}
\tag{2.7}
\]

and then compare (z) to the standard normal distribution to get a P-value.
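To make these calculations concrete, here is a minimal sketch in base R with made-up effect sizes and within-study variances (the numbers are purely illustrative):

# Illustrative effect sizes and within-study variances for three studies
e_i   <- c(0.40, 0.25, 0.55)
var_i <- c(0.05, 0.02, 0.10)

w_i   <- 1 / var_i                       # Eq. 2.1: inverse-variance weights
E_bar <- sum(w_i * e_i) / sum(w_i)       # Eq. 2.2: weighted mean effect size
V_bar <- 1 / sum(w_i)                    # Eq. 2.3: variance of the mean effect size
SE    <- sqrt(V_bar)                     # Eq. 2.4: standard error
ci    <- E_bar + c(-1.96, 1.96) * SE     # Eqs. 2.5-2.6: 95% CI
z     <- E_bar / SE                      # Eq. 2.7: z-statistic
p     <- 2 * pnorm(-abs(z))              # two-sided P-value from the standard normal

round(c(mean = E_bar, se = SE, lower = ci[1], upper = ci[2], z = z, p = p), 3)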

Fig. 2.2 shows the mean effect size and its 95% CI (Eqns. 2.5-2.6) for three different meta-analyses (A, B, C) that used the log-response ratio as an effect size metric (more on different effect size metrics in the next module [link]). When the CI does not include 0, the null hypothesis that the true effect is 0 can be rejected. In this example, meta-analysis A shows a significant effect, although there is substantial variability around the mean effect estimate. Meta-analysis B also shows a significant effect, but with less uncertainty around the estimated mean effect. The CI in meta-analysis C overlaps zero, so there is no demonstrable effect in that meta-analysis.

 

Figure 2.2: Effect sizes for three different meta-analyses (A, B, C). Points indicate mean effect sizes and error bars represent 95 % CI. The dashed line indicates an effect size of zero (in this case, that the mean of control and treatment are equal), and when a group’s CI does not overlap this line, it indicates the effect size is significantly different than zero.


 

2.3 Coding FE models in R

Fitting a fixed effect model can be done in metafor using the following code:

# fit a fixed effect model
fixdmod <- rma(yi = e_i, vi = var_i,  method = "FE", data = dat)

The method = "FE" argument tells metafor to fit a fixed effect model; yi = refers to the data column with the observed effect sizes, and vi = refers to the column with the variances of the observed effect sizes.
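Assuming dat contains columns named e_i and var_i, the fitted object can then be inspected with standard metafor helpers, for example:

summary(fixdmod)   # pooled estimate, SE, z-value, P-value, and CI
forest(fixdmod)    # forest plot of the individual effects and the pooled effect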

 

3 Random effects models

Random effects (RE) models assume that each study (i.e., each observed effect size) has its own true effect size. Thus, there is a distribution of true effect sizes, which can be characterized by the among-study variance ((tau^{2})) (Fig. 3.1). This contrasts with fixed effect models, in which we assume there is no among-study variance. In ecology, the random effects model is usually recommended over the fixed effect model because of differences in environments, seasons, locations, species, ecosystems, methods, etc. across studies, which could lead to variation in the true effect size. The among-study variance is estimated during the meta-analysis; however, this estimation can be problematic when there are relatively few studies (e.g., < 8).

Random effects models can be visualized as in Figure 3.1:

 


Figure 3.1: Random effects model. Blue diamonds indicate the true effect size, which differs across studies, and the dashed blue line represents the true overall mean effect size, based on the population of all comparable effect sizes. Orange circles represent the observed effect size for each study, with the width of the distributions around each effect size indicating the within-study variance. In this model type, because the true effects vary across studies, the overall mean effect size has a wider distribution compared to the fixed effect model.

3.1 Study weights for RE models

Variances in observed effect sizes in RE models arise from two components: the within-study error variance (i.e., the variance arising from sampling error, as in the FE model) and the among-study variance (assumed to equal 0 in the FE model):

\[V_{i} = \sigma_{i}^{2} + \tau^{2}\]

Where:

  • (sigma_{i}^{2}) is the within-study error variance (i.e., the sampling variance) for the (ith) study, often referred to in the meta-analysis literature simply as the within-study variance. The observed within-study error variance is assumed to equal the true within-study error variance
  • (tau^{2}) is the among-study variance

 

Variation in the true effect sizes ((tau^{2})), i.e., the among-study variance, can be estimated in a variety of ways. In ecology, likelihood-based methods such as restricted maximum likelihood (REML) or maximum likelihood (ML) are the most commonly used approaches, especially for hierarchical models. REML is an efficient estimator with low bias in many contexts, so it is the default method used by meta-analysis software like metafor. Historically, the method of moments, also known as the DerSimonian and Laird method, was almost universally used. It usually does not perform as well as REML, but it is still widely used in some disciplines because it does not require iterative calculations. To help illustrate concepts, we will first consider this method of moments estimator, because we can calculate its value without yet addressing how we will estimate the overall mean. Later we will briefly touch on iterative estimators.

The method of moments (aka DerSimonian & Laird) estimator is:

\[T^{2} = (Q - df)/C\]

where

\[df = k - 1\]

\[
C = \sum_{i=1}^{k}w_i - \biggl(\sum_{i=1}^{k}w_i^2\biggr) \bigg/ \sum_{i=1}^{k}w_i
\tag{3.1}
\]

\[
Q = \sum_{i=1}^{k} w_i e_i^2 - \biggl(\sum_{i=1}^{k} w_i e_i\biggr)^2 \bigg/ \sum_{i=1}^{k} w_i
\tag{3.2}
\]

Where:

  • (T^{2}) is the estimate of the true among-study variance, (tau^{2})
  • (w_i) is the weight for study (i) (based on the within-study error variance; this is the same (w_i) as in the fixed effect model)
  • (k) is the number of studies

Then, assuming that the random effects and observation errors are independent, the estimate of (T^{2}) can be added to the within-study error variance to give the total variance of each observed effect, (V^{*}(e_i) = V(e_i) + T^{2}), which is used to calculate the weight ((w^*_{i})) of each study:

\[
w^*_{i} = \frac{1}{V^{*}(e_i)}
\tag{3.3}
\]

Here, the asterisk in ((w^*_{i})) is used to denote the random-effects version of the inverse-variance weights.
These weights are then used to estimate the mean effect size and its variance (Eqns. 3.4-3.5).
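As an illustration of how the pieces fit together, the sketch below (with the same made-up numbers as the fixed effect example above) computes the DerSimonian & Laird estimate of (T^{2}) and the resulting random-effects weights and pooled mean; a dedicated package such as metafor should be used in practice.

# Illustrative effect sizes and within-study variances
e_i   <- c(0.40, 0.25, 0.55)
var_i <- c(0.05, 0.02, 0.10)
k     <- length(e_i)

w_i <- 1 / var_i                                         # fixed effect weights (Eq. 2.1)
Q   <- sum(w_i * e_i^2) - sum(w_i * e_i)^2 / sum(w_i)    # Eq. 3.2
C   <- sum(w_i) - sum(w_i^2) / sum(w_i)                  # Eq. 3.1
T2  <- max(0, (Q - (k - 1)) / C)                         # DerSimonian & Laird T^2 (truncated at 0)

w_star <- 1 / (var_i + T2)                               # Eq. 3.3: random-effects weights
E_star <- sum(w_star * e_i) / sum(w_star)                # Eq. 3.4: pooled mean effect size
V_star <- 1 / sum(w_star)                                # Eq. 3.5: its variance

round(c(T2 = T2, mean = E_star, se = sqrt(V_star)), 3)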

 

Note about weights in random models:

Inverse-variance weights (as shown above) are typically used in random effects models. However, it is possible to use other weighting schemes such as sample size, but these approaches are most applicable when the alternative weighting metric is reasonably related to the within-study variance. We will discuss the theory behind using non-variance-based weights in greater detail in the Weighting effect sizes module.

 

3.2 The mean effect size and error in RE models

The mean effect size ((bar{E}^*)) as well as the variance of this mean effect ((V(bar{E}^*))) are calculated in the same way for the random effects model as in the fixed effect model, except that the weight has a different meaning (Eq. 3.3 vs. 2.1).

The mean effect size ((bar{E}^*)):

\[
\bar{E}^* = \frac{\sum_{i=1}^{k}w^*_{i}e_{i}}{\sum_{i=1}^{k}w^*_{i}}
\tag{3.4}
\]

The variance of the mean effect size ((V(bar{E}^*))):

\[
V(\bar{E}^*) = \frac{1}{\sum_{i=1}^{k}w^*_{i}}
\tag{3.5}
\]

Recall that in a fixed effect model, the weights ((w_i)) are based only on the within-study variance. However, in a random effects model, the weights ((w_i^*)) incorporate both within- and among-study variances.

The standard error (SE) and 95% confidence interval are obtained from (V(bar{E}^*)) just as in the fixed effect model (Eqns. 2.4-2.6). In the fixed effect model, we used the standard normal (z-) distribution to calculate the CI and the z-statistic. This was appropriate because the fixed effect model assumes the among-study variance is zero, and thus the standard error is calculated only from the within-study error variances, which are assumed to be estimated without error (although this assumption is false, it has little effect on the results).

In a model with random effects, we can no longer assert that the among-study variance is zero. Instead, we estimate it from the (k) studies in the meta-analysis. Thus, in a random effects model, the appropriate distribution to use for calculating CIs and test statistics should be determined by the sample size of the meta-analysis (i.e., the number of studies). Despite this, it is common to see CIs in random effects models based on the z-statistic.

 

CI and test statistics for small sample sizes

Traditional random effects meta-analyses that use the z-distribution rely on the large-sample approximation, which generally requires more than 20 primary studies to be accurate. If the meta-analysis contains fewer than ~20 studies, using a z-distribution will make estimates of the CI biased, causing them to be smaller than they should be and inflating the confidence we have in our results (Pappalardo et al. 2020). Unfortunately, many ecological meta-analyses are based on a relatively small number of studies (Pappalardo et al. 2020). Better approaches attempt to correct for this potential bias. For example, the Knapp-Hartung correction for small-sample bias can be used in random effects models, and is related to the familiar correction in standard statistics using a t-distribution (see Jackson et al. 2017 for more information).

In metafor, the Knapp-Hartung correction can be implemented for a random effects model by specifying test = "knha":

# example code for a random effects model using the Knapp-Hartung correction
rma_mod <- rma(yi = e_i, vi = var_i, mods = ~ fix.eff.1 + fix.eff.2,
               data = dat, test = "knha")

Another method to generate more conservative (and reliable) CIs when you have a small sample size is to perform bootstrapping.
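A rough sketch of the bootstrapping idea is shown below: resample studies with replacement, refit the random effects model to each resample, and take percentiles of the resulting pooled estimates as the CI. This assumes dat has columns e_i and var_i (as above); with very few studies the resamples themselves become unreliable, so treat this as illustrative.

library(metafor)

set.seed(42)
boot_est <- replicate(2000, {
  idx <- sample(nrow(dat), replace = TRUE)      # resample studies with replacement
  fit <- try(rma(yi = e_i, vi = var_i, data = dat[idx, ], method = "REML"),
             silent = TRUE)
  if (inherits(fit, "try-error")) NA else as.numeric(coef(fit))
})
quantile(boot_est, c(0.025, 0.975), na.rm = TRUE)   # percentile bootstrap 95% CI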

3.3 Coding RE models in R

Fitting a model with a random study effect can be done in metafor using the following code (assuming that each study has a separate row in the dataframe):

# fit a random effects model
randmod <- rma(yi = e_i, vi = var_i, method = "REML", data = dat)

Like the FE model, the columns for the observed effect sizes (yi) and their within-study error variances (vi) must be specified. The method argument specifies the type of model (fixed or random) and how (T^2) is estimated (random only). The default method of rma() is to fit a random effects model using REML, but DerSimonian-Laird (method = "DL"), maximum likelihood (method = "ML"), and other methods are available. If no grouping variable is specified, metafor assumes each row represents a unique study and treats it as the unit for the random study effect. See the metafor documentation for specifics.
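For example, to see how sensitive the among-study variance estimate is to the choice of estimator, the same model can be refit with different method arguments and the estimates compared (a sketch assuming the same dat as above):

fit_reml <- rma(yi = e_i, vi = var_i, method = "REML", data = dat)
fit_dl   <- rma(yi = e_i, vi = var_i, method = "DL",   data = dat)

c(REML = fit_reml$tau2, DL = fit_dl$tau2)   # estimated among-study variance under each method
confint(fit_reml)                           # CI for tau^2 (plus I^2 and H^2) from the REML fit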

 

4 Choosing between fixed and random models

Generally, the best practice is to choose the statistical model for your meta-analysis based on the characteristics of the primary studies in your data set, and on whether or not it is reasonable to assume that they all share the same true effect size. (In the latter case, it would be appropriate to use a fixed effect model.) There are formal statistical tests to determine if there is sufficient evidence to reject the hypothesis that (T^2 = 0). These tests of heterogeneity assess whether the observed variation among effects can be accounted for by within-study error alone. Common approaches use (Q) or (I^2) statistics, but Bayesian approaches exist as well.

It should be noted that using a heterogeneity test to choose between a fixed vs. random model type is not recommended by most statisticians. The first reason is that such tests often suffer from low power, meaning that they will often fail to detect heterogeneity in the true effect sizes when it really exists (Borenstein et al. 2009). If the meta-analyst proceeds with a fixed effect model after failing to detect heterogeneity when there actually is a difference in the true effect among studies, then this could bias the model results, usually by artificially enhancing statistical power.

The second major reason against using heterogeneity tests to determine whether the meta-analysis model should be fixed or random is a primarily philosophical one. When the primary studies differ in one or more ecologically or biologically relevant factors, most ecologists would a priori acknowledge that there are likely differences between studies that would contribute to differences in the true effect sizes among them, even in the absence of data. For example, it would be reasonable to assume that the effects of an invasive plant species on a native competitor might differ in studies conducted in different seasons or in different locations (which may differ in a variety of factors that could influence the strength of the interaction between the species).

Therefore, although tests of heterogeneity can be valuable (particularly because they include estimates of the partitioning of variance to within and among studies), they should not typically be used to determine the model type (see Koricheva, Gurevitch, and Mengersen 2013, pg. 105; and Borenstein et al. 2009, pg. 84, for more details on this controversy). See our next section on Exploring heterogeneity to learn more about how to perform these analyses.

 

5 Non-independence

Prior to discussing the third general type of meta-analysis model (hierarchical models), it is helpful to introduce the concept of non-independence. There are many types of non-independence that commonly occur in ecological meta-analyses, including temporal, spatial, and taxonomic non-independence, as well as non-independence related to experimental or study design. Despite the ubiquity of non-independence within ecological studies, meta-analyses frequently do not account for it. Hurlbert (1984) referred to this general type of unaccounted-for non-independence as pseudoreplication, and many researchers and statisticians have discussed how it can undermine statistical inferences, including in meta-analysis. We next describe a few of these types of non-independence in the context of meta-analysis and discuss some methods to address them.

5.1 Non-independence due to study design

Ecological meta-analyses frequently contain multiple effect sizes collected from a single experiment in the same paper, or contain a large number of studies from a single taxon or from the same investigators (Noble et al. 2017). For example, a primary study might measure multiple response variables (traits) for each study organism (Fig. ??a). Traits measured on the same individual could be correlated, and therefore the effect sizes for the two traits are not totally independent. Similarly, effect sizes are non-independent in primary studies in which multiple treatments share the same control (Fig. ??b). Similar issues arise when an experiment is sampled repeatedly through time (potentially yielding an effect size for each time point) or when an experiment involves crossed factors (yielding multiple effect sizes under different levels of another factor). Even different experiments from the same paper, or effects from different papers, could be non-independent, as they could have used similar methods, locations, equipment, observers, etc. that could influence the reported effect sizes (Fig. ??c).

 

5.2 Taxonomic non-independence

Effect sizes from the same taxon are expected to be more similar to one another than effects measured on more distantly related taxa. More generally, taxa with more shared evolutionary history (i.e., closer phylogenetic relatedness) would be expected to have a stronger positive correlation between their observed effect sizes (Fig. ??).


5.3 Methods to address non-independence

We typically assume that errors are independent, but there can be cases where this assumption needs to be modified. For example, some observed effects might be associated with studies of closely related organisms, and these studies might have effects that are more similar than studies of less related organisms. The resulting positive correlation in the errors for studies done with related organisms would be a form of non-independence. Different approaches have been taken to tackle possible non-independence. In some cases, the meta-analyst might discard some effect sizes (e.g., choosing to retain only the final time point from an experiment) or might do a preliminary analysis to collapse multiple non-independent effects into a single estimate (Song et al. 2020). It is not, however, necessary to discard data; there are better approaches, and the best approach depends on the nature of the non-independence. In general, we can recognize two forms of non-independence stemming from correlated errors, depending upon the source of the correlation:

Correlated observation errors (e.g., arising from a shared control). This non-independence arises because effects are measured imprecisely and the same measurement errors influence more than one observed effect size (e.g., the errors in estimating the control mean are common to the two effects when you compare two treatments to the same control). These problems are best solved by specifying the covariances among observed effects based on particular features of the studies (e.g., whether two observed effect sizes are calculated from a shared control mean).

Correlated among-study errors. Another form of non-independence arises even when there is no uncertainty in measuring effects: the true effects themselves are correlated. This is handled by specifying a form for the covariances among the random effects (e.g., by using a matrix of phylogenetic distances and assuming correlations are related to distance; see Lajeunesse (2009) or Nakagawa and Santos (2012)), or by specifying groups of observations assumed to be equally correlated (and more correlated than with observations from other groups) using a hierarchical model.
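As a sketch of the first case, suppose the first two effect sizes in a data set were computed against a shared control, so their sampling errors covary, while a third comes from an independent experiment. The full variance-covariance matrix of the observed effects can be built and passed to rma.mv() through its V argument. The numbers, including the covariance, are purely illustrative; in practice the covariance is derived from the effect size formula and the shared control's summary statistics.

library(metafor)

dat_ex <- data.frame(id = 1:3,
                     e_i = c(0.42, 0.31, 0.55),    # observed effect sizes (illustrative)
                     var_i = c(0.04, 0.05, 0.03))  # within-study variances (illustrative)

V <- diag(dat_ex$var_i)          # start with a diagonal variance-covariance matrix
V[1, 2] <- V[2, 1] <- 0.02       # assumed covariance between effects 1 and 2 (shared control)

cov_mod <- rma.mv(yi = e_i, V = V, random = ~ 1 | id, method = "REML", data = dat_ex)
summary(cov_mod)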

6 Hierarchical (multilevel) models

Hierarchical models can be viewed as an extension of the random effects model, but with the variation in effect sizes partitioned into additional levels. For example, among-study variability could be viewed as being due to a component common to multiple studies published in the same paper and to variation among papers. In this case, studies from the same paper constitute a group and are expected to be more similar to one another (non-independent) than to other studies. This is incorporated into the analysis by including a random group effect (in addition to the variation among studies within a group). This basic idea can be extended to multiple levels; for example, studies used in a meta-analysis could be categorized into groups and, within groups, into subgroups, with the model including random group and subgroup effects.

See Fig. 4 in Nakagawa et al. 2017 for a graphical visualization of a hierarchical model.

When non-independence due to similarity of studies within groups is not explicitly considered, it can cause the uncertainty (i.e., variance) around an effect size estimate to be biased (typically underestimated), which can influence statistical inferences about the effect (e.g., its significance). In fact, a simulation study by Song et al. (2020) found that ignoring non-independence can inflate Type I error rates of meta-analyses by more than 70%.

 

6.1 Coding hierarchical models in R

A hierarchical model that accounts for the correlation between observed effect sizes within a paper can be fit in metafor this way:

multilevel_mod <- rma.mv(yi = e_i, V = var_i, random = list(~1 | paper, ~1 | ID),
    method = "REML", data = dat)

In this model, the random effect of study ID is nested within the random paper effect. This is similar to how simple random effects models are coded, except hierarchical models require the rma.mv() function instead of rma().

 

To specify a known correlation structure for one of the random factors in a hierarchical model (such as due to the phylogeny of species in the meta-analysis), use the following code:

multilevel_mod2 <- rma.mv(yi = e_i, V = var_i, random = list(~1 | paper, ~1 | ID),
                          R = list(ID = cor_mtrx), method = "REML", data = dat)

ID is a unique identifier for each effect size and cor_mtrx is a correlation matrix for the random grouping variable ID. This correlation matrix would be calculated based on, say, phylogenetic relatedness (e.g., see Nakagawa and Santos 2012). See the metafor documentation for more details on implementing models with pre-specified correlations among random group effects.
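For example, if the grouping variable were species rather than individual effect sizes, a phylogenetic correlation matrix could be computed from a tree with the ape package and supplied through R. This is a sketch; the tree file name and the species column are hypothetical, and the row and column names of the matrix must match the levels of the grouping variable.

library(ape)
library(metafor)

tree     <- read.tree("my_tree.nwk")      # hypothetical Newick tree for the study species
cor_mtrx <- vcv(tree, corr = TRUE)        # phylogenetic correlation matrix among species

phylo_mod <- rma.mv(yi = e_i, V = var_i,
                    random = list(~ 1 | paper, ~ 1 | species),
                    R = list(species = cor_mtrx),
                    method = "REML", data = dat)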

 

6.2 Choosing between hierarchical and random effects models

Because of the strong effects that non-independence can have on meta-analysis results, we recommend accounting for it whenever possible in the meta-analysis model and performing sensitivity analyses to evaluate its effects on model estimates.

One method to evaluate the importance of non-independence on meta-analysis results is to compare models with and without a random group effect that accounts for potential non-independence within groups. For example, the analyst could fit models with all relevant fixed effects and then compare among various candidate random-effect structures. Ultimately, the choice a meta-analyst makes should be a combination of judgement (prior knowledge of potential sources of variation) and goodness of fit: does the model make sense and do the data support it?
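A sketch of such a comparison (assuming the same dat and column names as above): fit the model with and without the paper-level random effect, then compare fit statistics or use a likelihood ratio test. Because the fixed effects are identical in both models, the REML fits can be compared directly.

library(metafor)

# model with paper-level and observation-level (ID) random effects
mod_full <- rma.mv(yi = e_i, V = var_i, random = list(~ 1 | paper, ~ 1 | ID),
                   method = "REML", data = dat)

# reduced model without the paper-level random effect
mod_red  <- rma.mv(yi = e_i, V = var_i, random = ~ 1 | ID,
                   method = "REML", data = dat)

fitstats(mod_full, mod_red)   # log-likelihood, AIC, BIC for each model
anova(mod_full, mod_red)      # likelihood ratio test of the paper-level variance component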

 

7 Next steps: Explore heterogeneity among groups

After determining the type of meta-analysis model that is most appropriate for the question and the data, and calculating the mean effect size and its variance, you can explore sources of heterogeneity in the effect sizes across studies, i.e., evaluate the importance of specific fixed factor covariates in your data. Continue to the next section on Exploring heterogeneity to learn how to do this.

 


Last updated: 2023-04-10

 

8 Useful resources for metafor:

 

9 References

Borenstein, Michael, Larry V Hedges, Julian PT Higgins, and Hannah R Rothstein. 2009. Introduction to Meta-Analysis. John Wiley & Sons, Ltd. 

Hedges, L. V. & Olkin, I. (1985). Statistical Methods for Meta-Analysis. Orlando, FL: Academic Press

Jackson, D., Law, M., Rücker, G. & Schwarzer, G. 2017. The Hartung-Knapp modification for random-effects meta-analysis: A useful refinement but are there any residual concerns? Statistics in Medicine 36, 3923–3934.

Koricheva, Julia, Jessica Gurevitch, and Kerrie Mengersen. 2013. Handbook of Meta-Analysis in Ecology and Evolution. Princeton University Press. https://muse.jhu.edu/book/41629 .

Lajeunesse, M. J. 2009. Meta-analysis and the comparative phylogenetic method. American Naturalist 174:369–381. doi:10.1086/603628.

Marín-Martínez, F., and J. Sánchez-Meca. 2010. Weighting by inverse variance or by sample size in random-effects meta-analysis. Educational and Psychological Measurement 70:56–73.

Nakagawa, S., and E. S. Santos. 2012. Methodological issues and advances in biological meta-analysis. Evolutionary Ecology 26:1253–1274.

Nakagawa, S., D. W. A. Noble, A. M. Senior, and M. Lagisz. 2017. Meta-evaluation of meta-analysis: ten appraisal questions for biologists. BMC Biology 15:18. doi:10.1186/s12915-017-0357-7.

Noble, D. W. A., et al. 2017. Nonindependence and sensitivity analyses in ecological and evolutionary meta-analyses. Molecular Ecology 26:2410–2425.

Pappalardo, P., et al. 2020. Comparing traditional and Bayesian approaches to ecological meta-analysis. Methods in Ecology and Evolution 11:1286–1295. doi:10.1111/2041-210X.13445.

Sánchez-Meca, J., and F. Marín-Martínez. 1998. Weighting by inverse variance or by sample size in meta-analysis: A simulation study. Educational and Psychological Measurement 58:211–220.

Song, C., S. D. Peacor, C. W. Osenberg, and J. R. Bence. 2020. An assessment of statistical methods for non-independent data in ecological meta-analyses. Ecology 101(12):e03184. doi:10.1002/ecy.3184.











