Data Collection: Systematic search and data extraction
1 Introduction
The goal of meta-analysis is to provide a quantitative summary of the available evidence on a topic. This summary can be used to make broad generalizations about an ecological process or effect, as well as to identify specific variables that can influence the size of the effect. Because of this, meta-analysis results can influence accepted theory and the focus of future research within a field of science. Therefore, the data used for meta-analysis should be collected in a systematic and reproducible way from the available scientific literature.
Data collection for meta-analysis has the following steps, which we will describe in detail in this module:
- Systematic literature search (includes preliminary & “final” searches)
- Paper screening:
  - Preliminary paper screening (abstracts & titles)
  - Secondary screening (full papers)
- Data extraction
- Data preparation:
  - Unit and measurement type conversions
  - QA/QC
  - Effect size and within-study variance calculations (see the Effect sizes module)
- Documentation to ensure reproducibility
2 Systematic literature search
After determining the specific research question or questions that you plan to investigate using meta-analysis, you should consider what types of data you will need to collect. For example, what types of primary studies address your research question? What types of data will you extract from them? And what analyses will you need to perform to test those questions?
After answering these questions, you can begin to design your systematic search of the available literature on a specific research area or topic. A systematic search is a formal, reproducible approach to finding studies relevant to a specific research question. Decisions made during the literature search influence which studies you find, and therefore influence the meta-analysis results. It is important to thoroughly document all search procedures, such as where, when, and how you perform your search.
2.1 Data sources
There are a variety of commonly used sources to find studies for meta-analysis. First, there are the published studies, which are commonly found using a database search (more on how to do that later). However, for some research questions, unpublished “grey” literature, such as technical reports, “white papers”, etc., is an important source of data. This type of literature can be found in some databases, on the websites for different governmental agencies, academic institutions, or other organizations.
Unpublished data may also be valuable and necessary to answer specific research questions. Although this type of data is the most difficult to acquire, it can be obtained by contacting the authors of published papers or reports that mention a specific dataset, or that use data derived from another dataset that could be used by the meta-analyst. Additionally, some funding agencies have study registries that can be searched for specific topics. After searching these registries, the meta-analyst can contact the study investigators and request unpublished data for use in their meta-analysis. (Study registries are more common in medical fields and through funding agencies like the NIH.)
2.2 Database searches
We will now discuss the most common way ecologists find relevant literature to use in meta-analyses: through database searches. Frequently, literature searches go through several stages. First, the analyst conducts a preliminary literature search (or searches), which is also known as “scoping” the literature. This helps to characterize the size and nature of the literature on a subject, and can be used to determine which database(s) to use and to hone search terms so that they return as many relevant studies as possible. After this scoping stage, the meta-analyst can do the “official”, or final, search.
We will discuss common databases and procedures for performing systematic database searches below.
2.2.1 Common databases
There are many databases that can be used to search for appropriate literature for a meta-analysis. One of the largest databases of the scientific literature is Web of Science, which is commonly used in ecological and evolutionary meta-analyses. Science Direct, Scopus, AGRICOLA, and other subject-specific databases can also be useful for ecologists. Additionally, academic databases for dissertations and theses can be useful for finding applicable, unpublished data. One of these databases, Dissertation Abstracts Online, provides records of dissertations from students at US institutions dating back to 1861.
Google Scholar is an alternative to these academic databases. It can be useful for finding more non-academic or unpublished literature. However, this benefit comes with tradeoffs. First, because Google Scholar often provides many more search returns than other databases, the meta-analyst must sift through more records, many of which are irrelevant or not usable for meta-analysis. Additionally, searches on Google Scholar are not necessarily as reproducible as those on academic databases, due to Google’s changes in the availability and ranking of information for different individuals and geographic regions. Thus, the meta-analyst should weigh the benefits of finding more obscure studies and data against the potential drawbacks of more irrelevant returns and a more opaque search algorithm before using Google Scholar for their systematic search.
2.2.2 Search terms
The next step in performing a systematic search of the literature is to define a set of specific search terms (keywords, logical operators, and modifiers) that a database can use to find relevant primary studies. Search terms should be specific enough to target appropriate, usable literature, but not so specific that they exclude many relevant articles. Thus, terms should be designed to balance the tradeoff between maximizing the number of relevant articles found by the database and minimizing the proportion of non-relevant articles returned.
Keywords are words or parts of words that are likely to be used in the title, abstract, or keywords of articles containing data that is relevant to the meta-analysis research question.
Logical operators modify how keywords are used in the search and can be used to refine searches and narrow (or widen) results to more relevant articles. For example:
- OR (Example: facilitation OR mutualism)
- AND (Example: turtle AND tortoise)
- NOT (Example: coral NOT octocoral)
Note that operators can vary by database, so when in doubt, check the database-specific instructions for specifying them.
Modifiers can also be used to expand or refine search results. One useful modifier is the asterisk (*). An asterisk after a series of letters searches the database for records with words containing the base word (or letters) before the asterisk. For example, searching “compet*” yields results including the words “competition”, “compete”, “competitive”, etc.
Quotation marks (" ") will return only the exact terms used inside the quotation marks (e.g., “marine” will return results with the word “marine” but not “marines”).
Parentheses ( ) can be used with keywords and logical operators to define specific logical conditions.
Note that, as in mathematics, the order of operations matters when using logical operators and modifiers in database searches.
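To make the roles of keywords, operators, modifiers, and parentheses concrete, here is a minimal Python sketch that assembles a search string from groups of synonyms. The function name `build_query` and the keyword groups are hypothetical, and the operator syntax a database accepts may differ, so check the resulting string against the database's own instructions.

```python
def build_query(keyword_groups):
    """Join synonyms within each group with OR, then join the groups with AND.

    Wrapping every group in parentheses makes the order of operations
    explicit, so the query does not rely on a database's default precedence.
    """
    clauses = ["(" + " OR ".join(terms) + ")" for terms in keyword_groups]
    return "(" + " AND ".join(clauses) + ")"


# Hypothetical keyword groups; "*" is the wildcard modifier.
keyword_groups = [
    ["*alga*", "seaweed"],
    ["coral*"],
    ["compet*", "interact*"],
]

query = build_query(keyword_groups)
print(query)
```

Keeping the keyword groups in a script (rather than typing the query by hand each time) also gives you an exact record of the search terms for your methods.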
2.2.3 Search procedures
To conduct a literature search on an academic database, go to your institution’s database list and search for the database you would like to use (e.g., “Web of Science”) or go directly to the webpage for that database, and follow the instructions to login.
Next, input your search terms in the search bar. For example:
((*alga* OR seaweed) AND (coral*) AND (compet* OR interact*))
You may also define a search timespan, such as all records from the year 2000 to the present (the default is the entire time frame in the database). Next, you can select which part of the study records should be searched. Options include “Topic”, which searches the study title, keywords, and abstract, and is a good default; “Title”, which searches only the title; and “Author”, which searches the author list. If you would like to narrow your results further, there are also options to refine results to specific document types (e.g., reviews, book chapters, and primary research articles), or to specific subject areas (e.g., environmental sciences) or journals within the database.
After entering all your search settings and clicking “search”, you can skim the search results to make sure that relevant articles are returned. If necessary, you can modify your search procedure (search terms, database used, etc.) to improve the quality and quantity of the search returns. This is known as scoping, or conducting a preliminary literature search. Then, once you are satisfied with the returns from your search, you can conduct your final, “official” search, and export the records for all the returns to a .csv, Excel, or text file (Fig. 2.1).
Figure 2.1: Selecting a file type for exporting search returns from a Web of Science database search.
We generally recommend exporting the Author, Title, Source, and Abstract (Fig. 2.2), which aids in the first screening step. Some databases have limits on the number of returns that can be exported at one time (Web of Science exports 1000 at a time), so you may have to do it in batches.
Figure 2.2: Selecting what content to export from each record in a Web of Science database search.
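If the number of returns exceeds the export limit, it can help to work out the batch ranges ahead of time. The following Python sketch does this; the function name `export_batches` and the total of 2350 records are hypothetical, and the 1000-record limit is the Web of Science figure mentioned above.

```python
def export_batches(n_records, batch_size=1000):
    """Return (first, last) record numbers, 1-indexed and inclusive, per batch."""
    return [
        (start, min(start + batch_size - 1, n_records))
        for start in range(1, n_records + 1, batch_size)
    ]


# Hypothetical search that returned 2350 records.
batches = export_batches(2350)
print(batches)
```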
After exporting your final search results, record your search procedures, including the exact search terms used, database searched, date searched, and any other criteria for refining the search, like article type or subject category. You are now ready to begin screening papers.
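One simple way to record these details is a small structured log saved alongside the exported records. The sketch below uses Python's standard `json` module; the field names, the date, and the refinement values are all hypothetical placeholders, not a required format.

```python
import json
from datetime import date

# All values below are illustrative placeholders; replace them with your own.
search_log = {
    "database": "Web of Science",
    "search_terms": "((*alga* OR seaweed) AND (coral*) AND (compet* OR interact*))",
    "search_field": "Topic",
    "timespan": "2000 to present",
    "date_searched": date(2024, 1, 15).isoformat(),
    "refinements": {"document_types": ["primary research articles"]},
    "n_records_returned": None,  # fill in after running the final search
}

log_text = json.dumps(search_log, indent=2)
print(log_text)
```

Writing `log_text` to a file in the project folder keeps the search record next to the data it produced.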
3 Paper screening
After the literature search, the meta-analyst must screen the search returns to find which primary studies can be used in the meta-analysis. This is typically done in two steps. First, a preliminary screening, in which the meta-analyst reviews the titles, abstracts, and keywords of each record returned by the search to identify obviously non-relevant, non-usable articles. If you exported the search results to an Excel or csv file, this can be done by going through individual records (rows) in this file. Another useful tool for this preliminary screening step is the R package metagear (Lajeunesse 2016).
Metagear allows the meta-analyst to:
- Assign unique IDs to all search returns
- Remove duplicate records
- Divide records among multiple reviewers
- Screen abstract, title, and keywords of each record using a graphical user interface (Fig. 3.1)
- Combine efforts of all reviewers (if more than 1) into a single screening results file
- Download papers that pass this preliminary screening step using DOIs for use in the second screening step
Figure 3.1: Graphical user interface for screening abstracts and titles using metagear.
We have example code here to show you how to use metagear for screening, and more details can be found on the metagear website.
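To illustrate the bookkeeping behind the first three tasks in the list above (unique IDs, duplicate removal, and dividing records among reviewers), here is a minimal Python sketch. It is not metagear's implementation; the function names, record fields, reviewer names, and the title-matching rule are all assumptions made for illustration.

```python
import re


def normalize_title(title):
    """Lowercase and collapse punctuation so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def prepare_records(records, reviewers):
    """Drop duplicate titles, assign unique IDs, and split records among reviewers."""
    seen = set()
    prepared = []
    for record in records:
        key = normalize_title(record["title"])
        if key in seen:
            continue  # duplicate of an earlier record
        seen.add(key)
        study_id = len(prepared) + 1
        prepared.append({
            "study_id": study_id,
            "title": record["title"],
            # Round-robin assignment keeps reviewer workloads balanced.
            "reviewer": reviewers[(study_id - 1) % len(reviewers)],
            "decision": "unscreened",
        })
    return prepared


# Hypothetical search returns; the second is a duplicate with different formatting.
records = [
    {"title": "Coral-algal competition on a reef"},
    {"title": "Coral-Algal Competition on a Reef."},
    {"title": "Seaweed effects on coral growth"},
]
prepared = prepare_records(records, reviewers=["reviewer_A", "reviewer_B"])
for item in prepared:
    print(item)
```

Matching on normalized titles is a deliberately simple rule; matching on DOIs, where available, is usually more reliable.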
After the preliminary paper screening step is the more time-intensive secondary screening. At this stage, the meta-analyst must skim whole papers and use pre-defined eligibility criteria to remove studies that are not relevant to the meta-analysis. Eligibility criteria are essentially rules for deciding which papers (and experiments or studies within papers) to include in the meta-analysis. Examples of common eligibility criteria include:
- Study or experimental design
- Location
- Species
- Study duration (e.g., is the study measuring transient or equilibrium dynamics for the process of interest?)
Eligibility criteria should be chosen through careful consideration of the ecological process of interest and the specific goals of the meta-analysis. Additionally, because they strongly influence which studies will go into the final meta-analysis (and thus, influence its results), they must be applied systematically and consistently to all potential studies, and they must be reported and justified in the methods of the meta-analysis. If evaluating whether a paper meets the eligibility criteria could potentially require some subjective judgment by the reviewer, then each paper could be screened by multiple reviewers.
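One way to apply eligibility criteria consistently is to write each one down as an explicit rule and record which rules a study fails. The Python sketch below shows the idea; the criteria, field names, and example studies are hypothetical and would be replaced by those justified for your own meta-analysis.

```python
def screen_study(study, criteria):
    """Return (eligible, reasons), where reasons lists every criterion the study fails."""
    reasons = [name for name, rule in criteria.items() if not rule(study)]
    return (len(reasons) == 0, reasons)


# Hypothetical eligibility criteria, each expressed as a named rule.
criteria = {
    "manipulative experiment": lambda s: s["design"] == "manipulative experiment",
    "duration >= 30 days": lambda s: s["duration_days"] >= 30,
    "reports mean, n, and error": lambda s: s["reports_summary_stats"],
}

# Hypothetical studies that passed the preliminary screening.
studies = [
    {"paper_id": 1, "design": "manipulative experiment",
     "duration_days": 60, "reports_summary_stats": True},
    {"paper_id": 2, "design": "observational survey",
     "duration_days": 10, "reports_summary_stats": True},
]

decisions = {s["paper_id"]: screen_study(s, criteria) for s in studies}
print(decisions)
```

Recording the reasons for exclusion, not just the decision, makes it easier to report the number of papers removed by each criterion.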
After this secondary screening step, only papers containing data that can be used in the meta-analysis will remain, and you are ready to begin extracting data. However, before starting that, you should record several details about your literature search to make it as reproducible as possible (just like recording the conditions of an experiment).
4 What to record for a reproducible literature search
- Database(s) used and any other methods for acquiring literature
- Search terms
- Timespan for search
- Date search was performed
- Number of records/papers returned by the search
- Number of papers removed by each screening step
- Eligibility criteria used to include/exclude studies
- Final no. of papers remaining after both screening steps
This information should be provided in the methods or an appendix in the final publication or report. A common format used by many researchers to summarize much of this information is the PRISMA diagram (Fig. 4.1), which stands for Preferred Reporting Items for Systematic reviews and Meta-Analyses (Moher et al. 2009).
Figure 4.1: Basic PRISMA diagram outlining items from systematic search and screening procedures that should be reported for meta-analyses and systematic reviews. (From Moher et al. 2009.) Extensions of this reporting system for the fields of ecology and evolution are available in O’Dea et al. 2021.
PRISMA has been extended by O’Dea et al. (2021) for the fields of ecology and evolution (PRISMA-EcoEvo v. 1.0). Their Table 1 outlines additional information that should be reported to facilitate reproducibility, interpretation, and future use of data from systematic reviews and meta-analyses in ecology and evolution. Additional items that should be reported include equations (or references to equations) for the effect size metric and its variance, the type of meta-analytic model used, the software used for statistical analyses, non-independence in the data and how this was accounted for in analyses, etc.
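The counts reported in a PRISMA diagram should add up from one step to the next. The following Python sketch checks that arithmetic; the function name, the dictionary keys, and the example numbers are hypothetical, and the steps shown are a simplified version of the full diagram.

```python
def prisma_counts(n_identified, n_duplicates, n_excluded_abstract, n_excluded_full_text):
    """Derive the number of records remaining after each screening step."""
    n_screened = n_identified - n_duplicates
    n_full_text = n_screened - n_excluded_abstract
    n_included = n_full_text - n_excluded_full_text
    if min(n_screened, n_full_text, n_included) < 0:
        raise ValueError("More records were removed than were available at some step")
    return {
        "records identified": n_identified,
        "records after duplicates removed": n_screened,
        "full-text papers assessed": n_full_text,
        "papers included": n_included,
    }


# Hypothetical numbers for illustration only.
flow = prisma_counts(
    n_identified=1200,
    n_duplicates=150,
    n_excluded_abstract=900,
    n_excluded_full_text=110,
)
for step, count in flow.items():
    print(f"{step}: {count}")
```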
5 Data extraction
Once you have performed both screening steps on a primary source that was found through the systematic literature search, you can start extracting data from it. (Alternatively, you can wait to extract data until you have performed both screening steps on all the primary sources returned from the systematic literature search.) Data extraction involves setting up a datasheet and then using digital extraction software to acquire the necessary data from each primary source.
5.1 Setting up a datasheet
The first step in data extraction is to prepare your datasheet, based on the type of data you will be extracting to calculate your effect sizes and any other additional variables for your analyses. Data is ideally put into a “tidy” format:
- Each observation has a row (e.g., studies from which effect sizes will be calculated)
- Each variable forms a column
- 1 piece of information per cell
Additionally, you should create a separate meta-data file that describes and explains:
- Each column variable
- The categories (levels) within variables
- The units of measurement for each variable
- Definitions for any abbreviations
- Any additional information that clarifies or provides context to the meta-analysis
Keeping data in a tidy format and providing a clear meta-data file is helpful to maintain consistency and accuracy when multiple people are collaborating on a meta-analysis and extracting data. This organization also helps other researchers who may use your data in the future.
It is useful to set up a datasheet with the following format before extracting data from the primary literature:
- 1 row for each effect size, with separate columns for the treatment groups being compared (e.g., the control and experimental group) in the effect size calculation
- Separate columns for the means, n, and error for each group (3 variables x 2 groups = 6 columns)
- Column defining the response variable
- Column defining error type (e.g., CI, SE, or SD)
- Separate columns for the units of each extracted variable
- Record source of data:
  - Paper ID
  - Data location: figure, table, or text section that data comes from
- Columns for covariates of interest
- Notes
- Optional:
  - Person extracting data
  - Reference for paper (or have a separate file linking PaperID to its corresponding reference)
To see an illustration of this format, here is a basic template for a data extraction datasheet, with example values filled in for some of the cells:
PaperID | data_location | response_var | response_units | mean_trtmt | mean_control | n_trtmt | n_control | error_trtmt | error_control | error_type | covar_temp | covar_units | notes | reference |
1 | Fig 1 | photosynthesis | mg O2 cm^2 h^-1 | 6.7 | 8 | 19 | 20 | 1.1 | 1.2 | SD | 25 | degrees C | … | Gao et al. 2017 |
1 | Fig 2 | … | … | … | … | … | … | … | … | … | 30 | degrees C | … | Gao et al. 2017 |
2 | Table 2 | … | … | … | … | … | … | … | … | … | … | … | … | Briggs et al. 2019 |
3 | Fig 2a | … | … | … | … | … | … | … | … | … | … | … | … | Rosales et al. 2010 |
3 | Fig 2b | … | … | … | … | … | … | … | … | … | … | … | … | … |
3 | Fig 2c | … | … | … | … | … | … | … | … | … | … | … | … | … |
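The template above can also be set up in a script, which makes it easy to confirm that every column is described in the meta-data. Below is a minimal Python sketch using the standard `csv` module; the column names follow the template, the filled-in row repeats its first example row, and the meta-data descriptions are illustrative wording only.

```python
import csv
import io

# Column names follow the datasheet template.
COLUMNS = [
    "PaperID", "data_location", "response_var", "response_units",
    "mean_trtmt", "mean_control", "n_trtmt", "n_control",
    "error_trtmt", "error_control", "error_type",
    "covar_temp", "covar_units", "notes", "reference",
]

# Meta-data: one description per column (wording is illustrative).
METADATA = {
    "PaperID": "Unique ID of the source paper",
    "data_location": "Figure, table, or text section the data come from",
    "response_var": "Response variable measured",
    "response_units": "Units of the response variable",
    "mean_trtmt": "Mean of the treatment group",
    "mean_control": "Mean of the control group",
    "n_trtmt": "Sample size of the treatment group",
    "n_control": "Sample size of the control group",
    "error_trtmt": "Reported error of the treatment group",
    "error_control": "Reported error of the control group",
    "error_type": "Type of reported error: CI, SE, or SD",
    "covar_temp": "Temperature covariate",
    "covar_units": "Units of the covariate",
    "notes": "Free-text notes from the person extracting data",
    "reference": "Short reference for the source paper",
}

# Every column in the datasheet should be documented in the meta-data.
undocumented = [col for col in COLUMNS if col not in METADATA]

# One row per effect size; values repeat the first example row of the template.
row = {
    "PaperID": 1, "data_location": "Fig 1", "response_var": "photosynthesis",
    "response_units": "mg O2 cm^2 h^-1", "mean_trtmt": 6.7, "mean_control": 8,
    "n_trtmt": 19, "n_control": 20, "error_trtmt": 1.1, "error_control": 1.2,
    "error_type": "SD", "covar_temp": 25, "covar_units": "degrees C",
    "notes": "", "reference": "Gao et al. 2017",
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```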
5.2 How to extract data
Data can be pulled directly from tables, text, or supplementary datasheets in the manuscript, and entered in the meta-analysis datasheet. If some of the necessary data (e.g., means, error, etc.) are only available in figures, then you can use digital extraction software to estimate those values. There are many extraction software options, many of which are free (see the list below).
- Plot Digitizer (PC & Mac)
- ImageJ (PC & Mac)
- GraphClick (Mac; no longer maintained)
- Metagear package
- Datathief
- WebPlotDigitizer
- …and many more
However, not all of the free options are maintained and updated for new operating systems. See their websites for details.
We’ll show you how to use Plot Digitizer, which operates similarly to other extraction software, and can be downloaded for free here.
Extraction steps in Plot Digitizer:
- Save figure that you want to extract data from as a jpeg, gif, or png
- Open the figure in Plot Digitizer
- Click on the Calibrate button at top of window with figure
- Set the scale by clicking on the minimum and maximum point of the x- and y-axes and enter their corresponding values as prompted (Fig. 5.1)
- To begin extracting data, click the Digitize button at top of window
- Click on the data point, mean, or error bar (upper or lower limit) that you want to extract from the figure
- Continue clicking on features that you want to extract
- When done, click Done at the top of the window (you can do this for 1 or multiple groups at a time)
- Digitized Points window should open with x,y coordinates (Fig. 5.2)
- Enter values in the appropriate columns in the meta-analysis spreadsheet that you created
Figure 5.1: Calibrating axes in Plot Digitizer. Click on the axis minimum or maximum in the figure, and then manually enter the value associated with that point.
Figure 5.2: Plot Digitizer window showing mean and error values extracted for two treatment groups in the figure (the control and algae exposed groups for D. menstrualis). Note that for categorical x-axis variables, you must keep track of the order that points were selected for extraction.
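The calibration step works because, on a linear axis, two known points are enough to convert any pixel position into a data value. The Python sketch below shows that conversion; it is not Plot Digitizer's own code, the function name and pixel positions are hypothetical, and it assumes a linear (not logarithmic) axis.

```python
def make_axis_scale(pixel_at_min, pixel_at_max, value_min, value_max):
    """Return a function that maps a pixel coordinate to a value on a linear axis."""
    slope = (value_max - value_min) / (pixel_at_max - pixel_at_min)
    return lambda pixel: value_min + slope * (pixel - pixel_at_min)


# Hypothetical calibration of a y-axis: the axis value 0 sits at pixel row 400
# and the axis value 10 at pixel row 50 (image rows are counted from the top).
to_y = make_axis_scale(pixel_at_min=400, pixel_at_max=50, value_min=0.0, value_max=10.0)

mean_y = to_y(165.5)    # pixel row clicked at the group mean
upper_y = to_y(127.0)   # pixel row clicked at the top of the error bar
error = upper_y - mean_y

print(round(mean_y, 2), round(error, 2))
```

The error is taken as the distance between the top of the error bar and the mean; what that distance represents (SD, SE, or CI) still has to be read from the figure caption or methods of the primary study.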
6 Prepare data for analysis
After extracting and entering the data into a spreadsheet, it must be prepared for analysis. This includes performing any necessary calculations to convert measurements from different studies to the same dimensions (e.g., measurement type) and, for some effect size metrics, to the same units, as well as performing quality control checks and corrections, and finally, calculating the observed effect sizes and their variances.
6.1 Units and conversions
Before effect sizes are calculated for each study (i.e., each row of data), the meta-analyst should make any necessary conversions so that the response variable(s), covariates, and the group error measurements are comparable from one study to another. For some effect size metrics (see the Effect size calculation section below) like the response difference, that means that the response variable measurements that are used to calculate the effect sizes should all have the same dimensions (e.g. length vs. mass) and units (e.g. mm vs. cm). For other effect size metrics, like the log response ratio and other ratio-based metrics, units are not important because they cancel out in the effect size calculation. However, dimensions still matter when comparing ratio-based effect size metrics. (See the Effect Sizes module to learn more about when and why dimensions and units are important.)
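A short numerical check makes the difference between the two kinds of metrics concrete. In the Python sketch below, the group means and the conversion table are hypothetical; the point is only that a difference changes with the units, while a log ratio does not.

```python
import math

# Conversion factors to a common unit (millimetres).
TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}


def to_mm(value, unit):
    """Convert a length to millimetres."""
    return value * TO_MM[unit]


# Hypothetical group means reported in centimetres.
trt_cm, ctl_cm = 4.2, 3.0
trt_mm, ctl_mm = to_mm(trt_cm, "cm"), to_mm(ctl_cm, "cm")

# A difference depends on the units: it is 10 times larger in mm than in cm.
diff_cm = trt_cm - ctl_cm
diff_mm = trt_mm - ctl_mm

# A log ratio does not: the units cancel in the ratio.
lnrr_cm = math.log(trt_cm / ctl_cm)
lnrr_mm = math.log(trt_mm / ctl_mm)

print(round(diff_cm, 3), round(diff_mm, 3))
print(round(lnrr_cm, 6), round(lnrr_mm, 6))
```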
Additionally, calculating the within-study error of each observed effect size requires the variance for the two response groups being compared by the effect size (e.g., the experimental treatment and the control). However, primary studies vary in how they report these group errors, with some reporting standard deviations, and others reporting standard errors or confidence intervals. Therefore, it is helpful to convert all group errors (across all studies) to either standard deviations or variances before calculating effect sizes. Which error type you choose to standardize to can depend on whether you use a pre-made function from a software or R package like metafor, or write your own function. The escalc() function in metafor requires standard deviations to calculate effect sizes and their within-study variances.
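The conversions themselves are short: a standard error is the standard deviation divided by the square root of n, and the half-width of a confidence interval is the standard error multiplied by a critical value. The Python sketch below applies these relationships; the function name is hypothetical, and the default critical value of 1.96 assumes a 95% confidence interval with a normal approximation (for small samples, a t critical value is more appropriate).

```python
import math


def error_to_sd(error, n, error_type, z=1.96):
    """Convert a reported group error to a standard deviation.

    error_type: "SD", "SE", or "CI", where "CI" means the half-width of a
    confidence interval (the distance from the mean to one limit).
    z: critical value used to build the CI; 1.96 assumes a 95% interval
    and a normal approximation.
    """
    if error_type == "SD":
        return error
    if error_type == "SE":
        return error * math.sqrt(n)
    if error_type == "CI":
        return (error / z) * math.sqrt(n)
    raise ValueError(f"Unknown error type: {error_type}")


# Hypothetical values: each of these corresponds to an SD of about 1.2 with n = 16.
print(error_to_sd(1.2, 16, "SD"))
print(error_to_sd(0.3, 16, "SE"))
print(error_to_sd(0.588, 16, "CI"))
```

Doing these conversions in a script, with the original error and error type kept in the datasheet, preserves the raw values that were reported in each primary study.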
6.2 QA/QC
Quality assurance and quality control (QA/QC) should be used to identify and correct any errors in the data. This is ideally done before, during, and after data collection.
Before data extraction, QA/QC involves:
- Setting up a clean datasheet/database format
- Defining selection criteria for studies and which data to extract and use in the meta-analysis
- Defining all variables and their levels (if categorical), abbreviations, etc. in the dataset
- (This information should be organized into a meta-data file)
This preparation improves the efficiency and accuracy of data extraction, preparation, and analysis, and makes data more re-usable.
During data extraction, QA/QC involves:
- Keeping track of units
- Using a tidy data format (definition in Data extraction section)
- Updating the datasheet and meta-data as necessary when a new variable, level, measurement type, etc., is encountered during data extraction
Finally, after data extraction, QA/QC involves checking for:
- Missing data
- Transcription errors (e.g., are data in correct columns or rows? Are there typos?)
- Extraction or conversion errors (e.g., is any value unreasonable for that measurement or error type?)
Statistical summaries and/or graphs of the data (e.g., the number of observations/rows, mean values, error, outliers, etc.) can help identify missing data or transcription/extraction errors. Outliers can be checked to ensure that they don’t represent a data entry or extraction mistake that needs to be corrected. Additionally, double entry, which is when two separate researchers extract and enter the same data and then check it for agreement, can be used to ensure data quality. This approach is especially helpful for more subjective or qualitative data (and thus this approach is more common in social science meta-analyses).
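Some of these checks can be automated. The Python sketch below flags missing values and values that are unreasonable for their measurement type; the function name is hypothetical, the field names follow the datasheet template from the Data extraction section, and the rules shown are only a starting point.

```python
REQUIRED = [
    "mean_trtmt", "mean_control", "n_trtmt", "n_control",
    "error_trtmt", "error_control", "error_type",
]


def qc_flags(row):
    """Return a list of problems found in one row of the datasheet."""
    flags = []
    for field in REQUIRED:
        if row.get(field) in (None, ""):
            flags.append(f"missing {field}")
    for field in ("n_trtmt", "n_control"):
        value = row.get(field)
        # Sample sizes should be whole numbers of at least 1.
        if value not in (None, "") and (value < 1 or value != int(value)):
            flags.append(f"invalid {field}")
    for field in ("error_trtmt", "error_control"):
        value = row.get(field)
        # SDs, SEs, and CI half-widths cannot be negative.
        if value not in (None, "") and value < 0:
            flags.append(f"negative {field}")
    if row.get("error_type") not in (None, "", "SD", "SE", "CI"):
        flags.append("unknown error_type")
    return flags


# Hypothetical rows: the second has a missing sample size and a sign error.
good_row = {"mean_trtmt": 6.7, "mean_control": 8, "n_trtmt": 19, "n_control": 20,
            "error_trtmt": 1.1, "error_control": 1.2, "error_type": "SD"}
bad_row = {"mean_trtmt": 6.7, "mean_control": 8, "n_trtmt": 19, "n_control": None,
           "error_trtmt": -1.1, "error_control": 1.2, "error_type": "SD"}

print(qc_flags(good_row))
print(qc_flags(bad_row))
```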
6.3 Effect size calculations
After extracting data and performing all conversions and quality control checks, you can calculate the effect sizes and their variances. Effect sizes can be calculated in a variety of different ways depending on the research question and characteristics of the primary data and the ecological process of interest. These different forms of effect sizes are called effect size metrics. There are many considerations that go into the decision of which effect size metric to use for your specific research question, so we have dedicated a whole module to understanding the different effect size metrics and how they can be calculated here.
7 Reproducibility
The many decisions made at different points during the data collection and analysis steps of meta-analyses can have large influences on the results and how they should be interpreted. To make your meta-analysis as transparent, reproducible, and useful to other researchers as possible, document the methods for each step in the meta-analysis process (i.e., the systematic literature search, paper screening, data extraction, data preparation, effect size calculations, statistical analyses, etc.), and report these methods in the final meta-analysis manuscript or its supplementary files.
Specifically, you should provide:
- Information on the literature search and eligibility criteria for studies used in the meta-analysis (see the previous section on What to record for a reproducible literature search)
- A list of the full references for all the papers and other data sources that were used in the meta-analysis
- A description of any conversions or calculations performed on extracted data, including equations (or references to equations)
- A description of analysis methods (statistical models, ways of accounting for non-independence, etc.)
- Software used for data extraction, analysis, etc.
- The data, meta-data (as described in the Data extraction section), and code used for the meta-analysis
Other ways to improve the reproducibility and re-usability of meta-analysis data include keeping the raw or original data and using scripts whenever possible to make conversions and to calculate effects. In the final published meta-analysis, provide this raw data and code, including the group means, sample sizes, and error used to calculate the effect sizes and their variance. (Do not provide only the effect sizes that you calculated.) This allows future researchers to use the data for a variety of purposes, including to evaluate the influence of alternative analysis methods on the meta-analysis results, or to answer a completely different research question from the original meta-analysis.
8 Next steps
Learn how to choose and calculate the effect sizes and their variances for each study in the Effect Sizes module, and then learn about the different statistical models that you can use to analyze the effect size data.
9 References
Lajeunesse, MJ (2016) Facilitating systematic reviews, data extraction, and meta-analysis with the metagear package for R. Methods in Ecology and Evolution, 7, 323-330.
O’Dea, RE et al. (2021) Preferred reporting items for systematic reviews and meta-analyses in ecology and evolutionary biology: a PRISMA extension. Biological Reviews. doi: 10.1111/brv.12721.
Moher, D, Liberati, A, Tetzlaff, J, Altman, DG, for the PRISMA Working Group (2009) Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ, 339, b2535. doi: 10.1136/bmj.b2535.