A comprehensive literature review (2014–2025) was conducted to evaluate the reproducibility of microbiome metadata in original research articles. Studies were identified via PubMed using microbiome and sequencing-related keywords. Reviews and meta-analyses were excluded. Each study's sequencing method (16S, shotgun, or both), availability of supplementary metadata (e.g., sample IDs, biological and demographic data), and data accessibility (e.g., GitHub, NCBI, SRA) were recorded.
Researchers systematically extracted and validated metadata and sequencing information using a structured checklist. Key metadata elements (such as sample IDs, age, gender, health status, and environmental context) were analyzed. Metadata reproducibility was assessed based on (1) whether metadata was provided, (2) whether sample IDs matched those in SRA Explorer, and (3) whether data were differentiable for downstream analysis, including machine learning use.
Validation included cross-checking accession codes and confirming consistency between metadata files and public repositories. Studies were reviewed by multiple individuals to ensure reliability and transparency. The final dataset was used to assess trends in metadata completeness, publication year, and journal impact.
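The three reproducibility criteria described above can be sketched in code. This is a minimal illustration, not the checklist the reviewers actually used. The function name, the column names `sample_id` and `health_status`, the exact-match rule for sample IDs, and the reading of "differentiable" as "every sample carries a group label and at least two groups exist" are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def assess_metadata(
    metadata: Optional[List[Dict[str, str]]],
    sra_sample_ids: List[str],
    id_field: str = "sample_id",         # hypothetical column name
    group_field: str = "health_status",  # hypothetical column name
) -> Dict[str, bool]:
    """Apply the three checklist criteria to one study (illustrative only)."""
    # (1) Was any metadata provided at all?
    if not metadata:
        return {"provided": False, "ids_match": False, "differentiable": False}

    # (2) Do the sample IDs in the metadata match the repository IDs?
    # Assumption: a match means the two ID sets are exactly equal.
    meta_ids = {row.get(id_field, "").strip() for row in metadata}
    meta_ids.discard("")
    ids_match = bool(meta_ids) and meta_ids == set(sra_sample_ids)

    # (3) Assumed reading of "differentiable": every sample has a non-empty
    # group label and at least two distinct groups are present.
    labels = [row.get(group_field, "").strip() for row in metadata]
    differentiable = all(labels) and len(set(labels)) >= 2

    return {
        "provided": True,
        "ids_match": ids_match,
        "differentiable": differentiable,
    }
```

A study would count as fully reproducible under this sketch only when all three values are true. A looser rule for criterion (2), such as requiring the metadata IDs to be a subset of the repository IDs, may suit some datasets better.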
This project investigates how clinical cancer microbiome studies report analytical sample sizes, focusing on clarity, consistency, and machine-readability of “n=” notation across published literature. Using a systematically curated dataset of 2,305 microbiome papers identified through PubMed, it analyzes a refined subset of 67 human cancer studies to examine sample size reporting practices within text-readable manuscripts. By quantifying ambiguities in how sample sizes are defined and contextualized, the project highlights a critical but understudied dimension of reproducibility and data transparency in microbiome science, with implications for meta-analyses, data reuse, and AI-based research synthesis.
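To make the idea of machine-readable "n=" notation concrete, the sketch below shows one way such mentions could be located in manuscript text. It is a hypothetical illustration and not the project's extraction pipeline. The pattern, the function name `find_sample_sizes`, and the 40-character context window are assumptions.

```python
import re
from typing import List, Tuple

# Matches "n=45", "n = 45", "N=1,200": the letter n, an equals sign, and an
# integer with optional thousands separators.
N_PATTERN = re.compile(r"\b[nN]\s*=\s*(\d{1,3}(?:,\d{3})+|\d+)")


def find_sample_sizes(text: str, window: int = 40) -> List[Tuple[int, str]]:
    """Return (value, preceding context) for each "n=" mention in text."""
    results = []
    for match in N_PATTERN.finditer(text):
        value = int(match.group(1).replace(",", ""))
        # Keep a short window of preceding text, because the ambiguity usually
        # lies in what the number counts (patients, samples, reads).
        start = max(0, match.start() - window)
        context = text[start:match.end()].strip()
        results.append((value, context))
    return results


if __name__ == "__main__":
    sentence = "Stool samples from patients (n = 45) and controls (N=1,200) were sequenced."
    for value, context in find_sample_sizes(sentence):
        print(value, "|", context)
```

A pattern like this finds the number but not what it refers to, which is the kind of ambiguity the project sets out to quantify.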
This project examines links between geographical regions in microbiome data studies, based on each study's last author. Using 2,305 studies curated by 150+ undergraduates, it relates reproducibility to different types of collaboration, investigating whether collaboration, and how much of it, correlates with reproducible studies in the microbiome data field. By connecting these geographical links to a paper's reproducibility, the project will outline the importance of collaboration, which is now easier to facilitate in a time of mass globalization.
This bar graph illustrates the number of papers per year included in our meta-analysis.
This figure illustrates the nodes from which the studies included in our project stem. Notably, the United States and China are the two most represented countries in our project.
This bullet graph shows the trend in the type of sequencing used in the papers included in our study. The use of shotgun sequencing increased up until 2023, when it decreased, and then increased again in 2024.
This bullet graph shows trends in reproducibility over time. For a study to be reproducible, independent researchers must be able to obtain the same results using the same methods and data. Fully reproducible datasets remain very rare, which is potentially a crucial concern.
Our analysis included locating a paper's metadata, confirming that the SampleID in the SRA/genomic database matches the metadata, and confirming that the entry in the genomic database is present and differentiable.
This flow chart depicts the steps taken during our project to decide whether studies were replicable.