Microbiome Metadata Crisis (MMC) Project

Abstract

The human microbiome is vital to health, yet despite the availability of standards such as MIxS and STORMS, the field lacks centralized, AI-ready structured data and metadata. Large-scale data analysis is hindered by a Microbiome Metadata Crisis, in which critical variables such as disease status are hidden in inconsistent, study-specific terminology or never shared at all. This fragmentation prevents the use of AI for large-scale discovery of microbiome-driven therapies. While a wealth of microbiome data exists in public repositories, it is unclear what percentage of this data is reusable (i.e., sample FASTQ files with proper annotations of source environment, treatment group, and metadata-linked sample IDs). To quantify AI-readiness in the microbiome field, we conducted a large-scale review of 2,300 human microbiome papers spanning diverse geographies, demographics, disease contexts, and journal impact factors. We recorded whether studies provided accession codes for their sequence data, and whether critical annotations for data reuse were present in the public repository or linked to the manuscript itself. Our analysis reveals a major disconnect between data sharing and utility. Although 63.3% of studies provide public accession codes, a figure that is still shockingly low in light of widespread journal and funding agency mandates to release DNA sequence data, only 10.8% provided sequence data with biological annotations relevant to the study. An additional 11.1% of studies were reusable when considering custom author-uploaded metadata on repositories, and an additional 8.0% were reusable when also including custom author-uploaded metadata provided in their manuscripts. Public availability does not equate to scientific utility. To resolve this data reusability crisis, we provide concrete steps to guide the field towards AI-readiness.
Harmonizing our wealth of public microbiome data is essential for AI-driven drug discovery, therapeutics and public health recommendations that leverage the microbiome.
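
The reusability figures in the abstract can be tallied as a short worked example. This is only a sketch: it assumes the three percentages are disjoint shares of all reviewed studies that add up, which the wording "an additional" suggests but the abstract does not state outright. The function and variable names are illustrative.

```python
from itertools import accumulate

# Percentages as reported in the abstract.
# Assumption: each tier is a disjoint share of ALL reviewed studies.
SHARED_ACCESSION = 63.3
TIERS = [
    ("repository annotations only", 10.8),
    ("plus author-uploaded metadata on repositories", 11.1),
    ("plus author-uploaded metadata in manuscripts", 8.0),
]

def cumulative_reusability(tiers):
    """Return (label, running total in percent) for each reusability tier."""
    labels = [label for label, _ in tiers]
    totals = [round(t, 1) for t in accumulate(pct for _, pct in tiers)]
    return list(zip(labels, totals))

for label, total in cumulative_reusability(TIERS):
    print(f"{total:4.1f}% reusable ({label})")

# Studies that shared an accession code but are not reusable under any tier.
best_case = cumulative_reusability(TIERS)[-1][1]
print(f"{round(SHARED_ACCESSION - best_case, 1)}% shared data without reusable metadata")
```

Under this reading, roughly 29.9% of studies are reusable even in the most generous case, and about a third of all studies shared sequence data that cannot be reused.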

PEOPLE (280 total undergraduates involved)

Here is a snapshot of a typical MMC meeting

The figures below showcase our meta-analysis across various factors. We sampled over 2,300 human microbiome papers in just under 4 months, recording over 60 metadata variables per study! Stay tuned for our 2026 publications.

Human microbiome publications over time

This bar graph shows the number of papers per year included in our meta-analysis. Notice the rapid rise in the late 2010s as high-throughput sequencing becomes more available.

Global diversity of human microbiome research

This figure shows the country of origin of the studies included in our project. Notably, the United States and China together represent a huge proportion of microbiome publications.

Sequencing types over time

This bullet graph reveals the trend in the type of sequencing used in the papers included in our study. Notice the consistent popularity of 16S despite its age as a technology.

REUSABILITY OVER TIME

This bar graph reveals trends in reusability over time. For a study to be reusable, independent researchers must be able to obtain the same results using the same methods and data. Fully reproducible datasets remain very rare, despite efforts like STORMS and MIxS, and this is a crucial concern.

DISTRIBUTION OF REPRODUCIBILITY ACROSS TIME IN SAMPLED PAPERS

Our analysis included finding each paper's metadata, checking that the SampleID in SRA or another genomic database matches the metadata, and confirming that the entry in the genomic database is present and distinguishable.
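
A minimal sketch of what the SampleID matching step could look like if automated. It does not reproduce the project's actual procedure: the function name, the whitespace cleanup, and the "fully linked" criterion (every repository sample has a metadata row) are all assumptions made for illustration.

```python
def check_sample_id_linkage(metadata_ids, repository_ids):
    """Compare sample IDs from a paper's metadata with IDs in a repository entry.

    Hypothetical helper: both arguments are iterables of ID strings.
    """
    # Normalize by trimming whitespace and dropping empty values.
    meta = {s.strip() for s in metadata_ids if s and s.strip()}
    repo = {s.strip() for s in repository_ids if s and s.strip()}
    return {
        "matched": sorted(meta & repo),
        "metadata_only": sorted(meta - repo),
        "repository_only": sorted(repo - meta),
        # Assumed criterion: every sequenced sample can be annotated.
        "fully_linked": bool(repo) and repo <= meta,
    }

report = check_sample_id_linkage(["S1", "S2 ", "S3"], ["S1", "S2", "S9"])
print(report["repository_only"])  # samples with sequence data but no metadata
```

Listing the unmatched IDs on both sides, rather than returning a single yes or no, makes it easier to see whether a mismatch is a naming problem or genuinely missing metadata.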

FLOW CHART

This flow chart depicts the steps taken during our project to decide if studies were replicable.

SUBPROJECTS

Assessing the Clarity of Sample Size Reporting in Clinical Cancer Microbiome Research

This project investigates how clinical cancer microbiome studies report analytical sample sizes, focusing on clarity, consistency, and machine-readability of “n=” notation across published literature. Using a systematically curated dataset of 2,305 microbiome papers identified through PubMed, it analyzes a refined subset of 67 human cancer studies to examine sample size reporting practices within text-readable manuscripts. By quantifying ambiguities in how sample sizes are defined and contextualized, the project highlights a critical but understudied dimension of reproducibility and data transparency in microbiome science, with implications for meta-analyses, data reuse, and AI-based research synthesis.
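
To illustrate what machine-readability of "n=" notation means in practice, here is a minimal sketch that pulls sample sizes out of manuscript text with a regular expression. The pattern and function name are assumptions for illustration, not the subproject's actual method, and the pattern only handles the simplest forms of the notation.

```python
import re

# Matches "n=67", "n = 67", "N=1,204"; thousands separators are allowed.
N_PATTERN = re.compile(r"\bn\s*=\s*(\d{1,3}(?:,\d{3})+|\d+)", re.IGNORECASE)

def extract_sample_sizes(text):
    """Return every integer that follows an 'n=' marker, in order of appearance."""
    return [int(m.group(1).replace(",", "")) for m in N_PATTERN.finditer(text)]

sentence = "Patients (n = 67) were compared with controls (N=1,204)."
print(extract_sample_sizes(sentence))
```

Even when extraction succeeds, the numbers alone do not say what each "n" counts (patients, samples, or sequencing runs), which is the kind of ambiguity the project quantifies.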

Examining Transregional Collaboration

This project maps links between geographical regions based on the last author of each microbiome study. Using 2,305 studies curated by 150+ undergraduates, it will allow us to relate reproducibility to different types of collaboration, investigating whether, and to what extent, collaboration correlates with reproducible studies in the microbiome field. By connecting these geographical links to a paper's reproducibility, this project will outline the importance of the collaboration that mass globalization now makes possible.

Neurodegenerative microbiome meta-analysis

Here, we sample hundreds of neurodegenerative microbiome papers to synthesize a per-disease core microbiome, if one exists. The result will give researchers and clinicians a valuable resource, since there is currently no consensus on how the microbiome impacts diseases like Alzheimer's and Parkinson's, despite the wealth of papers on the subject.
