[Preprint]. 2023 Mar 7:2023.03.06.531314.
doi: 10.1101/2023.03.06.531314.

Matrix and analysis metadata standards (MAMS) to facilitate harmonization and reproducibility of single-cell data

Yichen Wang et al. bioRxiv.

Abstract

A large number of genomic and imaging datasets are being produced by consortia that seek to characterize healthy and diseased tissues at single-cell resolution. While much effort has been devoted to capturing biospecimen information and experimental procedures, metadata standards that describe data matrices and the analysis workflows that produced them are relatively lacking. Detailed metadata schemas for data analysis are needed to facilitate sharing and interoperability across groups and to promote data provenance for reproducibility. To address this need, we developed the Matrix and Analysis Metadata Standards (MAMS) to serve as a resource for data coordinating centers and tool developers. We first curated several simple and complex "use cases" to characterize the types of feature-observation matrices (FOMs), annotations, and analysis metadata produced in different workflows. Based on these use cases, metadata fields were defined to describe the data contained within each matrix, including fields related to processing, modality, and subsets. Suggested terms were created for the majority of fields to aid in harmonizing metadata terms across groups. Additional provenance metadata fields were defined to describe the software and workflows that produced each FOM. Finally, we developed a simple list-like schema that can be used to store MAMS information and that can be implemented in multiple formats. Overall, MAMS can be used as a guide to harmonize analysis-related metadata, which will ultimately facilitate integration of datasets across tools and consortia. MAMS specifications, use cases, and examples can be found at https://github.com/single-cell-mams/mams/.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1. Overview of matrix classes included in MAMS.
Feature and observation matrices (FOMs) contain biological data at different stages of processing, including reduced dimensional representations. Feature annotation matrices (FEA) and observation annotation matrices (OBS) store annotations such as additional IDs or labels, quality control metrics, and cluster labels. The Observation Neighborhood Graph (ONG) and Feature Neighborhood Graph (FNG) classes store information related to the correlation, similarity, or distance between pairs of observations or features, respectively. The Observation ID (OID) and Feature ID (FID) classes store unique identifiers for individual observations and features, respectively. The Record (REC) class is a special set of fields for storing information related to data and tool provenance.
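The classes in this caption can be written down as a small enumeration, which may help tool developers check that a matrix has been assigned a recognized class. This is a minimal sketch, not part of the MAMS specification: the descriptions are paraphrased from the caption, the abbreviation `FID` is used here by analogy with OID, and the helper `axis_of` is hypothetical.

```python
from enum import Enum


class MatrixClass(Enum):
    """Matrix classes named in the Figure 1 caption (descriptions paraphrased)."""

    FOM = "feature and observation matrix"
    FEA = "feature annotation matrix"
    OBS = "observation annotation matrix"
    ONG = "observation neighborhood graph"
    FNG = "feature neighborhood graph"
    OID = "observation ID"
    FID = "feature ID"  # abbreviation assumed by analogy with OID
    REC = "record (data and tool provenance)"


def axis_of(cls: MatrixClass) -> str:
    """Hypothetical helper: report which axis of the data a class describes."""
    if cls in (MatrixClass.OBS, MatrixClass.ONG, MatrixClass.OID):
        return "observation"
    if cls in (MatrixClass.FEA, MatrixClass.FNG, MatrixClass.FID):
        return "feature"
    # A FOM spans both axes; REC describes provenance rather than an axis.
    return "both" if cls is MatrixClass.FOM else "none"
```

For example, `axis_of(MatrixClass.ONG)` returns `"observation"`, reflecting that a neighborhood graph over observations is indexed by observation IDs on both sides.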
Figure 2. Matrices produced during a simple analysis workflow for single cell RNA-seq data.
Several steps are commonly performed in analysis workflows for scRNA-seq data generated with high-throughput devices. Observations are first filtered to exclude empty droplets and poor-quality cells. Quality control metrics can be stored in an OBS annotation data frame. Preprocessing of the data matrix includes normalization and standardization of features (e.g., z-scoring). From the scaled data, a subset of highly variable genes is used as input to Principal Component Analysis (PCA). The reduced dimensional space from PCA is then used as input to 2D embedding tools such as tSNE and UMAP, as well as to clustering algorithms such as k-means and Leiden.
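To make the sequence of matrices concrete, the first steps of such a workflow can be sketched on toy data using only the Python standard library. Everything here is illustrative: the counts, the `MIN_TOTAL` threshold, the `SCALE` factor, and the choice of variance as the variability measure are assumptions, not values from the source. PCA, 2D embedding, and clustering are left out because they would normally be delegated to a dedicated analysis package.

```python
import math
import statistics

# Toy count matrix: keys are features (genes), list positions are observations (cells).
counts = {
    "geneA": [10, 0, 12, 0, 9],
    "geneB": [5, 0, 5, 1, 5],
    "geneC": [0, 0, 30, 0, 2],
}
n_cells = 5
MIN_TOTAL = 5    # assumed QC threshold on total counts per cell
SCALE = 10_000   # assumed library-size scaling factor
N_VARIABLE = 2   # assumed number of highly variable genes to keep

# Step 1: compute per-cell QC metrics and filter out low-count cells.
totals = [sum(counts[g][j] for g in counts) for j in range(n_cells)]
keep = [j for j, total in enumerate(totals) if total >= MIN_TOTAL]
# These per-observation metrics are what an OBS annotation would hold.
qc = {"total_counts": totals, "passed_filter": [t >= MIN_TOTAL for t in totals]}

# Step 2: normalize each kept cell by its total, rescale, and log-transform.
lognorm = {
    g: [math.log1p(counts[g][j] / totals[j] * SCALE) for j in keep]
    for g in counts
}


# Step 3: standardize each feature (z-score across the kept cells).
def zscore(values):
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]


scaled = {g: zscore(v) for g, v in lognorm.items()}

# Step 4: pick the most variable features and subset the scaled matrix.
variances = {g: statistics.pvariance(v) for g, v in lognorm.items()}
hvg = sorted(variances, key=variances.get, reverse=True)[:N_VARIABLE]
pca_input = {g: scaled[g] for g in hvg}  # this subset would be passed to PCA
```

Each intermediate object (`counts`, `lognorm`, `scaled`, `pca_input`) corresponds to a separate feature-observation matrix at a different stage of processing, while `qc` plays the role of an observation annotation.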
Figure 3. Example of MAMS list format.
Because the ability to implement and store matrix- and analysis-related metadata varies across software platforms and data objects, we created a simple list-like structure to capture the relevant MAMS fields for each matrix. This structure can be stored in configuration file formats such as JSON and YAML, or in general metadata or unstructured slots within data objects. Each dataset has its own entry in the list, and within each dataset there is an entry for each class of matrix. Each matrix is denoted by a unique ID, and its MAMS fields are stored as key-value pairs under that ID. The additional fields specified in this implementation, including filepath and accessor, can be used to point to matrices stored in any flat file format or within a data object.
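The nesting described in this caption (dataset, then matrix class, then matrix ID, then key-value fields) can be sketched as a plain dictionary that round-trips through JSON. This is an illustration under stated assumptions rather than the official schema: `filepath` and `accessor` are named in the caption, the keys `processing` and `modality` are guessed from the field categories mentioned in the abstract, and all IDs, file names, and values are hypothetical. The authoritative field names are in the MAMS repository.

```python
import json

# Hypothetical MAMS-style list: dataset -> matrix class -> matrix ID -> fields.
mams = {
    "dataset1": {
        "FOM": {
            "fom1": {
                "filepath": "dataset1.h5ad",  # hypothetical file
                "accessor": "X",              # hypothetical accessor string
                "processing": "counts",       # assumed key and value
                "modality": "rna",            # assumed key and value
            },
            "fom2": {
                "filepath": "dataset1.h5ad",
                "accessor": "layers['lognorm']",
                "processing": "lognormalized",
                "modality": "rna",
            },
        },
        "OBS": {
            "obs1": {"filepath": "dataset1.h5ad", "accessor": "obs"},
        },
    }
}


def locate(store, dataset, matrix_class, matrix_id):
    """Return the (filepath, accessor) pair that points to one matrix."""
    entry = store[dataset][matrix_class][matrix_id]
    return entry["filepath"], entry["accessor"]


# Serialize to JSON text and read it back, as a config file would be.
text = json.dumps(mams, indent=2)
restored = json.loads(text)
```

Calling `locate(restored, "dataset1", "FOM", "fom1")` returns the file path and accessor for the raw count matrix, which is the kind of lookup a downstream tool would need in order to load a matrix by its MAMS ID. The same dictionary could be written to YAML with a YAML library without changing its structure.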
