Curate DataFrames and AnnDatas

Curating datasets typically means three things:

  1. Validate: ensure a dataset meets predefined validation criteria

  2. Standardize: transform a dataset so that it meets validation criteria, e.g., by fixing typos or using standardized identifiers

  3. Annotate: link a dataset against metadata records

In LaminDB, valid metadata is metadata that’s stored in a metadata registry, and a validation criterion merely defines a mapping onto a field of a registry.
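
To make this notion of validity concrete, here is a minimal plain-Python sketch. It is not LaminDB's implementation: a registry is modeled as a list of records, and a value counts as valid for a field if some record carries it in that field. The names `Record`, `ulabel_registry`, and `is_valid` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a registry record with a single `name` field."""

    name: str


# hypothetical in-memory registry; LaminDB stores records in a database instead
ulabel_registry = [Record(name="Experiment 1"), Record(name="Experiment 2")]


def is_valid(value: str, registry: list[Record], field: str = "name") -> bool:
    """Return True if any record in the registry holds `value` in `field`."""
    return any(getattr(record, field) == value for record in registry)


valid_existing = is_valid("Experiment 1", ulabel_registry)  # a record with this name exists
valid_missing = is_valid("Experiment 3", ulabel_registry)  # no record with this name
```

Under these assumptions, validation reduces to a membership test against one field of one registry.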

Example

"Experiment 1" is a valid value for ULabel.name if a record with this name exists in the ULabel registry.

# !pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --schema bionty
→ connected lamindb: testuser1/test-curate

Validate a DataFrame

Let’s start with a DataFrame that we’d like to validate.

import lamindb as ln
import bionty as bt
import pandas as pd


df = pd.DataFrame(
    {
        "temperature": [37.2, 36.3, 38.2],
        "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
        "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
        "donor": ["D0001", "D0002", "D0003"]
    },
    index = ["obs1", "obs2", "obs3"]
)
df
→ connected lamindb: testuser1/test-curate
/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/anndata/_io/__init__.py:12: FutureWarning: Importing read_zarr from `anndata._io` is deprecated. Please use anndata.io instead.
  warnings.warn(
      temperature                  cell_type  assay_ontology_id  donor
obs1         37.2  cerebral pyramidal neuron        EFO:0008913  D0001
obs2         36.3                  astrocyte        EFO:0008913  D0002
obs3         38.2            oligodendrocyte        EFO:0008913  D0003

Define validation criteria and create a Curator object.

# in the dictionary, each key is a column name of the dataframe, and each value
# is a registry field onto which values are mapped
categoricals = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}

# pass validation criteria
curate = ln.Curator.from_df(df, categoricals=categoricals)
✓ added 3 records with Feature.name for columns: 'cell_type', 'assay_ontology_id', 'donor'

The validate() method checks our data against the defined criteria. It identifies which values are already validated (exist in our registries) and which are potentially problematic (do not yet exist in our registries).

curate.validate()
• saving validated records of 'cell_type'
✓ added 2 records from public with CellType.name for cell_type: 'oligodendrocyte', 'astrocyte'
• saving validated records of 'assay_ontology_id'
• mapping cell_type on CellType.name
!    1 term is not validated: 'cerebral pyramidal neuron'
→ fix typo, remove non-existent value, or save term via .add_new_from('cell_type')
✓ 'assay_ontology_id' is validated against ExperimentalFactor.ontology_id
• mapping donor on ULabel.name
!    3 terms are not validated: 'D0001', 'D0002', 'D0003'
→ fix typos, remove non-existent values, or save terms via .add_new_from('donor')
False
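
The report above can be read as a per-column membership check. The following is a simplified sketch under that assumption, not the Curator's actual code: `validate_columns` and `registries` are hypothetical names, and the sets of known values are made up to mirror the output above.

```python
def validate_columns(columns, registries):
    """Collect, per column, the unique values missing from that column's registry.

    Returns (all_valid, non_validated), where non_validated maps column names
    to the sorted list of values that were not found.
    """
    non_validated = {}
    for column, known_values in registries.items():
        missing = sorted(set(columns[column]) - known_values)
        if missing:
            non_validated[column] = missing
    return not non_validated, non_validated


# toy data mirroring two columns of the DataFrame above
columns = {
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "donor": ["D0001", "D0002", "D0003"],
}
# hypothetical registry contents: two known cell types, no donors yet
registries = {
    "cell_type": {"astrocyte", "oligodendrocyte"},
    "donor": set(),
}

all_valid, non_validated = validate_columns(columns, registries)
```

In this sketch, columns without an entry in `registries` are not checked at all, and the overall result is True only when every checked column has no missing values.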

Register new metadata values

If you see “non-validated” values, you’ll need to decide whether to add them to your registries or “fix” them in your dataset.

For cell_type, we saw that ‘cerebral pyramidal neuron’ is not validated. Let’s find out which cell type in the public ontology might be the actual match.

# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curate.lookup()` to get a lookup object of existing records in your instance
lookup = curate.lookup(public=True)
lookup
Lookup objects from the public:
 .cell_type
 .assay_ontology_id
 .donor
 .columns
 
Example:
    → categories = validator.lookup()['cell_type']
    → categories.alveolar_type_1_fibroblast_cell

To look up public ontologies, use .lookup(public=True)
# here is an example for the "cell_type" column
cell_types = lookup["cell_type"]
cell_types.cerebral_cortex_pyramidal_neuron
CellType(ontology_id='CL:4023111', name='cerebral cortex pyramidal neuron', definition='A Pyramidal Neuron With Soma Located In The Cerebral Cortex.', synonyms=None, parents=array(['CL:0000598', 'CL:0010012'], dtype=object))
# fix the cell type
df.cell_type = df.cell_type.replace({"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name})
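
When browsing a lookup object is impractical, a fuzzy string match can shortlist candidates for a misspelled term. The sketch below uses Python's standard-library `difflib` and is not part of the LaminDB API; the `candidate_names` list is a made-up stand-in for names drawn from an ontology.

```python
import difflib

# made-up stand-in for cell type names from a public ontology
candidate_names = [
    "astrocyte",
    "oligodendrocyte",
    "pyramidal neuron",
    "cerebral cortex pyramidal neuron",
]

# rank candidates by similarity to the non-validated term, best match first
matches = difflib.get_close_matches(
    "cerebral pyramidal neuron", candidate_names, n=3, cutoff=0.6
)
best_match = matches[0] if matches else None
```

A fuzzy match is only a suggestion; it is worth confirming the term in the ontology before replacing values in the dataset.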

For donor, we want to add the new donors: “D0001”, “D0002”, “D0003”

# this adds donors that were _not_ validated
curate.add_new_from("donor")
✓ added 3 records with ULabel.name for donor: 'D0002', 'D0001', 'D0003'
# validate again
validated = curate.validate()
validated
• saving validated records of 'cell_type'
✓ 'cell_type' is validated against CellType.name
✓ 'assay_ontology_id' is validated against ExperimentalFactor.ontology_id
✓ 'donor' is validated against ULabel.name
True

Validate an AnnData

Here we additionally specify var_index, the registry field against which the variable index (var.index) is validated.

import anndata as ad

X = pd.DataFrame(
    {
        "ENSG00000081059": [1, 2, 3], 
        "ENSG00000276977": [4, 5, 6], 
        "ENSG00000198851": [7, 8, 9], 
        "ENSG00000010610": [10, 11, 12], 
        "ENSG00000153563": [13, 14, 15],
        "ENSGcorrupted": [16, 17, 18]
    }, 
    index=df.index
)

adata = ad.AnnData(X=X, obs=df)
adata
AnnData object with n_obs × n_vars = 3 × 6
    obs: 'temperature', 'cell_type', 'assay_ontology_id', 'donor'
curate = ln.Curator.from_anndata(
    adata, 
    var_index=bt.Gene.ensembl_gene_id,  # validate var.index against Gene.ensembl_gene_id
    categoricals=categoricals, 
    organism="human",
)
curate.validate()
• saving validated records of 'var_index'
✓ added 5 records from public with Gene.ensembl_gene_id for var_index: 'ENSG00000081059', 'ENSG00000276977', 'ENSG00000198851', 'ENSG00000010610', 'ENSG00000153563'
• mapping var_index on Gene.ensembl_gene_id
!    1 term is not validated: 'ENSGcorrupted'
→ fix typo, remove non-existent value, or save term via .add_new_from_var_index()
✓ 'cell_type' is validated against CellType.name
✓ 'assay_ontology_id' is validated against ExperimentalFactor.ontology_id
✓ 'donor' is validated against ULabel.name
False

Non-validated terms can be accessed via:

curate.non_validated
{'var_index': ['ENSGcorrupted']}

Subset the AnnData to validated genes only:

adata_validated = adata[:, ~adata.var.index.isin(curate.non_validated["var_index"])].copy()
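
The one-liner above builds a boolean mask over the variable names and inverts it with `~`, so that only variables absent from the non-validated list are kept. As a plain-Python sketch of the same logic (no AnnData involved; the names `rejected`, `keep`, and `rows_validated` are illustrative):

```python
var_names = [
    "ENSG00000081059", "ENSG00000276977", "ENSG00000198851",
    "ENSG00000010610", "ENSG00000153563", "ENSGcorrupted",
]
non_validated = {"var_index": ["ENSGcorrupted"]}

# True for every variable that is NOT among the non-validated terms
rejected = set(non_validated["var_index"])
keep = [name not in rejected for name in var_names]

# apply the column mask to each row of the 3 x 6 matrix defined above
rows = [
    [1, 4, 7, 10, 13, 16],
    [2, 5, 8, 11, 14, 17],
    [3, 6, 9, 12, 15, 18],
]
rows_validated = [[v for v, k in zip(row, keep) if k] for row in rows]
```

The result has the same three observations but only the five columns whose identifiers passed validation.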

Now let’s validate the subsetted object:

curate = ln.Curator.from_anndata(
    adata_validated, 
    var_index=bt.Gene.ensembl_gene_id,  # validate var.index against Gene.ensembl_gene_id
    categoricals=categoricals, 
    organism="human",
)

curate.validate()
✓ 'var_index' is validated against Gene.ensembl_gene_id
✓ 'cell_type' is validated against CellType.name
✓ 'assay_ontology_id' is validated against ExperimentalFactor.ontology_id
✓ 'donor' is validated against ULabel.name
True

Save a curated artifact

The validated object can be subsequently saved as an Artifact:

artifact = curate.save_artifact(description="test AnnData")
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
!    1 unique term (25.00%) is not validated for name: temperature

The saved artifact is annotated with validated features and labels:

artifact.describe()
Artifact(uid='zv0kSXZXFyD8AS2x0000', is_latest=True, description='test AnnData', suffix='.h5ad', type='dataset', size=20336, hash='8z6kAdTVBaDIDuA6aivzNg', n_observations=3, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-11-11 14:17:11 UTC)
  Provenance
    .storage = '/home/runner/work/lamindb/lamindb/docs/test-curate'
    .created_by = 'testuser1'
  Labels
    .cell_types = 'oligodendrocyte', 'astrocyte', 'cerebral cortex pyramidal neuron'
    .experimental_factors = 'single-cell RNA sequencing'
    .ulabels = 'D0002', 'D0001', 'D0003'
  Features
    'assay_ontology_id' = 'single-cell RNA sequencing'
    'cell_type' = 'astrocyte', 'cerebral cortex pyramidal neuron', 'oligodendrocyte'
    'donor' = 'D0001', 'D0002', 'D0003'
  Feature sets
    'var' = 'TCF7', 'PDCD1', 'CD3E', 'CD4', 'CD8A'
    'obs' = 'cell_type', 'assay_ontology_id', 'donor'

We’ve walked through validating, standardizing, and annotating datasets in these key steps:

  1. Defining validation criteria

  2. Validating data against existing registries

  3. Adding new validated entries to registries

  4. Annotating artifacts with validated metadata

By following these steps, you can ensure your data is standardized and well-curated.

If you have datasets that aren’t DataFrame-like or AnnData-like, read: Curate datasets of any format.