Curate DataFrames and AnnDatas

Curating datasets typically means three things:

Validate: ensure a dataset meets predefined validation criteria
Standardize: transform a dataset so that it meets validation criteria, e.g., by fixing typos or using standardized identifiers
Annotate: link a dataset against metadata records

In LaminDB, valid metadata is metadata that’s stored in a metadata registry, and a validation criterion merely defines a mapping onto a field of a registry.

Example

"Experiment 1" is a valid value for ULabel.name if a record with this name exists in the ULabel registry.

# !pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --schema bionty
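
To make the notion of a "valid" value concrete, here is a minimal sketch in plain Python. It does not use LaminDB: the `ulabel_registry` list and the `is_valid` helper are hypothetical stand-ins for a registry and a field lookup, and the records are made up for illustration. The point is that validity is always relative to one field of one registry.

```python
# Toy stand-in for a registry: a list of records, each with several fields.
# These records are illustrative only and are not fetched from LaminDB.
ulabel_registry = [
    {"name": "Experiment 1", "description": "pilot run"},
    {"name": "Experiment 2", "description": "follow-up run"},
]


def is_valid(value, registry, field):
    """Return True if some record in `registry` holds `value` in `field`."""
    return any(record[field] == value for record in registry)


print(is_valid("Experiment 1", ulabel_registry, "name"))         # True
print(is_valid("Experiment 3", ulabel_registry, "name"))         # False
print(is_valid("Experiment 1", ulabel_registry, "description"))  # False
```

The last call shows why a criterion names a field and not just a registry: the same string is valid for `name` but not for `description`.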

Validate a DataFrame

Let’s start with a DataFrame that we’d like to validate.

import lamindb as ln
import bionty as bt
import pandas as pd


df = pd.DataFrame(
    {
        "temperature": [37.2, 36.3, 38.2],
        "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
        "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
        "donor": ["D0001", "D0002", "D0003"]
    },
    index=["obs1", "obs2", "obs3"],
)
df


→ connected lamindb: testuser1/test-curate

      temperature  cell_type                  assay_ontology_id  donor
obs1  37.2         cerebral pyramidal neuron  EFO:0008913        D0001
obs2  36.3         astrocyte                  EFO:0008913        D0002
obs3  38.2         oligodendrocyte            EFO:0008913        D0003

Define validation criteria and create a Curator object.

# in the dictionary, each key is a column name of the dataframe, and each value
# is a registry field onto which values are mapped
categoricals = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}

# pass validation criteria
curate = ln.Curator.from_df(df, categoricals=categoricals)

The validate() method checks our data against the defined criteria. It identifies which values are already validated (exist in our registries) and which are potentially problematic (do not yet exist in our registries).

curate.validate()
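
As a rough mental model of what such a check reports, the sketch below does a comparable membership test with pandas alone. It is not LaminDB's implementation: `registry_values` is an assumed snapshot of what the registries might contain, and `find_non_validated` is a hypothetical helper written for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
        "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
        "donor": ["D0001", "D0002", "D0003"],
    },
    index=["obs1", "obs2", "obs3"],
)

# Assumed registry contents, keyed by the DataFrame column they validate.
# In LaminDB these values would live in registry fields, not in a dict.
registry_values = {
    "cell_type": {"cerebral cortex pyramidal neuron", "astrocyte", "oligodendrocyte"},
    "assay_ontology_id": {"EFO:0008913"},
    "donor": set(),  # no donors registered yet
}


def find_non_validated(frame, registries):
    """Map each column to the unique values that are missing from its registry."""
    non_validated = {}
    for column, known in registries.items():
        missing = [v for v in list(frame[column].unique()) if v not in known]
        if missing:
            non_validated[column] = missing
    return non_validated


non_validated = find_non_validated(df, registry_values)
print(non_validated)
```

Under these assumptions, the report flags the misspelled cell type and all three donors, while the assay column passes.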

Register new metadata values

If you see “non-validated” values, you’ll need to decide whether to add them to your registries or “fix” them in your dataset.

For cell_type, we saw that ‘cerebral pyramidal neuron’ is not validated. Let’s find out which cell type in the public ontology might be the actual match.

# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curate.lookup()` to get a lookup object of existing records in your instance
lookup = curate.lookup(public=True)
lookup

# here is an example for the "cell_type" column
cell_types = lookup["cell_type"]
cell_types.cerebral_cortex_pyramidal_neuron

# fix the cell type
df.cell_type = df.cell_type.replace({"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name})
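
When browsing a lookup object is impractical, approximate string matching can suggest candidates. The sketch below uses `difflib` from the standard library as a stand-in; it is not how `curate.lookup()` works, and `known_cell_types` is an assumed, hand-written list of ontology names.

```python
import difflib

import pandas as pd

# Assumed list of valid names; in practice these would come from the ontology.
known_cell_types = [
    "cerebral cortex pyramidal neuron",
    "astrocyte",
    "oligodendrocyte",
]

cell_type = pd.Series(
    ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    index=["obs1", "obs2", "obs3"],
)

query = "cerebral pyramidal neuron"

# Rank known names by similarity to the non-validated value, best match first.
candidates = difflib.get_close_matches(query, known_cell_types, n=3, cutoff=0.6)
print(candidates)

# Apply the top suggestion only if one was found; review it before trusting it.
if candidates:
    cell_type = cell_type.replace({query: candidates[0]})
print(cell_type.tolist())
```

A fuzzy match is only a suggestion. Confirming it against the ontology record, as done with the lookup object, avoids silently mapping a value to the wrong term.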

For donor, we want to add the new donors: “D0001”, “D0002”, “D0003”.

# this adds donors that were _not_ validated
curate.add_new_from("donor")

# validate again
validated = curate.validate()
validated

Validate an AnnData

Here, we additionally specify the registry field against which var.index is validated.

import anndata as ad

X = pd.DataFrame(
    {
        "ENSG00000081059": [1, 2, 3], 
        "ENSG00000276977": [4, 5, 6], 
        "ENSG00000198851": [7, 8, 9], 
        "ENSG00000010610": [10, 11, 12], 
        "ENSG00000153563": [13, 14, 15],
        "ENSGcorrupted": [16, 17, 18]
    }, 
    index=df.index
)

adata = ad.AnnData(X=X, obs=df)
adata

curate = ln.Curator.from_anndata(
    adata, 
    var_index=bt.Gene.ensembl_gene_id,  # validate var.index against Gene.ensembl_gene_id
    categoricals=categoricals, 
    organism="human",
)

curate.validate()

Non-validated terms can be accessed via:

curate.non_validated

Subset the AnnData to validated genes only:

adata_validated = adata[:, ~adata.var.index.isin(curate.non_validated["var_index"])].copy()
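
The subsetting relies on a boolean mask over the variable index. The sketch below shows the same masking logic on a plain pandas DataFrame, so it runs without AnnData; the `non_validated_genes` list is an assumed stand-in for `curate.non_validated["var_index"]`.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "ENSG00000081059": [1, 2, 3],
        "ENSG00000276977": [4, 5, 6],
        "ENSGcorrupted": [16, 17, 18],
    },
    index=["obs1", "obs2", "obs3"],
)

# Assumed result of validation: identifiers not found in the gene registry.
non_validated_genes = ["ENSGcorrupted"]

# isin() marks the columns to drop; ~ inverts the mask to mark the ones to keep.
keep = ~X.columns.isin(non_validated_genes)

# .copy() gives an independent object instead of a view on the original data.
X_validated = X.loc[:, keep].copy()
print(list(X_validated.columns))
```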

Now let’s validate the subsetted object:

curate = ln.Curator.from_anndata(
    adata_validated, 
    var_index=bt.Gene.ensembl_gene_id,  # validate var.index against Gene.ensembl_gene_id
    categoricals=categoricals, 
    organism="human",
)

curate.validate()

Save a curated artifact

The validated object can then be saved as an Artifact:

artifact = curate.save_artifact(description="test AnnData")

The saved artifact has been annotated with validated features and labels:

artifact.describe()

We’ve walked through validating, standardizing, and annotating datasets in these key steps:

Defining validation criteria
Validating data against existing registries
Adding new validated entries to registries
Annotating artifacts with validated metadata

By following these steps, you can ensure your data is standardized and well-curated.

If you have datasets that aren’t DataFrame-like or AnnData-like, read: Curate datasets of any format.