Curate datasets of any format

The previous guide explained how to validate, standardize, and annotate DataFrame and AnnData objects. This guide walks through the basic API that lets you curate data in any format.

How do I validate based on a public ontology?

LaminDB makes it easy to validate categorical variables based on registries that inherit from CanValidate.

CanValidate methods validate against the registries in your LaminDB instance. In Manage biological registries, you’ll see how to extend standard validation to validation against public references using a ReferenceTable ontology object: public = Record.public(). By default, from_values() considers a match in a public reference a validated value for any bionty entity.

# !pip install 'lamindb[bionty,zarr]'
!lamin init --storage ./test-curate-any --schema bionty
→ connected lamindb: testuser1/test-curate-any
import lamindb as ln
import bionty as bt
import zarr
import numpy as np

data = zarr.create((10,), dtype=[('value', 'f8'), ("gene", "U15"), ('disease', 'U16')], store='data.zarr')
data["gene"] = ["ENSG00000139618", "ENSG00000141510", "ENSG00000133703", "ENSG00000157764", "ENSG00000171862", "ENSG00000091831", "ENSG00000141736", "ENSG00000133056", "ENSG00000146648", "ENSG00000118523"]
data["disease"] = np.random.choice(['MONDO:0004975', 'MONDO:0004980'], 10)

Define validation criteria

Entities that don’t have a dedicated registry (“are not typed”) can be validated & registered using ULabel:

criteria = {
    "disease": bt.Disease.ontology_id,
    "project": ln.ULabel.name,
    "gene": bt.Gene.ensembl_gene_id,
}
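
One way to read this mapping: each key names a column of the dataset, and each value names the registry field that holds its reference values. The sketch below mimics that idea with plain Python sets standing in for registry fields. The names `reference_values` and `check_columns` are hypothetical and not part of LaminDB.

```python
# Plain-Python stand-in: column name -> set of allowed reference values.
# In LaminDB, the values would be registry fields such as bt.Disease.ontology_id.
reference_values = {
    "disease": {"MONDO:0004975", "MONDO:0004980"},
    "project": {"Project A", "Project B"},
}


def check_columns(columns, reference_values):
    """Return, per column, the sorted values without an exact match in the reference."""
    unmatched = {}
    for name, values in columns.items():
        allowed = reference_values.get(name, set())
        unmatched[name] = sorted(set(values) - allowed)
    return unmatched


columns = {
    "disease": ["MONDO:0004975", "MONDO:0004980", "MONDO:0004975"],
    "project": ["Project A", "Project C"],  # "Project C" is not a known label
}
result = check_columns(columns, reference_values)
print(result)
```

Keeping the criteria in one dictionary makes it easy to loop over all columns with the same logic instead of writing one check per column.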

Validate and standardize metadata

validate() validates passed values against reference values in a registry. It returns a boolean vector indicating whether a value has an exact match in the reference values.

bt.Disease.validate(data["disease"], field=bt.Disease.ontology_id)
! Your Disease registry is empty, consider populating it first!
   → use `.import_source()` to import records from a source, e.g. a public ontology
array([False, False, False, False, False, False, False, False, False,
       False])
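
To make the "exact match" semantics concrete, here is a minimal plain-Python sketch. It is not LaminDB's implementation: the `validate` function below is a local stand-in, and the registries are ordinary lists. The empty-registry case reproduces the all-False result shown above.

```python
def validate(values, reference):
    """Return one boolean per value: True only on an exact match in `reference`."""
    reference = set(reference)
    return [value in reference for value in values]


values = ["MONDO:0004975", "MONDO:0004980", "mondo:0004975"]

empty_registry = []  # nothing registered yet
populated_registry = ["MONDO:0004975", "MONDO:0004980"]

flags_empty = validate(values, empty_registry)
flags_populated = validate(values, populated_registry)
print(flags_empty)
print(flags_populated)
```

In this sketch, the lowercase variant fails because only exact string matches count.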

When validation fails, you can call inspect() to figure out what to do.

inspect() applies the same definition of validation as validate(), but returns a richer InspectResult object. Most importantly, it logs the recommended curation steps that would make the data pass validation.

Note: you can use standardize() to standardize synonyms.

bt.Disease.inspect(data["disease"], field=bt.Disease.ontology_id);
! received 2 unique terms, 8 empty/duplicated terms are ignored
! 2 unique terms (100.00%) are not validated for ontology_id: MONDO:0004975, MONDO:0004980
   detected 2 Disease terms in Bionty for ontology_id: 'MONDO:0004975', 'MONDO:0004980'
→  add records from Bionty to your Disease registry via .from_values()
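
The log above can be read as three steps: deduplicate the input, find terms missing from the registry, and check whether those terms exist in the public reference. The following plain-Python sketch walks through these steps under simplifying assumptions. The function `inspect` and the dictionary it returns are hypothetical stand-ins, not the real InspectResult.

```python
def inspect(values, registry, public_reference):
    """Summarize which unique terms are not validated and which a public reference knows."""
    unique = sorted({v for v in values if v})  # drop empty and duplicated terms
    non_validated = [v for v in unique if v not in registry]
    found_in_public = [v for v in non_validated if v in public_reference]
    return {
        "n_unique": len(unique),
        "n_ignored": len(values) - len(unique),
        "non_validated": non_validated,
        "found_in_public": found_in_public,
    }


# Ten values drawn from two ontology IDs, as in the dataset above (fixed order here).
values = ["MONDO:0004975"] * 6 + ["MONDO:0004980"] * 4
registry = set()  # the Disease registry is still empty
public_reference = {"MONDO:0004975", "MONDO:0004980"}

report = inspect(values, registry, public_reference)
print(report)
```

With an empty registry, both unique terms are reported as not validated, and both are found in the public reference, which is why the log suggests from_values().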

Follow the suggestion and register the new labels. Bulk-creating records with from_values() returns only validated records:

Note: Terms that validate against a public reference are also created by from_values(); see Manage biological registries for details.

diseases = bt.Disease.from_values(data["disease"], field=bt.Disease.ontology_id)
ln.save(diseases)
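
The behavior described above, where only validated values yield records and a public-reference match counts as validated, can be sketched in plain Python. This is an illustration under assumptions, not LaminDB's code: `from_values` below is a local function, records are plain dictionaries, and "not-an-ontology-id" is a made-up value used to show what gets skipped.

```python
def from_values(values, registry, public_reference):
    """Return one record per unique value found in the registry or the public reference."""
    records = []
    seen = set()
    for value in values:
        if not value or value in seen:
            continue  # ignore empty and duplicated values
        seen.add(value)
        if value in registry:
            records.append(registry[value])  # load the existing record
        elif value in public_reference:
            records.append(dict(public_reference[value]))  # create from the public entry
        # values found in neither place are skipped
    return records


registry = {}  # nothing saved yet
public_reference = {
    "MONDO:0004975": {"ontology_id": "MONDO:0004975", "name": "Alzheimer disease"},
    "MONDO:0004980": {"ontology_id": "MONDO:0004980", "name": "atopic eczema"},
}
values = ["MONDO:0004975", "MONDO:0004980", "MONDO:0004975", "not-an-ontology-id"]

records = from_values(values, registry, public_reference)
names = [record["name"] for record in records]
print(names)
```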

Repeat the process for more labels:

projects = ln.ULabel.from_values(
    ["Project A", "Project B"], 
    field=ln.ULabel.name, 
    create=True, # create non-existing labels rather than attempting to load them from the database
)
ln.save(projects)
genes = bt.Gene.from_values(data["gene"], field=bt.Gene.ensembl_gene_id)
ln.save(genes)

Annotate and save dataset with validated metadata

Register the dataset as an artifact:

artifact = ln.Artifact("data.zarr", description="a zarr object").save()
! no run & transform got linked, call `ln.track()` & re-run

Link the artifact to validated labels. You could do this directly, e.g., via artifact.ulabels.add(projects) or artifact.diseases.add(diseases).

Often, however, you also want to record which feature each label is a value of. So let’s try to associate the labels with features:

from lamindb.core.exceptions import ValidationError

try:
    artifact.features.add_values({"project": projects, "disease": diseases})
except ValidationError as e:
    print(e)
! cannot infer feature type of: [ULabel(uid='MfdR4qxu', name='Project A', created_by_id=1, created_at=2024-11-11 14:18:13 UTC), ULabel(uid='VUNpe98C', name='Project B', created_by_id=1, created_at=2024-11-11 14:18:13 UTC)], returning '?
! cannot infer feature type of: [Disease(uid='4F2HPJ3w', name='Alzheimer disease', ontology_id='MONDO:0004975', synonyms='Alzheimers disease|Alzheimer's dementia|Alzheimer's disease|Alzheimers dementia|AD|presenile and senile dementia|Alzheimer dementia|Alzheimer disease', description='A Progressive, Neurodegenerative Disease Characterized By Loss Of Function And Death Of Nerve Cells In Several Areas Of The Brain Leading To Loss Of Cognitive Function Such As Memory And Language.', created_by_id=1, source_id=49, created_at=2024-11-11 14:18:13 UTC), Disease(uid='4JmTj6Sn', name='atopic eczema', ontology_id='MONDO:0004980', synonyms='allergic dermatitis|Atopic dermatitis|allergic form of dermatitis|Besnier's prurigo|Atopic neurodermatitis|eczema|allergic|atopic eczema|eczematous dermatitis', description='A Chronic Inflammatory Genetically Determined Disease Of The Skin Marked By Increased Ability To Form Reagin (Ige), With Increased Susceptibility To Allergic Rhinitis And Asthma, And Hereditary Disposition To A Lowered Threshold For Pruritus. It Is Manifested By Lichenification, Excoriation, And Crusting, Mainly On The Flexural Surfaces Of The Elbow And Knee. In Infants It Is Known As Infantile Eczema.', created_by_id=1, source_id=49, created_at=2024-11-11 14:18:13 UTC)], returning '?
These keys could not be validated: ['project', 'disease']
Here is how to create a feature:

  ln.Feature(name='project', dtype='?').save()
  ln.Feature(name='disease', dtype='?').save()

This fails because the features aren’t registered yet. Copy the suggested calls from the error message, replace the placeholder dtype '?' with the matching categorical dtype, and the call succeeds:

ln.Feature(name='project', dtype='cat[ULabel]').save()
ln.Feature(name='disease', dtype='cat[bionty.Disease]').save()
artifact.features.add_values({"project": projects, "disease": diseases})
artifact.features
  Features
    'disease' = 'Alzheimer disease', 'atopic eczema'
    'project' = 'Project A', 'Project B'
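
The failure and the subsequent success follow one rule: every key passed to add_values() must name a registered feature. A minimal plain-Python sketch of that rule is shown below. It assumes a dictionary as the feature registry, and the `ValidationError` class and `add_values` function are local stand-ins rather than the LaminDB objects of the same name.

```python
class ValidationError(Exception):
    """Local stand-in for lamindb's ValidationError."""


def add_values(registered_features, annotations, values):
    """Attach values to annotations only if every key is a registered feature."""
    missing = [key for key in values if key not in registered_features]
    if missing:
        raise ValidationError(f"These keys could not be validated: {missing}")
    annotations.update(values)


registered = {}  # feature name -> dtype
annotations = {}
new_values = {
    "project": ["Project A", "Project B"],
    "disease": ["Alzheimer disease", "atopic eczema"],
}

try:
    add_values(registered, annotations, new_values)
except ValidationError as e:
    first_error = str(e)
    print(first_error)

# Register the features with categorical dtypes, then retry.
registered["project"] = "cat[ULabel]"
registered["disease"] = "cat[bionty.Disease]"
add_values(registered, annotations, new_values)
print(annotations)
```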

Since genes are the measurements, we register them as features:

feature_set = ln.FeatureSet(genes)
feature_set.save()
artifact.features.add_feature_set(feature_set, slot="genes")
artifact.describe()
Artifact(uid='5RJ3R707BfL6oW8S0000', is_latest=True, description='a zarr object', suffix='.zarr', size=974, hash='JHntvZKnc4oE7QrJOBoh4A', n_objects=2, _hash_type='md5-d', visibility=1, _key_is_virtual=True, created_at=2024-11-11 14:18:17 UTC)
  Provenance
    .storage = '/home/runner/work/lamindb/lamindb/docs/test-curate-any'
    .created_by = 'testuser1'
  Labels
    .diseases = 'Alzheimer disease', 'atopic eczema'
    .ulabels = 'Project A', 'Project B'
  Features
    'disease' = 'Alzheimer disease', 'atopic eczema'
    'project' = 'Project A', 'Project B'
  Feature sets
    'genes' = 'BRCA2', 'TP53', 'KRAS', 'BRAF', 'PTEN', 'ESR1', 'ERBB2', 'PIK3C2B', 'EGFR', 'CCN2'
# clean up test instance
# delete the artifact first: `lamin delete` raises InstanceNotEmpty while the storage still contains objects
artifact.delete(permanent=True)
!lamin delete --force test-curate-any
!rm -r data.zarr