Tutorial: Features & labels¶

In Tutorial: Artifacts, we learned about provenance and data access. Here, we walk through validating & annotating datasets with features & labels to improve:

Findability: Which collections measured expression of cell marker CD14? Which characterized cell line K562? Which collections have a test & train split? Etc.
Usability: Are there typos in feature names? Are there typos in sampled labels? Are types and units of features consistent? Etc.

Hint

This is a low-level tutorial aimed at a basic understanding of registering features and labels for annotation & validation.

If you just want to validate and annotate a dataset with features and labels, see this guide: Curate datasets.

import lamindb as ln
import pandas as pd
import pytest

ln.settings.verbosity = "hint"

TLDR¶

Annotate by labels¶

# create a label
study0 = ln.ULabel(name="Study 0: initial plant gathering", description="My initial study").save()
# query an artifact from the previous tutorial
artifact = ln.Artifact.get(key="iris_studies/study0_raw_images")
# label the artifact
artifact.labels.add(study0)
# look at artifact metadata
artifact.describe()

Annotate by features¶

Features are buckets for labels, numbers and other data types.

Often, data that you want to ingest comes with metadata.

Here, three metadata features were collected: species, scientist, and instrument.

df = pd.read_csv(artifact.path / "meta.csv", index_col=0)
df.head()

	species	file_name	scientist	instrument
0	setosa	iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce...	Barbara McClintock	Leica IIIc Camera
1	versicolor	iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee710...	Edgar Anderson	Leica IIIc Camera
2	versicolor	iris-9ffe51c2abd973d25a299647fa9ccaf6aa9c8eecf...	Edgar Anderson	Leica IIIc Camera
3	setosa	iris-83f433381b755101b9fc9fbc9743e35fbb8a1a109...	Edgar Anderson	Leica IIIc Camera
4	virginica	iris-bdae8314e4385d8e2322abd8e63a82758a9063c77...	Edgar Anderson	Leica IIIc Camera

There are only a few values for features species, scientist & instrument, and we’d like to label the artifact with these values:

df.nunique()

species        3
file_name     50
scientist      2
instrument     1
dtype: int64

Let’s annotate the artifact with features & values and add a temperature measurement that Barbara & Edgar had forgotten in their csv:

with pytest.raises(ln.core.exceptions.ValidationError) as error:
    artifact.features.add_values({"species": df.species.unique(), "scientist": df.scientist.unique(), "instrument": df.instrument.unique(), "temperature": 27.6, "study": "Study 0: initial plant gathering"})
print(error.exconly())

None of these features or labels is registered yet, so validation fails and the error tells us to register them first:

ln.Feature(name='species', dtype='cat[ULabel]').save()
ln.Feature(name='scientist', dtype='cat[ULabel]').save()
ln.Feature(name='instrument', dtype='cat[ULabel]').save()
ln.Feature(name='study', dtype='cat[ULabel]').save()
ln.Feature(name='temperature', dtype='float').save()
species = ln.ULabel.from_values(df['species'].unique(), create=True)
ln.save(species)
authors = ln.ULabel.from_values(df['scientist'].unique(), create=True)
ln.save(authors)
instruments = ln.ULabel.from_values(df['instrument'].unique(), create=True)
ln.save(instruments)

Now everything works:

artifact.features.add_values({"species": df.species.unique(), "scientist": df.scientist.unique(), "instrument": df.instrument.unique(), "temperature": 27.6, "study": "Study 0: initial plant gathering"})
artifact.describe()

Because we also re-labeled with the study label 'Study 0: initial plant gathering', we see that it appears under the study feature.
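
To make the failure and the fix above concrete, here is a minimal pure-Python sketch of validate-then-annotate logic. It is not LaminDB's implementation: `Registry`, `validate_values`, and the local `ValidationError` are hypothetical names used only for illustration, and the check is reduced to "the feature name is registered" plus "every categorical value is a registered label".

```python
from __future__ import annotations

from dataclasses import dataclass, field


class ValidationError(Exception):
    """Raised when a feature or label is not registered (hypothetical stand-in)."""


@dataclass
class Registry:
    """Toy in-memory registry; NOT LaminDB's data model."""

    features: dict[str, str] = field(default_factory=dict)  # name -> dtype
    labels: set[str] = field(default_factory=set)

    def validate_values(self, values: dict[str, object]) -> None:
        problems = []
        for name, value in values.items():
            dtype = self.features.get(name)
            if dtype is None:
                problems.append(f"feature not registered: {name!r}")
                continue
            if dtype.startswith("cat"):
                # Categorical features accept one label or a list of labels.
                items = value if isinstance(value, (list, tuple)) else [value]
                for item in items:
                    if item not in self.labels:
                        problems.append(f"label not registered: {item!r}")
            elif dtype == "float" and not isinstance(value, (int, float)):
                problems.append(f"{name!r} expects a number, got {value!r}")
        if problems:
            raise ValidationError("; ".join(problems))


values = {"species": ["setosa", "versicolor", "virginica"], "temperature": 27.6}

registry = Registry()
try:
    registry.validate_values(values)  # nothing is registered yet
    first_attempt_failed = False
except ValidationError as error:
    first_attempt_failed = True
    print(error)

# Register features and labels, then validate again.
registry.features.update({"species": "cat[ULabel]", "temperature": "float"})
registry.labels.update(["setosa", "versicolor", "virginica"])
registry.validate_values(values)  # passes without raising
```

The same check is what catches typos: a misspelled feature name or label is simply not in the registry, so it is reported instead of being silently stored.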

Retrieve features¶

artifact.features.get_values()

Query by features¶

artifact = ln.Artifact.features.get(temperature=27.6)
artifact

Artifact(uid='kpRCWE2uuba1BoFZ0000', is_latest=True, key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', n_objects=51, _hash_type='md5-d', visibility=1, _key_is_virtual=False, storage_id=2, transform_id=1, run_id=1, created_by_id=1, created_at=2024-11-11 14:15:39 UTC)

Register metadata¶

Features and labels are the primary ways of registering domain-knowledge related metadata in LaminDB.

Features represent measurement dimensions (e.g. "species") and labels represent measured values (e.g. "iris setosa", "iris versicolor", "iris virginica").

In statistics, you’d say a feature is a categorical or numerical variable while a label is a simple category. Categorical variables draw their values from a set of categories.

Register labels¶

We study 3 species of the Iris plant: setosa, versicolor & virginica. We already created one ULabel for each of them above.

ULabel lets you maintain an in-house ontology for all kinds of generic labels.

Anticipating that we’ll have many different labels when working with more data, we’d like to express that all 3 labels are species labels:

is_species = ln.ULabel(name="is_species").save()
is_species.children.set(species)
is_species.view_parents(with_children=True)
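
The parent-child relationship set up above can be pictured as a small directed graph. The sketch below is a pure-Python illustration under that assumption; `Label`, `add_children`, and `descendants` are hypothetical names and not part of LaminDB.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Label:
    """Toy label node with links to parents and children (illustrative only)."""

    name: str
    # Exclude link fields from repr/eq so cyclic references stay harmless.
    parents: list[Label] = field(default_factory=list, repr=False, compare=False)
    children: list[Label] = field(default_factory=list, repr=False, compare=False)

    def add_children(self, *labels: Label) -> None:
        """Link each label as a child of this one, in both directions."""
        for label in labels:
            self.children.append(label)
            label.parents.append(self)

    def descendants(self) -> list[str]:
        """Return the names of all labels below this one, depth-first."""
        names = []
        for child in self.children:
            names.append(child.name)
            names.extend(child.descendants())
        return names


species = [Label("setosa"), Label("versicolor"), Label("virginica")]
is_species = Label("is_species")
is_species.add_children(*species)

print(is_species.descendants())

# Select labels by the name of their parent, similar in spirit to
# filtering on a parent's name in a query.
species_names = [
    label.name
    for label in species
    if any(parent.name == "is_species" for parent in label.parents)
]
print(species_names)
```

Grouping labels under a parent keeps a flat list of names queryable by type once many unrelated labels accumulate.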

Query artifacts by labels¶

Using the new annotations, you can now query image artifacts by species & study labels:

ln.ULabel.df()

	uid	name	description	reference	reference_type	run_id	created_at	created_by_id
id
8	P6mR2iO9	is_species	None	None	None	None	2024-11-11 14:15:50.802767+00:00	1
7	AEu7Xefw	Leica IIIc Camera	None	None	None	None	2024-11-11 14:15:50.650212+00:00	1
6	LSbiAn3Y	Edgar Anderson	None	None	None	None	2024-11-11 14:15:50.645175+00:00	1
5	pv5xuWka	Barbara McClintock	None	None	None	None	2024-11-11 14:15:50.645081+00:00	1
4	G3cOjCNi	virginica	None	None	None	None	2024-11-11 14:15:50.637919+00:00	1
3	0Bes6KfA	versicolor	None	None	None	None	2024-11-11 14:15:50.637846+00:00	1
2	dHeUFSNg	setosa	None	None	None	None	2024-11-11 14:15:50.637727+00:00	1
1	4V9zyRou	Study 0: initial plant gathering	My initial study	None	None	None	2024-11-11 14:15:49.669640+00:00	1

ulabels = ln.ULabel.lookup()
ln.Artifact.get(ulabels=ulabels.study_0_initial_plant_gathering)
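
Note that the lookup object exposes the label "Study 0: initial plant gathering" as the attribute `study_0_initial_plant_gathering`. One plausible way to derive such attribute names is sketched below. The exact normalization rules LaminDB applies are an assumption here, and `to_attribute_name` and `build_lookup` are hypothetical helpers.

```python
import re
from types import SimpleNamespace


def to_attribute_name(name: str) -> str:
    """Lowercase a label name and collapse non-alphanumeric runs into '_'."""
    attr = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
    # Python attribute names cannot start with a digit.
    return f"_{attr}" if attr[:1].isdigit() else attr


def build_lookup(names: list[str]) -> SimpleNamespace:
    """Map normalized attribute names back to the original label names."""
    return SimpleNamespace(**{to_attribute_name(n): n for n in names})


lookup = build_lookup(
    ["Study 0: initial plant gathering", "Leica IIIc Camera", "setosa"]
)
print(lookup.study_0_initial_plant_gathering)
```

Attribute-style access like this mainly helps with auto-completion in an interactive session, so you don't have to retype long label names.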

Run an ML model¶

Let’s now run a mock ML model that transforms the images into 4 high-level features.

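def run_ml_model() -> pd.DataFrame:
    # download the raw images to the local cache; the mock model doesn't read them
    image_file_dir = artifact.cache()
    # stand in for real inference by loading a precomputed example dataframe
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data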

transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
ln.context.track(transform=transform)
df = run_ml_model()

The output is a dataframe:

df.head()

	sepal_length	sepal_width	petal_length	petal_width	iris_organism_name
0	0.051	0.035	0.014	0.002	setosa
1	0.049	0.030	0.014	0.002	setosa
2	0.047	0.032	0.013	0.002	setosa
3	0.046	0.031	0.015	0.002	setosa
4	0.050	0.036	0.014	0.002	setosa

And this is the pipeline that produced the dataframe:

ln.context.transform.view_lineage()

Register the output data¶

Let’s first register the features of the transformed data:

new_features = ln.Feature.from_df(df)
ln.save(new_features)
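
The feature table further down shows that the four measurement columns end up with dtype `float` and `iris_organism_name` with dtype `cat`. A rough sketch of that kind of per-column inference, in plain Python rather than pandas, might look like this. `infer_dtype` is a hypothetical helper, and the actual rules used by `Feature.from_df` may differ.

```python
def infer_dtype(values: list) -> str:
    """Classify a column as 'float' if every value is a real number, else 'cat'."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "float" if is_number else "cat"


# First rows of the model output shown above, stored column-wise.
columns = {
    "sepal_length": [0.051, 0.049, 0.047, 0.046, 0.050],
    "sepal_width": [0.035, 0.030, 0.032, 0.031, 0.036],
    "petal_length": [0.014, 0.014, 0.013, 0.015, 0.014],
    "petal_width": [0.002, 0.002, 0.002, 0.002, 0.002],
    "iris_organism_name": ["setosa"] * 5,
}

inferred = {name: infer_dtype(values) for name, values in columns.items()}
print(inferred)
```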

We can now validate & register the dataframe in one line:

artifact = ln.Artifact.from_df(
    df,
    description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()

There is one categorical feature, so let’s add the species labels:

features = ln.Feature.lookup()

species_labels = ln.ULabel.filter(parents__name="is_species").all()
artifact.labels.add(species_labels, feature=features.species)

species_labels

<QuerySet [ULabel(uid='dHeUFSNg', name='setosa', created_by_id=1, created_at=2024-11-11 14:15:50 UTC), ULabel(uid='0Bes6KfA', name='versicolor', created_by_id=1, created_at=2024-11-11 14:15:50 UTC), ULabel(uid='G3cOjCNi', name='virginica', created_by_id=1, created_at=2024-11-11 14:15:50 UTC)]>

Let’s now add study labels:

artifact.labels.add(ulabels.study_0_initial_plant_gathering, feature=features.study)

This is the context for our artifact:

artifact.describe()
artifact.view_lineage()

See the database content:

ln.view(registries=["Feature", "ULabel"])

Feature

	uid	name	dtype	unit	description	synonyms	run_id	created_at	created_by_id
id
10	7BEjImxavomc	iris_organism_name	cat	None	None	None	2.0	2024-11-11 14:15:53.054220+00:00	1
9	CWh7vL9IlQif	petal_width	float	None	None	None	2.0	2024-11-11 14:15:53.054142+00:00	1
8	COc7Gikjrqx6	petal_length	float	None	None	None	2.0	2024-11-11 14:15:53.054065+00:00	1
7	VGt6FMK48cia	sepal_width	float	None	None	None	2.0	2024-11-11 14:15:53.053984+00:00	1
6	Nu8vHqJDMKZZ	sepal_length	float	None	None	None	2.0	2024-11-11 14:15:53.053856+00:00	1
5	fJW5YwHVvOJ2	temperature	float	None	None	None	NaN	2024-11-11 14:15:50.628672+00:00	1
4	y1obPXFSSV0B	study	cat[ULabel]	None	None	None	NaN	2024-11-11 14:15:50.624406+00:00	1

ULabel

	uid	name	description	reference	reference_type	run_id	created_at	created_by_id
id
8	P6mR2iO9	is_species	None	None	None	None	2024-11-11 14:15:50.802767+00:00	1
7	AEu7Xefw	Leica IIIc Camera	None	None	None	None	2024-11-11 14:15:50.650212+00:00	1
6	LSbiAn3Y	Edgar Anderson	None	None	None	None	2024-11-11 14:15:50.645175+00:00	1
5	pv5xuWka	Barbara McClintock	None	None	None	None	2024-11-11 14:15:50.645081+00:00	1
4	G3cOjCNi	virginica	None	None	None	None	2024-11-11 14:15:50.637919+00:00	1
3	0Bes6KfA	versicolor	None	None	None	None	2024-11-11 14:15:50.637846+00:00	1
2	dHeUFSNg	setosa	None	None	None	None	2024-11-11 14:15:50.637727+00:00	1

This is it! 😅

If you’re interested, please check out guides & use cases or open an issue on GitHub to discuss.

Appendix¶

Manage metadata¶

Avoid duplicates¶

Let’s create a label "project1":

ln.ULabel(name="project1").save()

We just created a project1 label. Let’s see what happens if we try to create it again:

label = ln.ULabel(name="project1")
label.save()

Instead of creating a new record, LaminDB loads and returns the existing record from the database.

If there is no exact match, LaminDB warns you about potential duplicates when you create a record.

Say we spell “project 1” with a space:

ln.ULabel(name="project 1")

! record with similar name exists! did you mean to load it?

	uid	name	description	reference	reference_type	run_id	created_at	created_by_id
id
9	56NWIBTQ	project1	None	None	None	2	2024-11-11 14:15:53.286769+00:00	1

ULabel(uid='YsHzSblJ', name='project 1', created_by_id=1, run_id=2)

To avoid inserting duplicates, LaminDB searches for existing records with a similar name whenever you create a new record.

You can switch it off for performance gains via search_names.
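
As a rough illustration of this kind of similarity check, the sketch below uses `difflib.get_close_matches` from the Python standard library. This is a swapped-in technique for illustration only: LaminDB's own search is not shown in this tutorial and may work differently. `create_label`, `existing_names`, and the `search_names` parameter of this function are hypothetical.

```python
from difflib import get_close_matches

existing_names = ["project1", "setosa", "versicolor", "virginica"]


def create_label(name: str, search_names: bool = True) -> list[str]:
    """Register `name` unless it already exists; return similar names that were flagged."""
    if name in existing_names:
        return []  # exact match: reuse the existing record, nothing to warn about
    similar = []
    if search_names:
        # Compare the new name against all existing names (the costly step).
        similar = get_close_matches(name, existing_names, n=3, cutoff=0.8)
    if similar:
        print(f"! record with similar name exists: {similar}")
    existing_names.append(name)
    return similar


similar = create_label("project 1")
```

Skipping the comparison avoids scanning all existing names on every creation, which is the trade-off behind switching the search off: faster inserts, but near-duplicates pass unnoticed.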

Update & delete records¶

label = ln.ULabel.filter(name="project1").first()
label

label.name = "project1a"
label.save()
label

label.delete()

Manage storage¶

Change default storage¶

The default storage location is:

ln.settings.storage

You can change it by setting ln.settings.storage = "s3://my-bucket".

See all storage locations¶

ln.Storage.df()

	uid	root	description	type	region	instance_uid	run_id	created_at	created_by_id
id
2	vmmXDGVVv2OV	s3://lamindata	None	s3	us-east-1	None	None	2024-11-11 14:15:39.611733+00:00	1
1	4lRZgk4RjK4W	/home/runner/work/lamindb/lamindb/docs/lamin-t...	None	local	None	5WuFt3cW4zRx	None	2024-11-11 14:15:30.639158+00:00	1