Query arrays

We saw how LaminDB lets you query and search across artifacts and collections using registries: Query & search registries.

Let us now look at the following case:

# get a lookup for labels
ulabels = ln.ULabel.lookup()
# query a parquet file labeled with "setosa" and load it as a DataFrame
df = ln.Artifact.filter(ulabels=ulabels.setosa, suffix=".parquet").first().load()
# select all observations (rows) in the DataFrame matching "setosa"
df_setosa = df.loc[df.iris_organism_name == ulabels.setosa.name]

Because the artifact was validated, querying the DataFrame is guaranteed to succeed!
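
As a minimal sketch of the row-wise filter, the following uses a small in-memory DataFrame as a hypothetical stand-in for the loaded artifact. The values and the `sepal_length` column are invented for illustration; only the `iris_organism_name` column name comes from the example above.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by .load()
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 4.9, 6.3],
        "iris_organism_name": ["setosa", "versicolor", "setosa", "virginica"],
    }
)

# Boolean mask with one entry per row (observation)
is_setosa = df["iris_organism_name"] == "setosa"

# Passing the mask as the row indexer keeps only the matching observations
df_setosa = df.loc[is_setosa]
print(df_setosa)
```

The mask has one entry per row, so it belongs in the row position of `.loc`; all columns are kept.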

Such within-collection queries are also possible for cloud-backed collections using DuckDB, TileDB, zarr, HDF5, parquet, and other storage backends.

For a use case with TileDB, see: CELLxGENE: scRNA-seq
For a use case with DuckDB, see: RxRx: cell imaging

In this notebook, we show how to subset AnnData objects and generic HDF5 and zarr collections accessed in the cloud.

import lamindb as ln

ln.settings.verbosity = "info"

We’ll need some test data:

ln.Artifact("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
ln.Artifact("s3://lamindb-ci/lndb-storage/testfile.hdf5").save()

! no run & transform got linked, call `ln.track()` & re-run

! record with similar root exists! did you mean to load it?

	uid	root	description	type	region	instance_uid	run_id	created_at	created_by_id
id
1	ZGQKyrDe6kM7	s3://lamindb-ci/test-array-notebook	None	s3	us-west-1	6BlTiS2HOWwo	None	2024-11-11 14:17:26.297195+00:00	1

! no run & transform got linked, call `ln.track()` & re-run

Artifact(uid='nc49KOnOJGhnMFqc0000', is_latest=True, key='lndb-storage/testfile.hdf5', suffix='.hdf5', size=1400, hash='UCWPjJkhzBjO97rtuo_8Yg', _hash_type='md5', visibility=1, _key_is_virtual=False, storage_id=2, created_by_id=1, created_at=2024-11-11 14:17:37 UTC)

Note that it is also possible to register Hugging Face paths. This requires the huggingface_hub package to be installed.

ln.Artifact("hf://datasets/Koncopd/lamindb-test/sharded_parquet").save()

AnnData

An h5ad artifact stored on s3:

artifact = ln.Artifact.get(key="lndb-storage/pbmc68k.h5ad")

artifact.path

S3Path('s3://lamindb-ci/lndb-storage/pbmc68k.h5ad')

adata = artifact.open()

This object is an AnnDataAccessor object, an AnnData object backed in the cloud:

adata

Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:

adata.X

You can subset it like a normal AnnData object:

obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset

Subsets load arrays into memory upon direct access:

adata_subset.X

To load the entire subset into memory as an actual AnnData object, use to_memory():

adata_subset.to_memory()
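
The boolean mask used for subsetting is an ordinary pandas Series with one entry per observation. The sketch below shows how such a mask is built and applied, using plain pandas and NumPy objects as hypothetical stand-ins for `adata.obs` and `adata.X`. The cell names and values are invented, and the backed accessor itself is not involved.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for adata.obs and adata.X (4 cells x 3 genes)
obs = pd.DataFrame(
    {
        "cell_type": ["Dendritic cells", "CD14+ Monocytes", "CD19+ B", "Dendritic cells"],
        "percent_mito": [0.01, 0.08, 0.02, 0.05],
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)
X = np.arange(12).reshape(4, 3)

# Each condition is evaluated per cell; & combines them element-wise
obs_idx = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)

# The same mask selects matching rows of the annotations and of the matrix
obs_subset = obs[obs_idx]
X_subset = X[obs_idx.to_numpy()]
print(obs_subset)
print(X_subset)
```

The parentheses around the second condition matter: `&` binds more tightly than `<=` in Python, so omitting them changes the expression.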

Generic HDF5

Let us query a generic HDF5 artifact:

artifact = ln.Artifact.get(key="lndb-storage/testfile.hdf5")

And get a backed accessor:

backed = artifact.open()

The returned object holds the connection in .connection and the h5py.File or zarr.Group in .storage:

backed

BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/lndb-storage/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5>" (mode r)>)

backed.storage

<HDF5 file "testfile.hdf5>" (mode r)>
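
Once you have an `h5py.File`, you can slice its datasets so that only the requested region is read. The sketch below illustrates this with a small local file written on the fly as a stand-in for the cloud artifact. The dataset name `measurements` and its contents are hypothetical, since the layout of `testfile.hdf5` is not shown here.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a small local HDF5 file as a stand-in for the cloud-backed artifact
path = os.path.join(tempfile.mkdtemp(), "example.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("measurements", data=np.arange(20).reshape(5, 4))

with h5py.File(path, "r") as storage:
    # List the top-level datasets and groups
    dataset_names = list(storage.keys())
    dset = storage["measurements"]
    # Slicing a dataset reads only that region into a NumPy array
    block = dset[1:3, :2]

print(dataset_names)
print(block)
```

With a backed accessor, the same slicing would presumably apply to `backed.storage[...]` when `.storage` is an `h5py.File`.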

Parquet

A DataFrame stored as sharded parquet:

artifact = ln.Artifact.get(key="sharded_parquet")

artifact.path.view_tree()

backed = artifact.open()

This returns a pyarrow dataset.

backed

<pyarrow._dataset.FileSystemDataset at 0x7fee77a5e2c0>

backed.head(5).to_pandas()

	cell_type	n_genes	percent_mito
index
CGTTATACAGTACC-8	CD4+/CD45RO+ Memory	1034	0.010163
AGATATTGACCACA-1	CD4+/CD45RO+ Memory	1078	0.012831
GCAGGGCTGTATGC-8	CD8+/CD45RA+ Naive Cytotoxic	1055	0.012287
TTATGGCTGGCAAG-2	CD4+/CD25 T Reg	1236	0.023963
CACGACCTGGGAGT-7	CD4+/CD25 T Reg	1010	0.016620