Query arrays¶
We saw how LaminDB lets you query & search across artifacts & collections using registries: Query & search registries.
Let us now look at the following case:
# get a lookup for labels
ulabels = ln.ULabel.lookup()
# query a parquet file annotated with the label "setosa"
df = ln.Artifact.filter(ulabels=ulabels.setosa, suffix=".parquet").first().load()
# select all observations (rows) in the DataFrame matching "setosa"
df_setosa = df.loc[df.iris_organism_name == ulabels.setosa.name]
Because the artifact was validated, querying the DataFrame is guaranteed to succeed!
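To make the row selection concrete, here is a minimal sketch on a toy DataFrame. The values are invented for illustration; only the column name `iris_organism_name` is taken from the example above, and pandas is assumed to be installed.

```python
import pandas as pd

# Toy stand-in for the loaded artifact; the values are invented for illustration.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 4.9, 6.3],
        "iris_organism_name": ["setosa", "versicolor", "setosa", "virginica"],
    }
)

# A boolean mask with one entry per row (observation).
mask = df["iris_organism_name"] == "setosa"

# Passing the mask as the first indexer of .loc selects rows.
df_setosa = df.loc[mask]
print(df_setosa)
```

The mask is aligned with the row index, so it belongs in the row position of `.loc`; the column position expects column labels or a mask over columns.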
Such within-collection queries are also possible for cloud-backed collections using DuckDB, TileDB, zarr, HDF5, parquet, and other storage backends.
For a use case with TileDB, see: CELLxGENE: scRNA-seq
For a use case with DuckDB, see: RxRx: cell imaging
In this notebook, we show how to subset AnnData, generic HDF5, and zarr collections accessed in the cloud.
!lamin login testuser1
!lamin init --storage s3://lamindb-ci/test-array-notebook --name test-array-notebook
✓ logged in with email testuser1@lamin.ai (uid: DzTjkKse)
→ go to: https://lamin.ai/testuser1/test-array-notebook
! updating cloud SQLite 's3://lamindb-ci/test-array-notebook/58eab9b6d7965975a7dc17a4bcbc5306.lndb' of instance 'testuser1/test-array-notebook'
→ connected lamindb: testuser1/test-array-notebook
! locked instance (to unlock and push changes to the cloud SQLite file, call: lamin disconnect)
import lamindb as ln
→ connected lamindb: testuser1/test-array-notebook
ln.settings.verbosity = "info"
We’ll need some test data:
ln.Artifact("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
ln.Artifact("s3://lamindb-ci/lndb-storage/testfile.hdf5").save()
! no run & transform got linked, call `ln.track()` & re-run
! record with similar root exists! did you mean to load it?
id | uid | root | description | type | region | instance_uid | run_id | created_at | created_by_id
---|---|---|---|---|---|---|---|---|---
1 | ZGQKyrDe6kM7 | s3://lamindb-ci/test-array-notebook | None | s3 | us-west-1 | 6BlTiS2HOWwo | None | 2024-11-11 14:17:26.297195+00:00 | 1
! no run & transform got linked, call `ln.track()` & re-run
Artifact(uid='nc49KOnOJGhnMFqc0000', is_latest=True, key='lndb-storage/testfile.hdf5', suffix='.hdf5', size=1400, hash='UCWPjJkhzBjO97rtuo_8Yg', _hash_type='md5', visibility=1, _key_is_virtual=False, storage_id=2, created_by_id=1, created_at=2024-11-11 14:17:37 UTC)
Note that it is also possible to register Hugging Face paths. For this, the huggingface_hub package must be installed.
ln.Artifact("hf://datasets/Koncopd/lamindb-test/sharded_parquet").save()
/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
! no run & transform got linked, call `ln.track()` & re-run
/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/upath/core.py:170: UserWarning: UPath 'hf' filesystem not explicitly implemented. Falling back to default implementation. This filesystem may not be tested.
upath_cls = get_upath_class(protocol=pth_protocol)
! will manage storage location hf://datasets/Koncopd/lamindb-test with instance testuser1/test-array-notebook
→ due to lack of write access, LaminDB won't manage storage location: hf://datasets/Koncopd/lamindb-test
→ deleted storage record on hub e82908a3045a5fecadfe01b36107a2e4
Artifact(uid='JAX10G66pGwMBZBS0000', is_latest=True, key='sharded_parquet', suffix='', size=42767, hash='oj6I3nNKj_eiX2I1q26qaw', n_objects=11, _hash_type='md5-d', visibility=1, _key_is_virtual=False, storage_id=3, created_by_id=1, created_at=2024-11-11 14:17:41 UTC)
AnnData¶
An h5ad artifact stored on s3:
artifact = ln.Artifact.get(key="lndb-storage/pbmc68k.h5ad")
artifact.path
S3Path('s3://lamindb-ci/lndb-storage/pbmc68k.h5ad')
adata = artifact.open()
! run input wasn't tracked, call `ln.track()` and re-run
This object is an AnnDataAccessor object, an AnnData object backed in the cloud:
adata
AnnDataAccessor object with n_obs × n_vars = 70 × 765
constructed for the AnnData object pbmc68k.h5ad
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:
adata.X
<HDF5 dataset "X": shape (70, 765), type "<f4">
You can subset it like a normal AnnData object:
obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
Subsets load arrays into memory upon direct access:
adata_subset.X
array([[-0.326, -0.191, 0.499, ..., -0.21 , -0.636, -0.49 ],
[ 0.811, -0.191, -0.728, ..., -0.21 , 0.604, -0.49 ],
[-0.326, -0.191, 0.643, ..., -0.21 , 2.303, -0.49 ],
...,
[-0.326, -0.191, -0.728, ..., -0.21 , 0.626, -0.49 ],
[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],
dtype=float32)
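The behaviour shown above, where the full object only references the data and a subset reads it on access, can be illustrated with a small pure-Python sketch. `LazyMatrix` and its `rows_read` counter are hypothetical names invented for this illustration; this is not how AnnDataAccessor is implemented.

```python
class LazyMatrix:
    """Hypothetical stand-in for a backed array: rows are read only on demand."""

    def __init__(self, rows):
        self._rows = rows   # pretend this lives in remote storage
        self.rows_read = 0  # counts how many rows were actually fetched

    @property
    def shape(self):
        # Metadata is available without reading any row.
        return (len(self._rows), len(self._rows[0]) if self._rows else 0)

    def __getitem__(self, mask):
        # A boolean mask selects rows; only the selected rows are "fetched".
        selected = [row for row, keep in zip(self._rows, mask) if keep]
        self.rows_read += len(selected)
        return selected


X = LazyMatrix([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
print(X.shape)  # no rows have been read at this point
subset = X[[True, False, True, False]]
print(subset, X.rows_read)
```

The point of the pattern is that the cost of a query scales with the size of the subset rather than with the size of the stored array.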
To load the entire subset into memory as an actual AnnData object, use to_memory():
adata_subset.to_memory()
AnnData object with n_obs × n_vars = 35 × 765
obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
var: 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
Generic HDF5¶
Let us query a generic HDF5 artifact:
artifact = ln.Artifact.get(key="lndb-storage/testfile.hdf5")
And get a backed accessor:
backed = artifact.open()
! run input wasn't tracked, call `ln.track()` and re-run
The returned object holds the connection in .connection and an h5py.File or zarr.Group in .storage:
backed
BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/lndb-storage/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5>" (mode r)>)
backed.storage
<HDF5 file "testfile.hdf5>" (mode r)>
Parquet¶
A dataframe stored as sharded parquet:
artifact = ln.Artifact.get(key="sharded_parquet")
artifact.path.view_tree()
11 sub-directories & 11 files with suffixes '.parquet'
hf://datasets/Koncopd/lamindb-test/sharded_parquet
├── louvain=0/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=1/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=10/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=2/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=3/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=4/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=5/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=6/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=7/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=8/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
└── louvain=9/
└── 947eee0b064440c9b9910ca2eb89e608-0.parquet
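The directory names in the tree above follow the `key=value` layout commonly called hive-style partitioning, so the `louvain` value of each shard can be read from its path without opening the file. Below is a minimal sketch using only the standard library; the helper name `partition_values` is hypothetical, and the two paths are copied from the tree above.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict:
    """Collect key=value pairs from the directory components of a path."""
    values = {}
    for part in PurePosixPath(path).parent.parts:
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = value
    return values


shards = [
    "sharded_parquet/louvain=0/947eee0b064440c9b9910ca2eb89e608-0.parquet",
    "sharded_parquet/louvain=10/947eee0b064440c9b9910ca2eb89e608-0.parquet",
]

# Keep only the shards of one cluster without reading any file content.
wanted = [s for s in shards if partition_values(s).get("louvain") == "10"]
print(wanted)
```

Partition values parsed this way are strings; a reader that understands the layout would typically cast them to the column's type.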
backed = artifact.open()
! run input wasn't tracked, call `ln.track()` and re-run
This returns a pyarrow dataset.
backed
<pyarrow._dataset.FileSystemDataset at 0x7fee77a5e2c0>
backed.head(5).to_pandas()
index | cell_type | n_genes | percent_mito
---|---|---|---
CGTTATACAGTACC-8 | CD4+/CD45RO+ Memory | 1034 | 0.010163
AGATATTGACCACA-1 | CD4+/CD45RO+ Memory | 1078 | 0.012831
GCAGGGCTGTATGC-8 | CD8+/CD45RA+ Naive Cytotoxic | 1055 | 0.012287
TTATGGCTGGCAAG-2 | CD4+/CD25 T Reg | 1236 | 0.023963
CACGACCTGGGAGT-7 | CD4+/CD25 T Reg | 1010 | 0.016620
# clean up test instance
!lamin delete --force test-array-notebook
• deleting instance testuser1/test-array-notebook
→ deleted storage record on hub e0641645e20f57989a1a3e3364b9e548
→ deleted instance record on hub 58eab9b6d7965975a7dc17a4bcbc5306