View on Hugging Face: source dataset card and downloadable files for lance-format/ade20k-lance.
A Lance-format conversion of 1aurent/ADE20K. Each row is one scene image with its inline JPEG bytes, a per-pixel semantic segmentation map encoded as PNG bytes, an optional instance map, scene class labels, the full per-polygon object-name list, an OpenCLIP image embedding, and pre-built indices, all available directly from the Hub at hf://datasets/lance-format/ade20k-lance/data.
Key features
- Inline image and segmentation bytes: both the JPEG image and the RGB-encoded PNG segmentation map ride on the same row, so an annotated example is a single row read with no sidecar files.
- Per-polygon object metadata: object_names keeps the full list (one entry per annotated polygon), objects_present is the deduped set used for class-presence filters, and num_objects is precomputed.
- CLIP image embeddings (image_emb, OpenCLIP ViT-B/32, 512-d, cosine-normalized) for visual retrieval over scenes.
- Indices shipped on disk: IVF_PQ on image_emb, BTREE on num_objects, and LABEL_LIST on objects_present for fast array_has_any / array_has_all predicates.
Splits
| Split | Rows |
|---|---|
| train.lance | 25,574 |
| validation.lance | 2,000 |
Schema
| Column | Type | Notes |
|---|---|---|
| id | int64 | Row index within split |
| image | large_binary | Inline JPEG bytes |
| segmentation | large_binary | Inline PNG bytes: semantic segmentation map (RGB encoding per ADE20K spec) |
| instance | large_binary? | Inline PNG bytes: instance map; null if not provided |
| filename | string | ADE20K relative filename |
| scene | list<string> | Scene class labels (e.g. ["bathroom"]) |
| object_names | list<string> | Per-polygon object names (one entry per polygon, not deduped) |
| objects_present | list<string> | Deduped object names; feeds the LABEL_LIST index |
| num_objects | int32 | Number of annotated objects |
| image_emb | fixed_size_list<float32, 512> | OpenCLIP ViT-B/32 image embedding (cosine-normalized) |
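As a rough illustration of what "RGB encoding per ADE20K spec" means for the segmentation column: in the original ADE20K release, the class index of a pixel is packed into the red and green channels. The sketch below assumes that convention carries over unchanged to this conversion, and the helper name is hypothetical, so check it against a known image before relying on it.

```python
def ade20k_class_index(r: int, g: int) -> int:
    """Recover the class index from one segmentation pixel.

    Assumption: the original ADE20K convention, class = (R // 10) * 256 + G.
    """
    return (r // 10) * 256 + g


# A pixel with R=10 and G=5 decodes to class index 261.
print(ade20k_class_index(10, 5))
```

To apply it to a real row, decode the PNG bytes first (for example with Pillow) and run the same arithmetic over the R and G channels.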
Pre-built indices
- IVF_PQ on image_emb: vector similarity search (cosine)
- BTREE on num_objects: fast range filters on scene complexity
- LABEL_LIST on objects_present: supports array_has_any / array_has_all for class-presence filtering
Why Lance?
- Blazing Fast Random Access: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation.
- Native Multimodal Support: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search.
- Native Index Support: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them.
- Efficient Data Evolution: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time.
- Versatile Querying: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes.
- Data Versioning: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history.
Load with datasets.load_dataset
You can load Lance datasets via the standard HuggingFace datasets interface, suitable when your pipeline already speaks Dataset / IterableDataset or you want a quick streaming sample without installing anything Lance-specific.
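A minimal sketch of the streaming path described above. It assumes a datasets release recent enough to read Lance repositories and that the split is exposed under the name train; the helper name stream_sample is hypothetical.

```python
REPO_ID = "lance-format/ade20k-lance"


def stream_sample(n: int = 3):
    """Yield the first n rows of the train split without a full download."""
    # Imported lazily so the module loads even when `datasets` is absent.
    from datasets import load_dataset

    # streaming=True returns an IterableDataset that fetches rows on demand.
    ds = load_dataset(REPO_ID, split="train", streaming=True)
    for row in ds.take(n):
        yield row
```

Iterating over stream_sample(3) and printing row["filename"] and row["scene"] is a quick way to confirm the schema matches the table above.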
Load with LanceDB
LanceDB is the embedded retrieval library built on top of the Lance format, and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below.
Load with Lance
pylance is the Python binding for the Lance format and works directly with the format’s lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices.
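The two entry points can be sketched side by side. The snippet assumes that each split is stored as a directory named after the split with a .lance suffix under the data root (as the Splits table suggests), and that lancedb and pylance are installed; the function names are hypothetical.

```python
HUB_URI = "hf://datasets/lance-format/ade20k-lance/data"


def split_uri(split: str, root: str = HUB_URI) -> str:
    """Path of one split's Lance dataset, e.g. <root>/train.lance."""
    return f"{root.rstrip('/')}/{split}.lance"


def open_with_lancedb(split: str = "train", root: str = HUB_URI):
    """Open a split as a LanceDB table (search and filter builders)."""
    import lancedb  # lazy import: only needed when actually connecting

    return lancedb.connect(root).open_table(split)


def open_with_lance(split: str = "train", root: str = HUB_URI):
    """Open a split with pylance to inspect schema, fragments, indices."""
    import lance  # lazy import, provided by the `pylance` package

    return lance.dataset(split_uri(split, root))
```

With these in place, open_with_lancedb("train").count_rows() reports the row count, while open_with_lance("train").schema and .list_indices() expose the schema and the pre-built indices.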
Tip: for production use, download locally first. Streaming from the Hub works for exploration, but heavy random access, ANN search, and any mutation are far faster against a local copy. After downloading (for example with hf download), point Lance or LanceDB at ./ade20k-lance/data.
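If you prefer to stay in Python, the download can be sketched with huggingface_hub.snapshot_download, which fetches the full repository. The local directory name mirrors the path used in this card; the helper names are hypothetical.

```python
REPO_ID = "lance-format/ade20k-lance"


def local_data_path(local_dir: str = "./ade20k-lance") -> str:
    """Directory to hand to Lance or LanceDB after the download."""
    return f"{local_dir.rstrip('/')}/data"


def download_dataset(local_dir: str = "./ade20k-lance") -> str:
    """Download the whole dataset repo and return its local data path."""
    from huggingface_hub import snapshot_download  # lazy import

    snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=local_dir)
    return local_data_path(local_dir)
```

The returned path can be passed to lancedb.connect(...) in place of the hf:// URI.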
Search
The bundled IVF_PQ index on image_emb makes approximate-nearest-neighbor scene retrieval a single call. In production you would encode a query image through the same OpenCLIP ViT-B/32 model used at ingest and pass the resulting 512-d vector to tbl.search(...). The example below uses the embedding stored on row 42 as a runnable stand-in, so the snippet works without loading any model.
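A minimal sketch of that retrieval, assuming tbl is a LanceDB table opened on the train split. The function name, the projected columns, and the default k are illustrative choices rather than part of the dataset.

```python
def nearest_scenes(tbl, seed_id=42, k=10, nprobes=None, refine_factor=None):
    """Return the k scenes closest to the stored embedding of row seed_id."""
    # Reuse a stored embedding as the query vector, standing in for a real
    # OpenCLIP ViT-B/32 encoding of a query image.
    seed = (
        tbl.search()
        .where(f"id = {int(seed_id)}")
        .select(["image_emb"])
        .limit(1)
        .to_list()[0]
    )
    query = (
        tbl.search(seed["image_emb"])
        .metric("cosine")
        .select(["id", "filename", "scene", "num_objects"])
        .limit(k)
    )
    # Optional knobs: more probes or a refine pass raise recall at the
    # cost of latency.
    if nprobes is not None:
        query = query.nprobes(nprobes)
    if refine_factor is not None:
        query = query.refine_factor(refine_factor)
    return query.to_list()
```

The image and segmentation blobs are left out of the projection so that results stay small; fetch them by id once you know which rows you want.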
Tune nprobes and refine_factor to trade recall against latency for your workload.
Curate
Curation for a semantic-segmentation workflow usually means picking scenes that contain specific classes, possibly bounded by complexity. The LABEL_LIST index on objects_present makes class-presence predicates trivial, and Lance evaluates them inside the same scan as a structural filter on num_objects. The bounded .limit(500) keeps the result small and inspectable, and the segmentation blob is left out of the projection so the candidate scan is dominated by metadata, not PNG bytes.
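One way to sketch such a curation pass is shown below. The class names, the object-count threshold, and both helper names are hypothetical; tbl is assumed to be a LanceDB table for the split.

```python
def presence_filter(classes, min_objects=None, mode="all"):
    """Build a SQL predicate over objects_present (and optionally num_objects)."""
    fn = {"all": "array_has_all", "any": "array_has_any"}[mode]
    # Escape single quotes so class names are safe inside SQL string literals.
    quoted = ", ".join("'" + c.replace("'", "''") + "'" for c in classes)
    predicate = f"{fn}(objects_present, [{quoted}])"
    if min_objects is not None:
        predicate += f" AND num_objects >= {int(min_objects)}"
    return predicate


def curate(tbl, classes, min_objects=None, limit=500):
    """Return up to `limit` metadata rows for scenes containing the classes."""
    return (
        tbl.search()
        .where(presence_filter(classes, min_objects))
        .select(["id", "filename", "scene", "objects_present", "num_objects"])
        .limit(limit)
        .to_list()
    )
```

For example, presence_filter(["person", "car"], 10) yields a predicate that requires both classes and at least ten annotated objects.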
The resulting rows can be inspected directly, saved as a list of ids, or fed into the Evolve and Train workflows below. Swapping array_has_all for array_has_any widens the recall; replacing the structural predicate with num_objects BETWEEN 3 AND 6 selects simpler scenes for an ablation slice.
Evolve
Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds a has_person flag and a scene_label string pulled out of the scene list, either of which can then be used directly in where clauses without recomputing the predicate on every query.
Note: Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use hf download to pull the full corpus first.
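A hedged sketch of the SQL-expression form of Evolve, run against a local copy. The two expressions are assumptions: array_has_any is the predicate this card already relies on, while array_element(scene, 1) (first element, 1-based) may or may not be accepted depending on your Lance version, so adjust it if the call is rejected.

```python
# New column name -> SQL expression over existing columns.
NEW_COLUMNS = {
    "has_person": "array_has_any(objects_present, ['person'])",
    # Assumption: 1-based element access is supported by the SQL dialect.
    "scene_label": "array_element(scene, 1)",
}


def evolve(local_uri: str = "./ade20k-lance/data", split: str = "train"):
    """Append the derived columns to a local, writable copy of the split."""
    import lancedb  # lazy import

    tbl = lancedb.connect(local_uri).open_table(split)
    tbl.add_columns(NEW_COLUMNS)
    return tbl
```

After this runs, a filter such as has_person = true can be used in where clauses directly.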
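Values that already live in another table, such as labels or model predictions, can be merged in by joining on id.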
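Attaching externally computed values by joining on id might look like the sketch below. It assumes the table's merge method performs a left join on the named column; the prediction values and column name pred_scene are made up purely for illustration.

```python
def example_predictions():
    """Build a tiny, hypothetical predictions table keyed by id."""
    import pyarrow as pa  # lazy import

    return pa.table(
        {"id": [0, 1, 2], "pred_scene": ["bathroom", "street", "kitchen"]}
    )


def merge_by_id(tbl, extra):
    """Left-join the columns of `extra` onto `tbl` using the shared id column."""
    tbl.merge(extra, left_on="id")
    return tbl
```

Like every mutation, this needs a local, writable table, and it commits a new dataset version.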
Train
Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through Permutation.identity(tbl).select_columns([...]), which plugs straight into the standard torch.utils.data.DataLoader so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For a semantic-segmentation run, project the JPEG bytes and the segmentation PNG bytes; both are decoded inside the training step. Columns added in the Evolve section above cost nothing per batch until they are explicitly projected.
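A sketch of such a loader follows. The import path lancedb.permutation, the batch size, and the worker count are assumptions, and the exact shape of each batch depends on your LanceDB version, so treat this as a starting point rather than a reference implementation.

```python
SEGMENTATION_COLUMNS = ["image", "segmentation"]
FEATURE_COLUMNS = ["image_emb", "objects_present"]


def make_loader(tbl, columns=SEGMENTATION_COLUMNS, batch_size=16, num_workers=4):
    """Wrap a projected view of the table in a PyTorch DataLoader."""
    # Lazy imports; the Permutation import path is an assumption.
    from lancedb.permutation import Permutation
    from torch.utils.data import DataLoader

    dataset = Permutation.identity(tbl).select_columns(list(columns))
    return DataLoader(
        dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers
    )


def decode_pair(image_bytes: bytes, segmentation_bytes: bytes):
    """Decode one JPEG image and its PNG segmentation map with Pillow."""
    import io
    from PIL import Image

    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    segmentation = Image.open(io.BytesIO(segmentation_bytes)).convert("RGB")
    return image, segmentation
```

Calling make_loader(tbl, FEATURE_COLUMNS) instead reads only the cached embeddings and class lists, with no image decoding.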
Passing ["image_emb", "objects_present"] to select_columns(...) on the next run skips JPEG and PNG decoding entirely and reads only the cached 512-d vectors plus the deduped class list, which is the right shape for training a lightweight scene classifier or a class-presence probe on top of frozen features.
Versioning
Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes.
A retrieval system pinned to a tag such as segmenter-baseline-v1 keeps reading the exact same segmentation maps and class lists while the dataset evolves in parallel; newly merged predictions or evolved columns do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images, so changes in mIoU reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking.
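The version and tag workflow can be sketched as follows. It assumes a LanceDB release that exposes tbl.tags and accepts a tag name in checkout; the function names are hypothetical.

```python
TAG = "segmenter-baseline-v1"


def show_history(tbl):
    """Print the current version and the full version list (read-only)."""
    print("current version:", tbl.version)
    for version in tbl.list_versions():
        print(version)


def pin_baseline(local_tbl, tag=TAG):
    """Tag the current version of a local table and return all tags."""
    # Tags are writes, so this needs a local, writable copy.
    local_tbl.tags.create(tag, local_tbl.version)
    return local_tbl.tags.list()


def open_pinned(local_tbl, tag=TAG):
    """Switch the table handle to the snapshot the tag points at."""
    local_tbl.checkout(tag)
    return local_tbl
```

show_history works against the Hub copy, while pin_baseline and open_pinned are meant for a local table.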
Materialize a subset
Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through .to_batches() into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory.
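The streaming pattern can be sketched like this. The scene list is a hypothetical definition of "indoor" (only bathroom appears in this card), the projected columns are an illustrative choice, and passing the batch reader straight to create_table is assumed to be supported by your LanceDB version.

```python
# Hypothetical definition of an "indoor" slice; adjust to your needs.
INDOOR_SCENES = ["bathroom", "bedroom", "kitchen"]
SUBSET_COLUMNS = [
    "id", "image", "segmentation", "scene", "objects_present", "num_objects",
]


def scene_filter(scenes):
    """SQL predicate matching rows whose scene list contains any given label."""
    quoted = ", ".join("'" + s.replace("'", "''") + "'" for s in scenes)
    return f"array_has_any(scene, [{quoted}])"


def materialize_subset(remote_tbl, local_uri="./ade20k-indoor-subset",
                       table_name="train"):
    """Stream matching rows from the Hub table into a new local table."""
    import lancedb  # lazy import

    batches = (
        remote_tbl.search()
        .where(scene_filter(INDOOR_SCENES))
        .select(SUBSET_COLUMNS)
        .to_batches()
    )
    return lancedb.connect(local_uri).create_table(table_name, batches)
```

Note that indices are not copied by this pattern; rebuild any you need on the local table.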
The resulting ./ade20k-indoor-subset is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping hf://datasets/lance-format/ade20k-lance/data for ./ade20k-indoor-subset.
Source & license
Converted from 1aurent/ADE20K. ADE20K is released under the BSD 3-Clause license by the MIT CSAIL Computer Vision group.