View on Hugging Face

Source dataset card and downloadable files for lance-format/oxford-pets-lance.
A Lance-formatted version of the Oxford-IIIT Pet dataset — 7,390 cat and dog photos across 37 breeds — sourced from pcuenq/oxford-pets. Each row carries the inline JPEG bytes, the breed name, a species flag distinguishing cats from dogs, and a cosine-normalized CLIP image embedding, all available directly from the Hub at hf://datasets/lance-format/oxford-pets-lance/data.

Key features

  • Inline JPEG bytes in the image column — no sidecar files, no image folders.
  • Pre-computed CLIP image embeddings (image_emb, OpenCLIP ViT-B-32, 512-dim, cosine-normalized) with a bundled IVF_PQ index for similarity search.
  • Both breed and species labels (label_name, is_dog) so a query can target a specific breed, all dogs, or all cats by stacking simple predicates.
  • Bitmap indices on both label columns make species- and breed-based curation a cheap predicate rather than a full scan.

Splits

| Split | Rows | Notes |
| --- | --- | --- |
| train.lance | 7,390 | The pcuenq/oxford-pets source mirror ships a single split; the canonical Oxford-IIIT trainval/test partition is not pre-applied here. |
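Because no trainval/test partition is pre-applied, a pipeline that needs a held-out set has to derive one. The sketch below is one possible approach, not the canonical Oxford-IIIT split: it hashes each row's id so the assignment is deterministic across runs and machines. The function name assign_split, the salt string, and the 20% test fraction are illustrative assumptions, and the split is not stratified by breed.

```python
import hashlib


def assign_split(row_id: int, test_fraction: float = 0.2, salt: str = "oxford-pets") -> str:
    """Map a row id to 'train' or 'test' deterministically (hypothetical helper)."""
    digest = hashlib.sha256(f"{salt}:{row_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and scale it to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


# The id column is the row index within the split, so ids run from 0 to 7,389.
splits = [assign_split(i) for i in range(7390)]
print(splits.count("train"), splits.count("test"))
```

The same id always lands in the same split, so the assignment can be recomputed on the fly instead of being stored.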

Schema

| Column | Type | Notes |
| --- | --- | --- |
| id | int64 | Row index within split (natural join key for merges) |
| image | large_binary | Inline JPEG bytes (quality 92) |
| label_name | string | One of 37 breeds, underscore-spaced (british_shorthair, yorkshire_terrier, …) |
| is_dog | bool | true for dog breeds, false for cat breeds |
| path | string? | Original filename from the source dataset |
| image_emb | fixed_size_list<float32, 512> | OpenCLIP ViT-B-32 image embedding (cosine-normalized) |

Pre-built indices

  • IVF_PQ on image_emb — vector similarity search (cosine)
  • BITMAP on label_name — fast lookup by breed
  • BITMAP on is_dog — fast species filter
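Since image_emb is stored cosine-normalized, each vector has unit length, and the cosine similarity between two stored embeddings reduces to their dot product. The pure-Python sketch below illustrates that identity on toy 3-d vectors (not real CLIP embeddings); the helper names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length, as is done to embeddings at ingest."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """General cosine similarity for vectors of arbitrary length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# For unit vectors the plain dot product and the cosine similarity coincide.
print(cosine_similarity(a, b), dot(a_unit, b_unit))
```

A query vector should be normalized the same way before it is compared against the stored embeddings.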

Why Lance?

  1. Blazing Fast Random Access: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation.
  2. Native Multimodal Support: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search.
  3. Native Index Support: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them.
  4. Efficient Data Evolution: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time.
  5. Versatile Querying: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes.
  6. Data Versioning: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history.

Load with datasets.load_dataset

You can load Lance datasets through the standard Hugging Face datasets interface. This suits pipelines that already speak Dataset / IterableDataset, or cases where you want a quick streaming sample without installing anything Lance-specific.
import datasets

hf_ds = datasets.load_dataset("lance-format/oxford-pets-lance", split="train", streaming=True)
for row in hf_ds.take(3):
    print(row["label_name"], row["is_dog"])

Load with LanceDB

LanceDB is the embedded retrieval library built on top of the Lance format (docs), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below.
import lancedb

db = lancedb.connect("hf://datasets/lance-format/oxford-pets-lance/data")
tbl = db.open_table("train")
print(len(tbl))

Load with Lance

pylance is the Python binding for the Lance format and works directly with the format’s lower-level APIs. Reach for it when you want to inspect or operate on dataset internals — schema, scanner, fragments, and the list of pre-built indices.
import lance

ds = lance.dataset("hf://datasets/lance-format/oxford-pets-lance/data/train.lance")
print(ds.count_rows(), ds.schema.names)
print(ds.list_indices())
Tip — for production use, download locally first. Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy:
hf download lance-format/oxford-pets-lance --repo-type dataset --local-dir ./oxford-pets-lance
Then point Lance or LanceDB at ./oxford-pets-lance/data.

Search

The bundled IVF_PQ index on image_emb makes approximate-nearest-neighbor search a single call. In production you would encode a query photo through the same OpenCLIP ViT-B-32 model used at ingest and pass the resulting 512-d vector to tbl.search(...). The example below uses the embedding stored in row 0 as a runnable stand-in so the snippet works without a model loaded; on a clean run the first hit is expected to be the seed image itself, which is a useful sanity check on the index.
import lancedb

db = lancedb.connect("hf://datasets/lance-format/oxford-pets-lance/data")
tbl = db.open_table("train")

seed = (
    tbl.search()
    .select(["image_emb", "label_name", "is_dog"])
    .limit(1)
    .to_list()[0]
)

hits = (
    tbl.search(seed["image_emb"])
    .metric("cosine")
    .select(["id", "label_name", "is_dog"])
    .limit(10)
    .to_list()
)
print(f"seed: {seed['label_name']} (is_dog={seed['is_dog']})")
for r in hits:
    print(f"  {r['id']:>5}  {r['label_name']:<22}  is_dog={r['is_dog']}")
Tune metric, nprobes, and refine_factor to trade recall against latency for your workload.
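To build intuition for what nprobes controls, here is a toy inverted-file (IVF) search in pure Python. It is a simplified sketch, not Lance's implementation: centroids are sampled from the data instead of trained with k-means, vectors are 8-d random data rather than CLIP embeddings, and neither product quantization nor refine_factor is modelled. All function names are illustrative.

```python
import math
import random


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine_distance(a, b):
    # Inputs are unit vectors, so cosine distance is 1 - dot product.
    return 1.0 - sum(x * y for x, y in zip(a, b))


def build_partitions(vectors, centroids):
    """Assign every vector position to its nearest centroid (the IVF step)."""
    partitions = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: cosine_distance(v, centroids[c]))
        partitions[nearest].append(i)
    return partitions


def ivf_search(query, vectors, centroids, partitions, k, nprobes):
    """Scan only the nprobes partitions whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)), key=lambda c: cosine_distance(query, centroids[c]))
    candidates = [i for c in probe_order[:nprobes] for i in partitions[c]]
    candidates.sort(key=lambda i: cosine_distance(query, vectors[i]))
    return candidates[:k]


rng = random.Random(0)
vectors = [normalize([rng.gauss(0, 1) for _ in range(8)]) for _ in range(500)]
centroids = [vectors[i] for i in rng.sample(range(len(vectors)), 10)]  # toy centroids
partitions = build_partitions(vectors, centroids)

query = vectors[0]  # seed with a stored vector, as in the snippet above
# Probing every partition is equivalent to an exact brute-force scan.
exact = ivf_search(query, vectors, centroids, partitions, k=10, nprobes=len(centroids))

recalls = {}
for nprobes in (1, 3, 10):
    approx = ivf_search(query, vectors, centroids, partitions, k=10, nprobes=nprobes)
    recalls[nprobes] = len(set(approx) & set(exact)) / len(exact)
    print(f"nprobes={nprobes:>2}  recall@10={recalls[nprobes]:.2f}")
```

Raising nprobes scans more partitions, so recall can only stay level or improve while the number of distance computations grows. The seed vector is found even with nprobes=1, because its own partition is always the first one probed.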

Curate

A typical curation pass for a fine-grained pet classifier stacks the species predicate and a breed predicate inside a single filtered scan. With bitmap indices on both label_name and is_dog, the result comes back in milliseconds, and the bounded .limit(200) keeps it small enough to inspect or hand off to a training run.
import lancedb

db = lancedb.connect("hf://datasets/lance-format/oxford-pets-lance/data")
tbl = db.open_table("train")

candidates = (
    tbl.search()
    .where("is_dog = true AND label_name IN ('samoyed', 'beagle', 'pug')")
    .select(["id", "label_name", "is_dog", "path"])
    .limit(200)
    .to_list()
)
print(f"{len(candidates)} candidates; first: {candidates[0]['label_name']}")
The result is a plain list of dictionaries, ready to inspect, persist as a manifest of ids, or feed into the Evolve and Train workflows below. The image and image_emb columns are never read by this query, so the network traffic is dominated by the small label fields rather than JPEG bytes or vectors.
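The reason a stacked predicate like the one above is cheap can be seen in a toy bitmap index. The sketch below is a conceptual illustration, not Lance's on-disk structure: each distinct value maps to a Python integer used as a bitset, and the filter becomes bitwise OR and AND over those integers. The five rows and the helper names are made up for the example.

```python
# Hypothetical rows standing in for the label columns of the table.
rows = [
    {"id": 0, "label_name": "beagle", "is_dog": True},
    {"id": 1, "label_name": "bengal", "is_dog": False},
    {"id": 2, "label_name": "pug", "is_dog": True},
    {"id": 3, "label_name": "beagle", "is_dog": True},
    {"id": 4, "label_name": "samoyed", "is_dog": True},
]


def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = {}
    for position, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << position)
    return index


def matching_positions(bitmap):
    """Return the row positions whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


label_index = build_bitmap_index(rows, "label_name")
species_index = build_bitmap_index(rows, "is_dog")

# is_dog = true AND label_name IN ('beagle', 'pug')
breed_bits = label_index.get("beagle", 0) | label_index.get("pug", 0)
result_bits = species_index.get(True, 0) & breed_bits
print(matching_positions(result_bits))  # [0, 2, 3]
```

Only the matching row positions are materialized, and the row data is never scanned to evaluate the predicate.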

Evolve

Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below derives a species string from the is_dog boolean and adds a coarse breed-group flag for terriers, either of which can then be used directly in where clauses without recomputing the predicate on every query.
Note: Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use hf download to pull the full split first.
import lancedb

db = lancedb.connect("./oxford-pets-lance/data")  # local copy required for writes
tbl = db.open_table("train")

tbl.add_columns({
    "species": "CASE WHEN is_dog THEN 'dog' ELSE 'cat' END",
    "is_terrier": "label_name LIKE '%terrier%'",
})
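Before committing derived columns, it can help to preview what the two SQL expressions produce. The sketch below uses Python's built-in sqlite3 purely as a stand-in SQL engine on three hypothetical rows. Lance evaluates expressions with its own SQL engine, so details may differ: SQLite returns 1/0 where Lance would store booleans, and SQLite's LIKE is case-insensitive for ASCII.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER, label_name TEXT, is_dog INTEGER)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [(0, "yorkshire_terrier", 1), (1, "british_shorthair", 0), (2, "beagle", 1)],
)

# Same expressions as in add_columns, evaluated as a plain SELECT.
preview = conn.execute(
    """
    SELECT id,
           CASE WHEN is_dog THEN 'dog' ELSE 'cat' END AS species,
           label_name LIKE '%terrier%' AS is_terrier
    FROM pets
    ORDER BY id
    """
).fetchall()
conn.close()

for row in preview:
    print(row)
```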
If the values you want to attach already live in another table (offline labels, classifier predictions, an integer class id), merge them in by joining on id:
import pyarrow as pa

class_ids = pa.table({
    "id": pa.array([0, 1, 2]),
    "label_int": pa.array([0, 0, 17]),
})
tbl.merge(class_ids, left_on="id")  # left join on the shared id column
The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a second embedding model over the JPEG bytes), Lance provides a batch-UDF API in the underlying library — see the Lance data evolution docs for that pattern.

Train

Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through Permutation.identity(tbl).select_columns([...]), which plugs straight into the standard torch.utils.data.DataLoader so prefetch, shuffling, and batching behave as in any PyTorch pipeline. For a from-scratch breed classifier, project the JPEG bytes and the string breed label; for a linear probe on top of frozen CLIP features, swap the projection to the embedding column and skip JPEG decoding entirely.
import lancedb
from lancedb.permutation import Permutation
from torch.utils.data import DataLoader

db = lancedb.connect("hf://datasets/lance-format/oxford-pets-lance/data")
tbl = db.open_table("train")

train_ds = Permutation.identity(tbl).select_columns(["image", "label_name"])
loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4)

for batch in loader:
    # batch carries only the projected columns; image_emb stays on disk.
    # decode the JPEG bytes, map label_name -> int via a class list, forward, cross-entropy...
    ...
Switching feature sets is a configuration change: passing ["image_emb", "label_name"] to select_columns(...) on the next run reads only the cached 512-d vectors and the label, which is the right shape for a linear probe or a lightweight reranker. Projecting ["image", "is_dog"] reduces the task to binary species classification on the same data.
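The label_name -> int mapping mentioned in the loop comment can be sketched as follows. This is a minimal illustration that assumes each row arrives as a dict with image and label_name keys; the exact batch structure produced by the loader may differ, and the helper names and placeholder byte strings are hypothetical.

```python
def build_class_index(label_names):
    """Assign a stable integer id to each breed, ordered alphabetically."""
    return {name: idx for idx, name in enumerate(sorted(set(label_names)))}


def encode_batch(rows, class_to_idx):
    """Split rows into raw JPEG payloads and integer class targets."""
    images = [row["image"] for row in rows]
    targets = [class_to_idx[row["label_name"]] for row in rows]
    return images, targets


# Hypothetical rows; real ones carry full JPEG bytes in the image column.
rows = [
    {"image": b"jpeg-bytes-0", "label_name": "pug"},
    {"image": b"jpeg-bytes-1", "label_name": "beagle"},
    {"image": b"jpeg-bytes-2", "label_name": "pug"},
    {"image": b"jpeg-bytes-3", "label_name": "samoyed"},
]

class_to_idx = build_class_index(row["label_name"] for row in rows)
images, targets = encode_batch(rows, class_to_idx)
print(class_to_idx, targets)
```

Sorting the class names keeps the integer ids stable between runs, provided the class index is built from the full label set and not from a single batch.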

Versioning

Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes.
import lancedb

db = lancedb.connect("hf://datasets/lance-format/oxford-pets-lance/data")
tbl = db.open_table("train")

print("Current version:", tbl.version)
print("History:", tbl.list_versions())
print("Tags:", tbl.tags.list())
Once you have a local copy, tag a version for reproducibility:
local_db = lancedb.connect("./oxford-pets-lance/data")
local_tbl = local_db.open_table("train")
local_tbl.tags.create("clip-vitb32-v1", local_tbl.version)
A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one:
tbl_v1 = db.open_table("train")
tbl_v1.checkout("clip-vitb32-v1")  # pin this handle to the tagged version
tbl_v5 = db.open_table("train")
tbl_v5.checkout(5)  # or pin it to a version number
Pinning supports two workflows. A retrieval system locked to clip-vitb32-v1 keeps returning stable results while the dataset evolves in parallel; newly added columns or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images and labels, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking.
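The pinning behaviour can be modelled with a toy class. The sketch below is a conceptual illustration only: VersionedTable is hypothetical and says nothing about Lance's API or storage layout. It shows why a tag keeps resolving to the same snapshot after later commits, since each commit appends a new snapshot and never edits an old one.

```python
class VersionedTable:
    """Toy model: every commit appends an immutable snapshot; tags name a version."""

    def __init__(self, columns):
        self._snapshots = [tuple(columns)]  # version 1 lives at index 0
        self._tags = {}

    @property
    def version(self):
        return len(self._snapshots)

    def add_column(self, name):
        # A commit never edits an old snapshot; it appends a new one.
        self._snapshots.append(self._snapshots[-1] + (name,))

    def tag(self, name, version):
        self._tags[name] = version

    def checkout(self, ref):
        # Accept either a tag name or a version number.
        version = self._tags[ref] if isinstance(ref, str) else ref
        return self._snapshots[version - 1]


table = VersionedTable(["id", "image", "label_name", "is_dog", "path", "image_emb"])
table.tag("clip-vitb32-v1", table.version)
table.add_column("species")  # the dataset evolves after the tag was created

print(table.checkout("clip-vitb32-v1"))  # schema as of the tag, without species
print(table.checkout(table.version))     # latest schema, with species
```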

Materialize a subset

Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through .to_batches() into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory.
import lancedb

remote_db = lancedb.connect("hf://datasets/lance-format/oxford-pets-lance/data")
remote_tbl = remote_db.open_table("train")

batches = (
    remote_tbl.search()
    .where("is_dog = true")
    .select(["id", "image", "label_name", "is_dog", "image_emb"])
    .to_batches()
)

local_db = lancedb.connect("./oxford-pets-dogs-subset")
local_db.create_table("train", batches)
The resulting ./oxford-pets-dogs-subset is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping hf://datasets/lance-format/oxford-pets-lance/data for ./oxford-pets-dogs-subset.

Source & license

Converted from pcuenq/oxford-pets. Released under CC BY-SA 4.0.

Citation

@inproceedings{parkhi2012cats,
  title={Cats and Dogs},
  author={Parkhi, Omkar M. and Vedaldi, Andrea and Zisserman, Andrew and Jawahar, C. V.},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2012}
}