## View on Hugging Face

Source dataset card and downloadable files for `lance-format/openvid-lance`, stored at `hf://datasets/lance-format/openvid-lance/data/train.lance`.
## Key features
- Inline MP4 bytes in the `video_blob` column, stored in a side blob file and surfaced as lazy `BlobFile` handles via `take_blobs` — metadata scans, search, and filtering never read a single byte of video data.
- Pre-computed 1024-dim video embeddings in `embedding` with a bundled `IVF_PQ` ANN index.
- Pre-built `INVERTED` (FTS) index on `caption` for keyword and hybrid search.
- Rich quality signals — `aesthetic_score`, `motion_score`, `temporal_consistency_score`, `camera_motion`, `fps`, `seconds` — that downstream filters can stack on.
## Splits

`train.lance`
## Schema
| Column | Type | Notes |
|---|---|---|
| `video_blob` | large_binary (blob-encoded) | Inline MP4 bytes; stored in a separate blob file and read lazily through `take_blobs` |
| `video_path` | string | Original file path / object key |
| `caption` | string | Text description of the clip |
| `embedding` | fixed_size_list&lt;float32, 1024&gt; | Video embedding |
| `aesthetic_score` | float64 | Visual quality, roughly 0–6 |
| `motion_score` | float64 | Amount of motion, 0–1 |
| `temporal_consistency_score` | float64 | Frame-to-frame stability, 0–1 |
| `camera_motion` | string | pan, zoom, static, etc. |
| `fps` | float64 | Frames per second |
| `seconds` | float64 | Clip duration |
| `frame` | int64 | Total frame count |
## Pre-built indices
- `IVF_PQ` on `embedding` — video similarity (L2)
- `INVERTED` (FTS) on `caption` — keyword and hybrid search
## Why Lance?
- Blazing Fast Random Access: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation.
- Native Multimodal Support: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search.
- Native Index Support: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them.
- Efficient Data Evolution: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time.
- Versatile Querying: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes.
- Data Versioning: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history.
## Load with `datasets.load_dataset`
You can load Lance datasets via the standard HuggingFace datasets interface, suitable when your pipeline already speaks Dataset / IterableDataset or you just want a quick streaming sample.
## Load with LanceDB
LanceDB is the embedded retrieval library built on top of the Lance format (docs), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below.

## Load with Lance
pylance is the Python binding for the Lance format and works directly with the format’s lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices — or when you need the blob-level take_blobs entry point that streams video bytes lazily from inline storage.
Tip — for production use, download locally first. Streaming from the Hub works for exploration, but heavy random access, ANN search, and video decoding are far faster against a local copy; then point Lance or LanceDB at `./openvid/data`.
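A sketch of the download step using the `hf` CLI this card mentions (flags per the Hugging Face Hub CLI; adjust the target directory to taste):

```shell
# Pull the full corpus to ./openvid, then read from ./openvid/data locally.
hf download lance-format/openvid-lance --repo-type dataset --local-dir ./openvid
```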
## Search
The bundled `IVF_PQ` index on `embedding` makes approximate-nearest-neighbor search a single call. In production you would encode a text prompt through a text-to-video model or a reference clip through the same video encoder used at ingest, and pass the resulting 1024-d vector to `tbl.search(...)`. The example below uses the embedding from row 42 as a runnable stand-in.
The `video_blob` column is never read, so the network traffic for a top-10 search is dominated by a few kilobytes of caption text, not by megabytes of MP4. The lazy blob fetch comes later — see Curate below.
Because OpenVid also ships an INVERTED index on caption, the same query can be issued as a hybrid search that combines the dense vector with a keyword query. LanceDB merges the two result lists and reranks them in a single call.
Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency.
## Curate
Curation for a video workflow almost always starts as a metadata filter — pick the dynamic, high-aesthetic, well-stabilized clips first, then decide what to do with the video bytes. Stacking predicates inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(200)` makes it cheap to inspect or hand off.
None of this touches the `video_blob` column. Lance stores blobs in a separate side file referenced by the dataset, so column-projected reads skip them entirely until they are explicitly requested. That is what makes “find me the right clips” a metadata-only operation against a million-row video corpus.
Once the candidate set is fixed, pull the actual video bytes through pylance’s `take_blobs`. It returns one `BlobFile` per row — a file-like handle that streams from inline blob storage on demand rather than reading the full clip into Python memory up front. For video specifically, this is the operation that matters: a video model trainer or a dataloader inspecting a few seconds of each clip should never have to materialize entire MP4s in memory just to inspect or decode part of them.
`BlobFile` implements the file protocol, so it can be passed straight to a decoder like PyAV without first being copied through a `bytes` object. The decoder seeks and reads against the underlying handle, which means a 2-second sample from a 30-second clip moves only the bytes the decoder actually touches — not the whole MP4.
If you only need the raw bytes, call `.read()` on each handle. The lazy semantics are the same; `read()` simply materializes the full blob for that one row.
## Evolve
Lance stores each column independently, so a new column can be appended without rewriting the existing data — including the video blobs, which stay exactly where they are. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds a `duration_bucket` and an `is_high_quality` flag, either of which can then be used directly in `where` clauses without re-evaluating the predicate on every query.
Note: Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use `hf download` to pull the full corpus first.
Columns computed outside Lance can also be merged in as a join keyed on `video_path`:
The existing column files and the `video_blob` side file are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running an alternative video encoder over the inline bytes), Lance provides a batch-UDF API — see the Lance data evolution docs.
## Train
A common pattern for video training is to pre-extract decoded frames once into a derived LanceDB table, and train against that table with the regular projection-based dataloader. `take_blobs` is the mechanism that makes the extraction step tractable: each clip’s MP4 is randomly addressable, so the pass can subset bytes on demand and write decoded windows into a fresh table without an external file store. Other workflows project `video_blob` directly through `select_columns(...)` and decode at the batch boundary, or skip pixels entirely and train on the cached embeddings — the right shape is workload-specific. The actual training loop is the same `Permutation.identity(tbl).select_columns(...)` snippet in every case; only the source table and the column list change.
Against a pre-extracted frames table:
Inline `video_blob` storage and `take_blobs` still earn their place outside of the training loop — random-access inspection of a clip in a notebook, sampling for human review, one-off evaluation against a held-out set, and the pre-extraction step itself — but they are not the dataloader.
## Versioning
Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk, with the same blob handles still valid. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy, since tags are writes. A tag such as `quality-v1` keeps returning stable results while the dataset evolves in parallel — newly added columns or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same clips, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking.
## Materialize a subset
Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access into the blob file. Both can be served by a subset of the dataset rather than the full corpus. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory — including the `video_blob` column, which streams through Arrow record batches rather than being assembled in a single buffer.
The resulting `./openvid-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping `hf://datasets/lance-format/openvid-lance/data` for `./openvid-subset`. The same `take_blobs` pattern from Curate also works against the local copy — and runs faster, because the blob side file is now on local disk.