View on Hugging Face: source dataset card and downloadable files for lance-format/gqa-testdev-balanced-lance.
A Lance-format copy of the GQA `testdev_balanced` slice: 12,578 compositional VQA questions joined against the matching 398 images, sourced from lmms-lab/GQA. The original redistribution ships instructions and images as separate parquet configs; here they are pre-joined on `image_id`, so each row carries the question text, the short answer, the GQA reasoning-program tags, paired CLIP image and question embeddings, and the inline JPEG bytes. Everything is available directly from the Hub at hf://datasets/lance-format/gqa-testdev-balanced-lance/data.
Key features
- Inline JPEG bytes in the `image` column, duplicated across rows that share an `image_id` so each Q/A row is self-contained.
- Paired CLIP embeddings in the same row, `image_emb` and `question_emb` (512-dim, cosine-normalized), for cross-modal retrieval as one indexed lookup.
- Compositional reasoning metadata: `structural`, `semantic`, and `detailed` question-type tags plus the `semantic_str` reasoning program.
- Pre-built ANN, FTS, scalar, and bitmap indices covering both embeddings, the question and short answer, the reasoning-type tags, and the image/question ids.
Splits
| Split | Rows | Distinct images |
|---|---|---|
| `testdev.lance` | 12,578 | 398 |
Pass `--instr-config` / `--images-config` to `gqa/dataprep.py` to extend.
Schema
| Column | Type | Notes |
|---|---|---|
| `id` | `int64` | Row index within split |
| `image` | `large_binary` | Inline JPEG bytes (duplicated across rows that share an `image_id`) |
| `image_id` | `string` | GQA scene-graph image id |
| `question_id` | `string` | GQA question id |
| `question` | `string` | Compositional natural-language question |
| `answers` | `list<string>` | One-element list (the GQA short answer) |
| `answer` | `string` | Canonical short answer (used for FTS) |
| `full_answer` | `string?` | Full-sentence answer |
| `structural` | `string?` | One of verify, query, compare, choose, logical |
| `semantic` | `string?` | One of attr, cat, global, obj, rel |
| `detailed` | `string?` | Fine-grained type (e.g. weatherVerifyC) |
| `is_balanced` | `bool` | GQA balanced subset flag |
| `group_global`, `group_local` | `string?` | GQA reasoning-group ids |
| `semantic_str` | `string?` | Compact description of the reasoning program |
| `image_emb` | `fixed_size_list<float32, 512>` | CLIP image embedding (cosine-normalized) |
| `question_emb` | `fixed_size_list<float32, 512>` | CLIP text embedding of the question |
Pre-built indices
- `IVF_PQ` on `image_emb`: image-side vector search (cosine)
- `IVF_PQ` on `question_emb`: question-side vector search (cosine)
- `INVERTED` (FTS) on `question` and `answer`: keyword and hybrid search
- `BITMAP` on `structural`, `semantic`, `detailed`: fast categorical filters on the reasoning program
- `BTREE` on `image_id`, `question_id`: fast lookup by GQA id
Why Lance?
- Blazing Fast Random Access: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation.
- Native Multimodal Support: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search.
- Native Index Support: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them.
- Efficient Data Evolution: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time.
- Versatile Querying: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes.
- Data Versioning: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history.
Load with datasets.load_dataset
You can load Lance datasets through the standard Hugging Face `datasets` interface. This path is suitable when your pipeline already speaks `Dataset` / `IterableDataset` or you want a quick streaming sample.
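A minimal sketch of the streaming path. The split name `testdev` is an assumption inferred from the `testdev.lance` file in the Splits table, and `format_row` / `stream_sample` are illustrative helper names rather than library functions.

```python
REPO_ID = "lance-format/gqa-testdev-balanced-lance"


def format_row(row):
    """Render one Q/A row as a single printable line."""
    return f"{row['question']} -> {row['answer']}"


def stream_sample(n=3, split="testdev"):
    """Stream the first n rows from the Hub without downloading the full split.

    The split name is an assumption; adjust it if the loader reports another.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    ds = load_dataset(REPO_ID, split=split, streaming=True)
    return [format_row(row) for row in ds.take(n)]


# Usage (needs network access):
#   for line in stream_sample():
#       print(line)
```

Streaming keeps the sample cheap: only the rows consumed by `take(n)` are fetched.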
Load with LanceDB
LanceDB is the embedded retrieval library built on top of the Lance format, and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below.
Load with Lance
pylance is the Python binding for the Lance format and works directly with the format’s lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, and the list of pre-built indices.
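Both entry points, LanceDB and pylance, can be sketched side by side. The table name `testdev` is an assumption taken from the `testdev.lance` file name, and the helper names are illustrative. Third-party imports are kept inside the functions so the helpers can be defined without the libraries installed.

```python
HUB_URI = "hf://datasets/lance-format/gqa-testdev-balanced-lance/data"
TABLE_NAME = "testdev"  # assumption: derived from testdev.lance


def open_with_lancedb(uri=HUB_URI, table=TABLE_NAME):
    """Open the split as a LanceDB table (search and filter builders)."""
    import lancedb  # third-party: pip install lancedb

    return lancedb.connect(uri).open_table(table)


def dataset_uri(uri=HUB_URI, table=TABLE_NAME):
    """Path of the underlying .lance directory for pylance."""
    return f"{uri}/{table}.lance"


def open_with_lance(uri=HUB_URI, table=TABLE_NAME):
    """Open the same data through the lower-level pylance API."""
    import lance  # third-party: pip install pylance

    return lance.dataset(dataset_uri(uri, table))


def describe(ds):
    """Summarise a pylance dataset: row count, columns, pre-built indices."""
    return {
        "rows": ds.count_rows(),
        "columns": list(ds.schema.names),
        "indices": ds.list_indices(),
    }


# Usage (needs network access):
#   tbl = open_with_lancedb()
#   print(describe(open_with_lance()))
```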
Tip: for production use, download locally first. Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy. Once downloaded, point Lance or LanceDB at `./gqa-testdev-balanced-lance/data`.
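The download step in the tip above could look like the following sketch. It uses `huggingface_hub.snapshot_download` as a Python stand-in for the `hf download` CLI; the helper names are illustrative.

```python
from pathlib import Path

REPO_ID = "lance-format/gqa-testdev-balanced-lance"


def local_data_uri(root):
    """Directory to hand to lancedb.connect() or lance.dataset()."""
    return str(Path(root) / "data")


def download_dataset(local_dir="./gqa-testdev-balanced-lance"):
    """Pull the whole dataset repo to local disk and return its data path."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    root = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # the repo is a dataset, not a model
        local_dir=local_dir,
    )
    return local_data_uri(root)


# Usage (needs network access and enough disk for the inline JPEG bytes):
#   uri = download_dataset()
#   # then: lancedb.connect(uri)
```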
Search
The bundled `IVF_PQ` index on `image_emb` makes cross-modal text→image retrieval a single call: encode a question with the same CLIP model used at ingest (ViT-B/32, cosine-normalized), then pass the resulting 512-d vector to `tbl.search(...)` and target `image_emb`. The example below uses the `question_emb` already stored in row 42 as a runnable stand-in for “the CLIP encoding of a question”, so the snippet works without any model loaded.
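A sketch of that lookup, assuming `tbl` is a LanceDB table opened on this dataset and that "row 42" means the row whose `id` column equals 42. The function name and the projected columns are illustrative choices.

```python
def image_search_from_row(tbl, row_id=42, k=5):
    """Use a stored question embedding as the query against image_emb."""
    # Step 1: fetch the seed row's question and its 512-d CLIP text vector.
    seed = (
        tbl.search()
        .where(f"id = {row_id}")
        .select(["question", "question_emb"])
        .limit(1)
        .to_list()[0]
    )
    # Step 2: nearest-neighbour search over the image-side embeddings.
    hits = (
        tbl.search(seed["question_emb"], vector_column_name="image_emb")
        .select(["question_id", "image_id", "question", "answer"])
        .limit(k)
        .to_list()
    )
    return seed["question"], hits


# Usage (needs network access and `pip install lancedb`):
#   import lancedb
#   uri = "hf://datasets/lance-format/gqa-testdev-balanced-lance/data"
#   tbl = lancedb.connect(uri).open_table("testdev")  # table name assumed
#   question, hits = image_search_from_row(tbl)
```

The `image` column is left out of both projections, so the search returns ids and text only; fetch JPEG bytes afterwards for the few hits that matter.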
Swap `vector_column_name="image_emb"` for `question_emb` to find paraphrased or topically related questions instead.
The dataset also ships an INVERTED index on question and answer, so the same query can be issued as a hybrid search that combines the dense vector with a literal keyword match. This is useful when a noun like “umbrella” must appear in the question text but you still want CLIP to handle visual similarity over the candidate set.
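A hedged sketch of the hybrid form, using LanceDB's hybrid query builder. The function name, the `umbrella` keyword default, and restricting the keyword match to `question` via `fts_columns` are illustrative choices; the default reranker is left in place.

```python
def hybrid_image_search(tbl, query_vector, keyword="umbrella", k=10):
    """Combine a dense CLIP query with a literal keyword match."""
    return (
        tbl.search(
            query_type="hybrid",
            vector_column_name="image_emb",  # dense side: image embeddings
            fts_columns="question",          # sparse side: question text only
        )
        .vector(query_vector)   # 512-d CLIP vector
        .text(keyword)          # keyword that should appear in the question
        .select(["question_id", "image_id", "question", "answer"])
        .limit(k)
        .to_list()
    )


# Usage (needs network access and `pip install lancedb`):
#   hits = hybrid_image_search(tbl, seed_vector, keyword="umbrella")
#   where `tbl` is the opened table and `seed_vector` is a 512-d list of floats.
```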
Tune `metric`, `nprobes`, and `refine_factor` on the vector side to trade recall against latency.
Curate
A typical curation pass for a compositional-reasoning study combines a predicate on the question text (or the GQA short answer) with a structural filter on the reasoning program, so the candidate set is both topically and structurally consistent. Stacking both inside a single filtered scan keeps the result small and explicit, and the bounded `.limit(500)` makes it cheap to inspect before committing the subset to anything downstream.
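One way to express such a pass, as a sketch: a small helper builds the SQL predicate and a second runs the bounded scan. The helper names, the `LIKE` substring match, and the default keyword and tag are assumptions for illustration.

```python
def sql_quote(value):
    """Quote a string literal for a SQL filter, doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_filter(keyword, structural):
    """Structural tag AND substring match on the question text."""
    pattern = sql_quote("%" + keyword + "%")
    return f"structural = {sql_quote(structural)} AND question LIKE {pattern}"


def curate(tbl, keyword="umbrella", structural="verify", limit=500):
    """Run one filtered scan; the image column is deliberately not projected."""
    return (
        tbl.search()
        .where(build_filter(keyword, structural))
        .select(["question_id", "image_id", "question", "answer", "structural"])
        .limit(limit)
        .to_list()
    )


# Usage, with `tbl` opened through LanceDB:
#   rows = curate(tbl)
#   question_ids = [r["question_id"] for r in rows]
```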
The resulting rows can be inspected directly, saved as a list of `question_id`s, or fed into the Evolve and Train workflows below. The `image` column is never read, so the network traffic for a 500-row candidate scan is dominated by the question and answer strings rather than JPEG bytes.
Evolve
Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds an `is_binary_answer` flag and a `question_length` integer, either of which can then be used directly in `where` clauses without recomputing the predicate on every query.
Note: Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use hf download to pull the full split first.
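A sketch of the SQL-expression form using LanceDB's `add_columns`. The two expressions are assumptions about how the flags might be defined (yes/no answers count as binary; `length` is the SQL string-length function), and `local_tbl` must be a table opened from a writable local copy.

```python
# New column name -> SQL expression over existing columns.
# Both definitions are illustrative assumptions.
DERIVED_COLUMNS = {
    "is_binary_answer": "answer IN ('yes', 'no')",
    "question_length": "length(question)",
}


def add_derived_columns(local_tbl, transforms=DERIVED_COLUMNS):
    """Append computed columns; existing column data is not rewritten."""
    local_tbl.add_columns(transforms)
    return list(local_tbl.schema.names)


# Usage (local copy required, the Hub mount is read-only):
#   import lancedb
#   local_tbl = lancedb.connect("./gqa-testdev-balanced-lance/data").open_table("testdev")
#   print(add_derived_columns(local_tbl))
#   short_binary = local_tbl.search().where(
#       "is_binary_answer AND question_length < 40"
#   ).limit(5).to_list()
```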
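Values computed outside the table, such as labels or model predictions, can be merged in by joining on `question_id`.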
Train
Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through `Permutation.identity(tbl).select_columns([...])`, which plugs straight into the standard `torch.utils.data.DataLoader` so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For a VQA fine-tune, project the JPEG bytes, the question, and the short answer; columns added in the Evolve section above cost nothing per batch until they are explicitly projected.
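A rough sketch of the projection-based loader. The import path `lancedb.permutation` is an assumption (check it against your installed LanceDB version), the `feature_columns` and `make_loader` helpers are illustrative, and the batch size is arbitrary.

```python
def feature_columns(mode="pixels"):
    """Columns to project for each kind of training run."""
    if mode == "pixels":
        # VQA fine-tune: raw JPEG bytes plus question and short answer.
        return ["image", "question", "answer"]
    if mode == "embeddings":
        # Probe over frozen CLIP features: no JPEG decoding needed.
        return ["image_emb", "question_emb", "answer"]
    raise ValueError(f"unknown mode: {mode!r}")


def make_loader(tbl, mode="pixels", batch_size=32):
    """Wrap a LanceDB table in a standard PyTorch DataLoader."""
    from lancedb.permutation import Permutation  # assumed import path
    from torch.utils.data import DataLoader      # pip install torch

    dataset = Permutation.identity(tbl).select_columns(feature_columns(mode))
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)


# Usage, with `tbl` opened from a local copy:
#   for batch in make_loader(tbl, mode="pixels"):
#       ...  # decode JPEG bytes, tokenize questions, run the model
```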
Passing `["image_emb", "question_emb", "answer"]` to `select_columns(...)` on the next run skips JPEG decoding entirely and reads only the cached 512-d vectors, which is the right shape for a lightweight reasoning probe over frozen CLIP features.
Versioning
Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes.
A retrieval system pinned to a tag such as `clip-vitb32-v1` keeps returning stable results while the dataset evolves in parallel; newly added model predictions or reasoning annotations do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images and questions, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking.
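A sketch of both halves, read-only inspection and local tagging, using LanceDB's `version`, `list_versions()`, and `tags` accessors. The helper names are illustrative, and pinning the current version is just one possible choice of snapshot.

```python
def history(tbl):
    """Read-only view of the version history; works against the Hub copy."""
    return {
        "current": tbl.version,
        "versions": tbl.list_versions(),
        "tags": tbl.tags.list(),
    }


def pin_current_version(local_tbl, tag="clip-vitb32-v1"):
    """Attach a named tag to the current version (a write: local copy only)."""
    version = local_tbl.version
    local_tbl.tags.create(tag, version)
    return version


# Usage:
#   print(history(tbl))              # Hub or local table
#   pin_current_version(local_tbl)   # local table only
#   local_tbl.checkout("clip-vitb32-v1")  # later: read the pinned snapshot
```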
Materialize a subset
Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through `.to_batches()` into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory.
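The streaming pattern can be sketched as follows. The yes/no filter is an assumption suggested by the `./gqa-yesno-subset` name; the projected column list, the table name `testdev`, and the helper name are illustrative.

```python
SUBSET_COLUMNS = [
    "question_id", "image_id", "question", "answer",
    "structural", "image", "image_emb", "question_emb",
]


def materialize_subset(tbl, local_db, name="testdev",
                       where="answer IN ('yes', 'no')"):
    """Stream matching rows from `tbl` into a new table in `local_db`."""
    # to_batches() yields Arrow record batches lazily, so the subset is
    # written incrementally instead of being collected in memory first.
    batches = (
        tbl.search()
        .where(where)
        .select(SUBSET_COLUMNS)
        .to_batches()
    )
    return local_db.create_table(name, batches)


# Usage (needs network access and `pip install lancedb`):
#   import lancedb
#   hub = lancedb.connect("hf://datasets/lance-format/gqa-testdev-balanced-lance/data")
#   local_db = lancedb.connect("./gqa-yesno-subset")
#   local_tbl = materialize_subset(hub.open_table("testdev"), local_db)
```

Note that vector and scalar indices are not carried over by this copy; they would need to be rebuilt on the local table if the subset is to be searched.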
The resulting `./gqa-yesno-subset` is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping hf://datasets/lance-format/gqa-testdev-balanced-lance/data for `./gqa-yesno-subset`.
Source & license
Converted from lmms-lab/GQA. GQA is released under CC BY 4.0 by Hudson and Manning (Stanford NLP).