View on Hugging Face: source dataset card and downloadable files for lance-format/hotpotqa-distractor-lance.
A Lance-formatted version of the HotpotQA distractor config: multi-hop reading-comprehension questions where each answer requires combining facts from two Wikipedia paragraphs, with 10 candidate paragraphs per question (2 gold + 8 distractors). The dataset ships with MiniLM question embeddings, flattened context text for full-text search, and pre-built ANN/FTS indices, available directly from the Hub at hf://datasets/lance-format/hotpotqa-distractor-lance/data.
Key features
- Multi-hop questions with gold supporting facts: each row carries the question, the canonical short answer, and the (title, sent_id) pointers into the paragraphs that justify it.
- Ten candidate paragraphs per question in the parallel context_titles / context_sentences columns, plus a flattened context_text field that feeds the FTS index.
- Pre-computed 384-dim question embeddings (question_emb, sentence-transformers/all-MiniLM-L6-v2, cosine-normalized) with a bundled IVF_PQ index for semantic question lookup.
- One columnar dataset: scan metadata cheaply, then read the heavy context text only for the rows you actually want.
Splits
| Split | Rows |
|---|---|
| train.lance | 90,447 |
| validation.lance | 7,405 |
Schema
| Column | Type | Notes |
|---|---|---|
| id | string | HotpotQA question id |
| question | string | The question |
| answer | string | Reference short answer (yes / no / span) |
| type | string? | bridge or comparison |
| level | string? | easy / medium / hard |
| supporting_titles | list<string> | Wikipedia titles that contain the gold facts |
| supporting_sent_ids | list<int32> | Sentence indices into those titles |
| context_titles | list<string> | All 10 paragraph titles (gold + distractors) |
| context_sentences | list<list<string>> | Sentences per paragraph |
| context_text | string | Flattened paragraphs; feeds the FTS index |
| num_supporting_facts | int32 | Number of gold supporting facts |
| question_emb | fixed_size_list<float32, 384> | MiniLM question embedding |
Pre-built indices
- IVF_PQ on question_emb: semantic question lookup (cosine)
- INVERTED (FTS) on question and context_text: keyword and hybrid search
- BTREE on id, answer: stable lookup by identifier
- BITMAP on type, level: cheap predicate evaluation for question class
Why Lance?
- Blazing Fast Random Access: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation.
- Native Multimodal Support: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search.
- Native Index Support: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them.
- Efficient Data Evolution: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time.
- Versatile Querying: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes.
- Data Versioning: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history.
Load with datasets.load_dataset
You can load Lance datasets through the standard Hugging Face datasets interface, which suits pipelines that already speak Dataset / IterableDataset or that only need a quick streaming sample.
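A minimal sketch of that streaming path, assuming the split names match the table above (train, validation). The helper name stream_sample is hypothetical; load_dataset and IterableDataset.take are the standard datasets calls.

```python
REPO_ID = "lance-format/hotpotqa-distractor-lance"


def stream_sample(split: str = "train", n: int = 3) -> list:
    """Return the first n rows of a split without downloading it in full."""
    # Imported lazily so this module loads even where `datasets` is absent.
    from datasets import load_dataset

    ds = load_dataset(REPO_ID, split=split, streaming=True)
    return list(ds.take(n))


# Usage (needs network access and the datasets package):
#   for row in stream_sample("validation", n=2):
#       print(row["question"], "->", row["answer"])
```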
Load with LanceDB
LanceDB is the embedded retrieval library built on top of the Lance format, and is the interface most users work with. Each .lance file in data/ is a table that you open by name (train, validation). The same handle is used by the Search, Curate, Evolve, Versioning, and Materialize-a-subset sections below.
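Opening a table by name might look like the sketch below. The helper open_split and its split check are illustrative additions; lancedb.connect and open_table are the library calls, and the hf:// URI is the one given at the top of this card.

```python
HUB_URI = "hf://datasets/lance-format/hotpotqa-distractor-lance/data"
SPLITS = ("train", "validation")


def open_split(split: str = "train", uri: str = HUB_URI):
    """Open one split as a LanceDB table; the table name is the split name."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    import lancedb  # lazy import: the check above needs no extra packages

    return lancedb.connect(uri).open_table(split)


# Usage (needs network access and the lancedb package):
#   tbl = open_split("train")
#   print(tbl.count_rows(), tbl.schema.names)
```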
Load with Lance
pylance is the Python binding for the Lance format and works directly with the format’s lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices.
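As a rough sketch of that kind of inspection, assuming pylance can resolve the hf:// URI in your environment. The helpers split_uri and describe are hypothetical names; lance.dataset, count_rows, get_fragments, and list_indices are pylance calls.

```python
HUB_URI = "hf://datasets/lance-format/hotpotqa-distractor-lance/data"


def split_uri(split: str, root: str = HUB_URI) -> str:
    """Build the path of one .lance dataset under the data/ directory."""
    return f"{root.rstrip('/')}/{split}.lance"


def describe(split: str = "train") -> dict:
    """Collect the dataset internals mentioned above into one dict."""
    import lance  # lazy import: provided by the pylance package

    ds = lance.dataset(split_uri(split))
    return {
        "rows": ds.count_rows(),
        "columns": ds.schema.names,
        "fragments": len(ds.get_fragments()),
        "indices": ds.list_indices(),
    }
```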
Tip: for production use, download locally first. Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy. After downloading, point Lance or LanceDB at ./hotpotqa-distractor-lance/data.
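One way to script the download from Python is huggingface_hub.snapshot_download, used here as a stand-in for the hf download CLI that this card mentions elsewhere. The helper names and the default local directory are illustrative.

```python
import os

REPO_ID = "lance-format/hotpotqa-distractor-lance"


def local_data_dir(local_dir: str = "./hotpotqa-distractor-lance") -> str:
    """Path that Lance or LanceDB should be pointed at after the download."""
    return os.path.join(local_dir, "data")


def download_local(local_dir: str = "./hotpotqa-distractor-lance") -> str:
    """Download the full dataset repo and return its data/ directory."""
    from huggingface_hub import snapshot_download  # lazy import

    snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=local_dir)
    return local_data_dir(local_dir)
```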
Search
The bundled IVF_PQ index on question_emb makes nearest-neighbour question lookup a single call. In production you would encode an incoming user question through the same 384-dim MiniLM encoder used at ingest and pass the resulting vector to tbl.search(...). For a runnable stand-in that avoids loading a model, reuse a stored embedding such as the one from row 42.
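A sketch of that lookup under a few assumptions: tbl is an already opened LanceDB table, the projected columns and the default knob values are illustrative, and offset is available on the query builder in your LanceDB version.

```python
RESULT_COLUMNS = ["id", "question", "answer", "type", "level"]


def seed_vector(tbl, row: int = 42) -> list:
    """Read one stored embedding to use as a stand-in query vector."""
    table = tbl.search().select(["question_emb"]).offset(row).limit(1).to_arrow()
    return table["question_emb"][0].as_py()


def similar_questions(tbl, query_vec, k: int = 5, nprobes: int = 20,
                      refine_factor=None) -> list:
    """Nearest-neighbour question lookup that returns only short columns."""
    query = (
        tbl.search(query_vec, vector_column_name="question_emb")
        .metric("cosine")
        .nprobes(nprobes)
    )
    if refine_factor is not None:
        # Optional re-ranking pass; trades latency for recall.
        query = query.refine_factor(refine_factor)
    return query.select(RESULT_COLUMNS).limit(k).to_list()


# Usage (needs network access and the lancedb package):
#   import lancedb
#   uri = "hf://datasets/lance-format/hotpotqa-distractor-lance/data"
#   tbl = lancedb.connect(uri).open_table("train")
#   hits = similar_questions(tbl, seed_vector(tbl, 42))
```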
When the result projection lists only the short metadata columns, question_emb is never read on the result side and the long context_text body is left untouched, which keeps the working set small even when the underlying scan touches every row of the train split.
Because the dataset also ships an INVERTED index on both question and context_text, the same query can be issued as a hybrid search that combines the dense vector with a keyword query against the full paragraph text. LanceDB merges the two result lists and reranks them in a single call, which is useful when a named entity must literally appear in one of the supporting paragraphs but the dense side still does most of the ranking.
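The hybrid form could be sketched as follows. It assumes tbl is an opened LanceDB table and that your LanceDB version exposes the hybrid builder through search(query_type="hybrid") with vector and text setters; the function name and projected columns are illustrative, and the reranker is whatever LanceDB applies by default.

```python
def hybrid_questions(tbl, query_vec, keywords: str, k: int = 5) -> list:
    """Combine the dense question vector with a keyword query on the context."""
    return (
        tbl.search(
            query_type="hybrid",
            vector_column_name="question_emb",  # dense side
            fts_columns="context_text",         # keyword side
        )
        .vector(query_vec)
        .text(keywords)
        .select(["id", "question", "answer"])
        .limit(k)
        .to_list()
    )


# Usage (query_vec must be a 384-dim vector from the same MiniLM encoder):
#   hits = hybrid_questions(tbl, query_vec, "some named entity")
```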
Tune metric, nprobes, and refine_factor on the vector side to trade recall against latency for your workload.
Curate
Building a focused evaluation slice usually means stacking predicates over the question metadata before any context text gets read. Lance evaluates the filter inside a single scan, so the candidate set comes back already filtered, and a bounded .limit(2000) keeps the output small enough to inspect. A typical slice is a set of hard, multi-hop comparison questions for which the gold answer is a real span rather than yes/no.
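A sketch of such a curation pass, assuming tbl is an opened LanceDB table. The helper names are hypothetical, and the backticks around level and type are a precaution in case the SQL parser treats those words as keywords.

```python
def build_filter(level: str, qtype: str, exclude_answers) -> str:
    """Stack metadata predicates into one SQL filter string."""
    excluded = ", ".join(f"'{a}'" for a in exclude_answers)
    return (
        f"`level` = '{level}' AND `type` = '{qtype}' "
        f"AND answer NOT IN ({excluded})"
    )


def curate(tbl, limit: int = 2000) -> list:
    """Hard comparison questions whose gold answer is a span, not yes/no."""
    predicate = build_filter("hard", "comparison", ["yes", "no"])
    return (
        tbl.search()
        .where(predicate)
        .select(["id", "question", "answer", "num_supporting_facts"])
        .limit(limit)
        .to_list()
    )
```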
Neither context_text nor context_sentences is read by this scan, so a 2000-row curation pass against the Hub moves only kilobytes of metadata.
Evolve
Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. For example, a question_length column and an is_multi_hop flag can be added this way, and either can then be used directly in where clauses without recomputing the predicate on every query.
Note: Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use hf download to pull the full corpus.
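The SQL-expression form might be sketched like this against a local copy. The exact expressions are assumptions made for illustration, in particular the definition of is_multi_hop as two or more supporting facts; add_columns is the LanceDB table call that takes a name-to-expression mapping.

```python
# Assumed definitions; the card does not spell out the expressions.
NEW_COLUMNS = {
    "question_length": "length(question)",
    "is_multi_hop": "num_supporting_facts >= 2",
}


def add_derived_columns(tbl, columns: dict = NEW_COLUMNS) -> list:
    """Append SQL-derived columns to a writable (local) table."""
    tbl.add_columns(dict(columns))
    return list(columns)


# Usage (needs a local copy, since the Hub mount is read-only):
#   import lancedb
#   tbl = lancedb.connect("./hotpotqa-distractor-lance/data").open_table("train")
#   add_derived_columns(tbl)
#   tbl.search().where("is_multi_hop AND question_length > 100").limit(5).to_list()
```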
Train
Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through Permutation.identity(tbl).select_columns([...]), which plugs straight into the standard torch.utils.data.DataLoader so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For a multi-hop QA model the natural projection is the question plus the flattened context and the gold answer; for a question-encoder retraining loop the precomputed embedding is enough on its own.
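A hedged sketch of that wiring follows. The import path lancedb.permutation is an assumption about where Permutation lives in your LanceDB version, and the helper name, batch size, and column lists are illustrative.

```python
TRAIN_COLUMNS = ["question", "context_text", "answer"]
EMBEDDING_COLUMNS = ["question_emb", "answer"]


def make_loader(tbl, columns=TRAIN_COLUMNS, batch_size: int = 32):
    """Wrap a projected view of the table in a standard PyTorch DataLoader."""
    # Lazy imports; the Permutation import path is an assumption.
    from lancedb.permutation import Permutation
    from torch.utils.data import DataLoader

    view = Permutation.identity(tbl).select_columns(list(columns))
    return DataLoader(view, batch_size=batch_size, shuffle=True)


# Usage (needs lancedb and torch; a local copy is faster for random access):
#   loader = make_loader(tbl)                      # QA model projection
#   loader = make_loader(tbl, EMBEDDING_COLUMNS)   # cached-embedding projection
```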
Passing ["question_emb", "answer"] to select_columns(...) on the next run reads only the 384-d vectors and the short answer string, which is the right shape for fine-tuning a retrieval head on cached embeddings. Columns added in Evolve cost nothing per batch until they are explicitly projected.
Versioning
Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes.
A retrieval system pinned to a tag such as hard-multihop-v1 keeps returning stable supporting facts while the dataset evolves in parallel; newly added retriever scores or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same questions and contexts, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking.
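Listing history and pinning a tag could look like the sketch below. It assumes tbl is an opened LanceDB table and that your LanceDB version exposes the tags API (tbl.tags.create) and accepts a tag name in checkout; the helper names are hypothetical.

```python
def show_history(tbl) -> list:
    """Return (version, timestamp) pairs; works on the read-only Hub copy."""
    return [(v["version"], v["timestamp"]) for v in tbl.list_versions()]


def pin(tbl, tag: str, version=None) -> int:
    """Create a named tag on a local copy; defaults to the current version."""
    target = tbl.version if version is None else version
    tbl.tags.create(tag, target)
    return target


def open_pinned(tbl, tag: str):
    """Point the table handle at the snapshot a tag resolves to."""
    tbl.checkout(tag)
    return tbl


# Usage on a local copy:
#   pin(tbl, "hard-multihop-v1")
#   frozen = open_pinned(tbl, "hard-multihop-v1")
```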
Materialize a subset
Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full corpus. The pattern is to stream a filtered query through .to_batches() into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory.
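The streaming pattern might be sketched as follows. It assumes to_batches() returns a pyarrow RecordBatchReader (so the schema can be taken from it) and that create_table accepts such a reader; the function name, the filter, and the column list in the usage note are illustrative.

```python
def materialize(src_tbl, dest_db, name: str, predicate: str, columns):
    """Stream a filtered, projected query into a new local table."""
    reader = (
        src_tbl.search()
        .where(predicate)
        .select(list(columns))
        .to_batches()  # assumed to yield batches lazily via a reader
    )
    # Batches are consumed one at a time while the table is written.
    return dest_db.create_table(name, data=reader, schema=reader.schema)


# Usage (needs network access and the lancedb package):
#   import lancedb
#   src = lancedb.connect(
#       "hf://datasets/lance-format/hotpotqa-distractor-lance/data"
#   ).open_table("train")
#   dest = lancedb.connect("./hotpotqa-hard-comparison")
#   materialize(src, dest, "train",
#               "`level` = 'hard' AND `type` = 'comparison'",
#               ["id", "question", "answer", "context_text", "question_emb"])
```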
A local database written this way, for example at ./hotpotqa-hard-comparison, is a first-class LanceDB database. Every snippet in the Search, Evolve, Train, and Versioning sections above works against it by swapping hf://datasets/lance-format/hotpotqa-distractor-lance/data for ./hotpotqa-hard-comparison.
Source & license
Converted from hotpot_qa (distractor config). HotpotQA is released under CC BY-SA 4.0.