Natural Questions Validation

View on Hugging Face

Source dataset card and downloadable files for lance-format/natural-questions-val-lance.

A Lance-formatted version of the Natural Questions validation split — 7,830 real Google search queries paired with the full Wikipedia article a human used to answer them, plus 1–5 annotator labels per question. MiniLM question embeddings are stored inline and the dataset ships with pre-built ANN/FTS indices, all available directly from the Hub at hf://datasets/lance-format/natural-questions-val-lance/data. Sourced from google-research-datasets/natural_questions.

The NQ train split is 143 GB (307,373 rows); it is intentionally not bundled here. Add it via natural_questions/dataprep.py --splits train once disk and bandwidth allow.

Key features

Real Google search queries with the full Wikipedia article that answers each one — document_html carries the inline UTF-8 HTML, so no sidecar files or external lookups are needed at query time.
Annotator answer summaries — short_answers aggregates and dedupes spans across all annotators, yes_no_answer carries the majority vote, and the has_short_answer / has_long_answer flags make annotation-coverage filters a single predicate.
Pre-computed 384-dim question embeddings (question_emb, sentence-transformers/all-MiniLM-L6-v2, normalized to unit length for cosine similarity) with a bundled IVF_PQ index for semantic question lookup.
One columnar dataset — scan question metadata cheaply, then read the heavy document_html only for the rows you actually want.

Splits

| Split | Rows |
|---|---|
| `validation.lance` | 7,830 |

Schema

| Column | Type | Notes |
|---|---|---|
| `id` | `string` | NQ example id |
| `question` | `string` | Original Google search query |
| `document_title` | `string` | Wikipedia article title |
| `document_url` | `string` | Wikipedia article URL |
| `document_html` | `large_binary` | Full HTML of the article (inline; UTF-8 bytes) |
| `short_answers` | `list<string>` | Deduped short-answer spans across all annotators |
| `num_short_answers` | `int32` | Total annotator spans (incl. duplicates) |
| `has_short_answer` | `bool` | At least one annotator provided a short-answer span |
| `has_long_answer` | `bool` | At least one annotator selected a long-answer candidate |
| `yes_no_answer` | `string` | `YES` / `NO` / `NONE`; majority vote across annotators |
| `question_emb` | `fixed_size_list<float32, 384>` | MiniLM question embedding |
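
The answer-summary columns can be pictured as a small aggregation over per-annotator labels. The sketch below is an illustration only, not the conversion script: the input shape (a list of dicts with `short_answers`, `yes_no_answer`, and `has_long_answer` keys), the function name `summarize_annotations`, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def summarize_annotations(annotations):
    """Collapse per-annotator labels into summary columns (hypothetical input shape)."""
    all_spans = [span for ann in annotations for span in ann["short_answers"]]
    # dict.fromkeys dedupes while preserving first-seen order.
    deduped = list(dict.fromkeys(all_spans))
    votes = Counter(ann["yes_no_answer"] for ann in annotations)
    # Assumption: ties resolve to whichever label was seen first.
    majority = votes.most_common(1)[0][0] if votes else "NONE"
    return {
        "short_answers": deduped,
        "num_short_answers": len(all_spans),
        "has_short_answer": bool(all_spans),
        "has_long_answer": any(ann["has_long_answer"] for ann in annotations),
        "yes_no_answer": majority,
    }


# Made-up annotator labels for a single question.
example = summarize_annotations([
    {"short_answers": ["1776"], "yes_no_answer": "NONE", "has_long_answer": True},
    {"short_answers": ["1776", "July 4, 1776"], "yes_no_answer": "NONE", "has_long_answer": True},
    {"short_answers": [], "yes_no_answer": "NONE", "has_long_answer": False},
])
print(example)
```

Here `num_short_answers` counts all three spans, while `short_answers` keeps only the two distinct ones, which mirrors the distinction the Notes column draws between the two fields.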

Pre-built indices

IVF_PQ on question_emb — semantic question lookup (cosine)
INVERTED (FTS) on question — keyword and hybrid search
BTREE on id, document_title — stable lookup by identifier
BITMAP on yes_no_answer, has_short_answer, has_long_answer — cheap predicate evaluation for annotation coverage

Why Lance?

Blazing Fast Random Access: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation.
Native Multimodal Support: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search.
Native Index Support: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them.
Efficient Data Evolution: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time.
Versatile Querying: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes.
Data Versioning: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history.

Load with `datasets.load_dataset`

Lance datasets load through the standard Hugging Face datasets interface, which suits pipelines that already speak Dataset / IterableDataset or that only need a quick streaming sample.

import datasets

hf_ds = datasets.load_dataset("lance-format/natural-questions-val-lance", split="validation", streaming=True)
for row in hf_ds.take(3):
    print(row["question"], "->", row["short_answers"])

Load with LanceDB

LanceDB is the embedded retrieval library built on top of the Lance format, and is the interface most users work with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")
print(len(tbl))

Load with Lance

pylance is the Python binding for the Lance format and works directly with the format’s lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices.

import lance

ds = lance.dataset("hf://datasets/lance-format/natural-questions-val-lance/data/validation.lance")
print(ds.count_rows(), ds.schema.names)
print(ds.list_indices())

Tip — for production use, download locally first. Streaming from the Hub works for exploration, but heavy random access, ANN search, and HTML decoding are far faster against a local copy:
hf download lance-format/natural-questions-val-lance --repo-type dataset --local-dir ./natural-questions-val-lance
Then point Lance or LanceDB at ./natural-questions-val-lance/data.

Search

The bundled IVF_PQ index on question_emb makes nearest-neighbour question lookup a single call. In production you would encode an incoming user query through the same 384-dim MiniLM encoder used at ingest and pass the resulting vector to tbl.search(...). The example below uses the embedding from row 42 as a runnable stand-in so the snippet works without loading a model.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")

seed = (
    tbl.search()
    .select(["question_emb", "question"])
    .limit(1)
    .offset(42)
    .to_list()[0]
)

hits = (
    tbl.search(seed["question_emb"], vector_column_name="question_emb")
    .metric("cosine")
    .where("has_short_answer = TRUE", prefilter=True)
    .select(["question", "short_answers", "document_title"])
    .limit(10)
    .to_list()
)
for r in hits:
    print(r["question"], "->", r["short_answers"])

The result set carries only the projected columns; the 384-d question_emb is never read on the result side, and the heavy document_html is left untouched, keeping the working set small even though each row carries a full Wikipedia article inline. Because the dataset also ships an INVERTED index on question, the same query can be issued as a hybrid search that combines the dense vector with a keyword query against the question text. LanceDB merges the two result lists and reranks them in a single call, which is useful when a named entity must literally appear in the query but the dense side still does most of the ranking.

hybrid_hits = (
    tbl.search(query_type="hybrid")
    .vector(seed["question_emb"])
    .text("declaration of independence")
    .select(["question", "short_answers", "document_title"])
    .limit(10)
    .to_list()
)
for r in hybrid_hits:
    print(r["question"])
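
Merging a dense result list with a keyword result list is often done with reciprocal rank fusion (RRF). The sketch below shows that technique on plain Python lists of ids. It illustrates one common fusion rule and makes no claim about which reranker LanceDB applies in a given version; the function name `rrf_merge`, the constant `k=60`, and the ids are assumptions chosen for the example.

```python
def rrf_merge(ranked_lists, k=60):
    """Fuse several ranked id lists with reciprocal rank fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); a higher total is better.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, best first; ties break on id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["q1", "q2", "q3"]   # hypothetical ids from the dense side
keyword_hits = ["q3", "q1", "q4"]  # hypothetical ids from the FTS side
fused = rrf_merge([vector_hits, keyword_hits])
print(fused)
```

Ids that appear in both lists (`q1`, `q3`) rise above ids found by only one side, which is the behaviour that makes hybrid search useful when a literal keyword match should reinforce a semantic match.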

Tune metric, nprobes, and refine_factor on the vector side to trade recall against latency for your workload.
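
A sketch of how a real text query could be encoded and searched with those knobs set explicitly. It assumes the `sentence-transformers` package is installed and the Hub is reachable; the helper names `l2_normalize` and `search_questions` and the default values `nprobes=20` and `refine_factor=5` are illustrative choices, not recommended settings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so cosine similarity reduces to a dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search_questions(query_text, k=10, nprobes=20, refine_factor=5):
    """Encode a query with MiniLM and run a tuned ANN search (needs network access)."""
    # Imports are local so l2_normalize can be used without these packages.
    import lancedb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    query_vec = l2_normalize(model.encode(query_text).tolist())

    db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
    tbl = db.open_table("validation")
    return (
        tbl.search(query_vec, vector_column_name="question_emb")
        .metric("cosine")
        .nprobes(nprobes)              # more partitions probed: higher recall, slower
        .refine_factor(refine_factor)  # re-rank k * refine_factor candidates exactly
        .select(["question", "short_answers", "document_title"])
        .limit(k)
        .to_list()
    )


print(l2_normalize([3.0, 4.0]))
```

Calling `search_questions("when was the declaration of independence signed")` would return up to ten rows as dictionaries. Raising `nprobes` or `refine_factor` generally improves recall at the cost of latency, so the values are worth measuring against your own workload.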

Curate

A typical curation pass over NQ starts with annotation-coverage filters before any HTML gets read. Lance evaluates the filter inside a single scan, so the candidate set comes back already filtered, and the bounded .limit(500) keeps the output small enough to inspect. The example below assembles a set of factoid questions with at least one short-answer span, a non-yes/no resolution, and a question at least 30 characters long.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")

candidates = (
    tbl.search()
    .where(
        "has_short_answer = TRUE "
        "AND yes_no_answer = 'NONE' "
        "AND array_length(short_answers) >= 1 "
        "AND length(question) >= 30",
        prefilter=True,
    )
    .select(["id", "question", "short_answers", "document_title", "document_url"])
    .limit(500)
    .to_list()
)
print(f"{len(candidates)} candidates; first: {candidates[0]['question']}")

The result is a plain list of dictionaries, ready to inspect, persist as a manifest of NQ example ids, or feed into the Evolve and Train workflows below. The scan never reads the large document_html column, so a 500-row curation pass against the Hub transfers only the small projected metadata columns even though each row holds an entire Wikipedia article.
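
Persisting the candidates as a manifest can be as simple as one JSON object per line. The sketch below is a minimal version using only the standard library; the helper names, the file name `factoid_manifest.jsonl`, and the sample rows are made up for the example, and the rows stand in for the list returned by the curation query.

```python
import json
import tempfile
from pathlib import Path


def write_manifest(rows, path):
    """Write one JSON object per line, keeping only the id and question."""
    with open(path, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps({"id": row["id"], "question": row["question"]}) + "\n")


def read_manifest_ids(path):
    """Return the example ids recorded in a manifest, in file order."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line)["id"] for line in fh if line.strip()]


# Stand-in rows with the same keys the curation query projects (made-up values).
sample = [
    {"id": "a1", "question": "when was the declaration of independence signed"},
    {"id": "b2", "question": "who wrote the declaration of independence"},
]
manifest_path = Path(tempfile.mkdtemp()) / "factoid_manifest.jsonl"
write_manifest(sample, manifest_path)
print(read_manifest_ids(manifest_path))
```

A manifest like this records which examples a curation pass selected without copying any article HTML, so it stays small and is easy to diff between runs.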

Evolve

Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds a question_length column, a first_short_answer_length derived from the deduped span list, and an is_factoid flag that combines the annotation flags, any of which can then be used directly in where clauses without recomputing the predicate on every query.

Note: Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use hf download to pull the full corpus.

import lancedb

db = lancedb.connect("./natural-questions-val-lance/data")  # local copy required for writes
tbl = db.open_table("validation")

tbl.add_columns({
    "question_length": "length(question)",
    # SQL list indexing is 1-based, so [1] is the first deduped span
    "first_short_answer_length": "length(short_answers[1])",
    "is_factoid": "has_short_answer = TRUE AND yes_no_answer = 'NONE'",
})

If the values you want to attach already live in another table (offline retriever scores, generated-answer judgments, alternate embeddings from a stronger model), merge them in by joining on id:

import pyarrow as pa

retriever_scores = pa.table({
    "id": pa.array(["797803103333068850", "5225754983651766092"]),
    "bm25_top1_score": pa.array([14.2, 8.7]),
})
# Left join on id: rows without a match get NULL in the new column
tbl.merge(retriever_scores, left_on="id")

The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., extracting the long-answer paragraph from document_html), Lance provides a batch-UDF API — see the Lance data evolution docs.
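
As an illustration of the kind of per-row Python computation such a UDF might wrap, the sketch below decodes document_html and pulls out the first non-empty paragraph using only the standard library. It is not the batch-UDF API itself and it does not recover the annotated long answer, which would need annotation offsets that this schema does not carry; the class and function names and the sample HTML are assumptions.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            # Skip empty paragraphs and keep looking.
            if "".join(self.parts).strip():
                self.done = True
            else:
                self.parts = []

    def handle_data(self, data):
        if self.in_p and not self.done:
            self.parts.append(data)


def first_paragraph(document_html: bytes) -> str:
    """Decode UTF-8 HTML bytes and return the first paragraph as plain text."""
    parser = FirstParagraph()
    parser.feed(document_html.decode("utf-8", errors="replace"))
    # Collapse runs of whitespace left over from the markup.
    return " ".join("".join(parser.parts).split())


html = b"<html><body><p> </p><p>The <b>Declaration</b> was adopted in 1776.</p><p>More.</p></body></html>"
print(first_paragraph(html))
```

A function of this shape takes one value from the document_html column and returns one string, which is the unit of work a batch UDF would apply across each batch of rows.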

Train

Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through Permutation.identity(tbl).select_columns([...]), which plugs straight into the standard torch.utils.data.DataLoader so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For an open-domain QA reader the natural projection is the question plus the full document HTML and the answer spans; for a question-encoder retraining loop the precomputed embedding is enough on its own, and skipping document_html keeps each batch small.

import lancedb
from lancedb.permutation import Permutation
from torch.utils.data import DataLoader

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")

train_ds = Permutation.identity(tbl).select_columns(["question", "document_html", "short_answers"])
loader = DataLoader(train_ds, batch_size=4, shuffle=True, num_workers=2)

for batch in loader:
    # batch carries only the projected columns; tokenize, forward, backward...
    ...

Switching feature sets is a configuration change: passing ["question_emb", "short_answers"] to select_columns(...) on the next run reads only the 384-d vectors and the answer spans, which is the right shape for fine-tuning a retrieval head on cached embeddings without paying for the heavy document_html on every row. Columns added in Evolve cost nothing per batch until they are explicitly projected.
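
Because document_html is stored as bytes and short_answers varies in length from row to row, a training loop may want its own batching function rather than relying on default collation. The sketch below is one possible shape, assuming each item arrives as a dict keyed by column name; that item format, the function name `collate_qa`, and the sample rows are assumptions, not something this card specifies.

```python
def collate_qa(items):
    """Turn a list of row dicts into a dict of lists, decoding the HTML bytes."""
    return {
        "question": [item["question"] for item in items],
        # document_html is stored as UTF-8 bytes; decode once per batch.
        "document_html": [
            item["document_html"].decode("utf-8", errors="replace") for item in items
        ],
        # Answer lists vary in length, so they stay as plain Python lists.
        "short_answers": [list(item["short_answers"]) for item in items],
    }


# Made-up rows with the three projected columns.
batch = collate_qa([
    {"question": "q one", "document_html": b"<p>a</p>", "short_answers": ["x"]},
    {"question": "q two", "document_html": b"<p>b</p>", "short_answers": []},
])
print(batch["short_answers"])
```

Passing it as `collate_fn=collate_qa` to the DataLoader is one way to wire it in, provided the dataset yields one dict per row; tokenization and padding would then happen inside the training step.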

Versioning

Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")

print("Current version:", tbl.version)
print("History:", tbl.list_versions())
print("Tags:", tbl.tags.list())

Once you have a local copy, tag a version for reproducibility:

local_db = lancedb.connect("./natural-questions-val-lance/data")
local_tbl = local_db.open_table("validation")
local_tbl.tags.create("factoid-v1", local_tbl.version)

A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one:

tbl_v1 = db.open_table("validation", version="factoid-v1")
tbl_v5 = db.open_table("validation", version=5)

Pinning supports two workflows. A QA system locked to factoid-v1 keeps returning stable answer spans while the dataset evolves in parallel — newly added retriever scores or labels do not change what the tag resolves to. An evaluation experiment pinned to the same tag can be rerun later against the exact same questions and articles, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking.

Materialize a subset

Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full corpus. The pattern is to stream a filtered query through .to_batches() into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory.

import lancedb

remote_db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
remote_tbl = remote_db.open_table("validation")

batches = (
    remote_tbl.search()
    .where(
        "has_short_answer = TRUE "
        "AND yes_no_answer = 'NONE' "
        "AND array_length(short_answers) >= 1"
    )
    .select(["id", "question", "document_title", "document_url", "short_answers", "question_emb"])
    .to_batches()
)

local_db = lancedb.connect("./nq-factoid")
local_db.create_table("validation", batches)

The resulting ./nq-factoid is a first-class LanceDB database. Every snippet in the Search, Evolve, Train, and Versioning sections above works against it by swapping hf://datasets/lance-format/natural-questions-val-lance/data for ./nq-factoid. Note that this projection deliberately omits document_html; include it in the .select(...) list when the downstream task needs the article body.

Source & license

Converted from google-research-datasets/natural_questions. NQ is released under CC BY-SA 3.0 (matching the Wikipedia source).

Citation

@article{kwiatkowski2019natural,
  title={Natural Questions: A Benchmark for Question Answering Research},
  author={Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav},
  journal={Transactions of the Association for Computational Linguistics},
  year={2019}
}
