Datasets - LanceDB

The lance-format organization on Hugging Face publishes a growing catalog of multimodal datasets in Lance format. Each dataset stores the raw data (images, audio, video, or text) and pre-computed embeddings as first-class columns, alongside on-disk vector and full-text indices in the same dataset, so vector search, full-text search, and filtered scans work directly via hf:// URIs without downloading the dataset first. This is powered by the Lance format’s native Hugging Face integration (via the pylance library). LanceDB sits on top of Lance and provides a table-style interface for querying these datasets straight from the Hub:

import lancedb

db = lancedb.connect("hf://datasets/lance-format/<dataset-name>/data")
tbl = db.open_table("train")

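# Vector search, full-text search, or filtered scans, directly on the Hub.
# `query` is a placeholder: pass an embedding vector for vector search
# or a string for full-text search.
results = tbl.search(query).limit(10).to_list()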
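
The <dataset-name> placeholder in the connection string can be filled in from any card on this page. As a minimal sketch, the helper below builds that URI for a given dataset. The names hub_uri and open_hub_table are hypothetical and introduced here only for illustration. The "data" subdirectory and the "train" table come from the template above and are assumed, not verified, to apply to every dataset, so check the individual card before relying on them.

```python
def hub_uri(dataset_name: str, org: str = "lance-format", subdir: str = "data") -> str:
    """Build the hf:// URI passed to lancedb.connect() in the snippet above.

    Assumption: every dataset follows the hf://datasets/<org>/<name>/<subdir>
    layout shown in the template; individual cards may differ.
    """
    if not dataset_name or "/" in dataset_name:
        raise ValueError("dataset_name must be a bare repo name, e.g. 'mnist-lance'")
    return f"hf://datasets/{org}/{dataset_name}/{subdir}"


def open_hub_table(dataset_name: str, table: str = "train"):
    """Open a table straight from the Hub (hypothetical convenience wrapper).

    Requires the lancedb package and network access; "train" is assumed
    from the template above.
    """
    import lancedb  # imported lazily so hub_uri() works without lancedb installed

    db = lancedb.connect(hub_uri(dataset_name))
    return db.open_table(table)


if __name__ == "__main__":
    # Pure string construction; no network call is made here.
    print(hub_uri("mnist-lance"))
```

Calling open_hub_table("mnist-lance") would then return a table on which the search call shown above can be issued, provided the assumed layout holds for that dataset.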

Click any card below for usage examples, schema, and pre-built indices. For a complete walkthrough of the integration itself, see the Hugging Face Hub integration page.

Image Classification

MNIST

lance-format/mnist-lance — A Lance-formatted version of the classic MNIST handwritten-digit dataset covering 70,000 28×28 grayscale digits across ten balanced classes. Each row carries inline PNG bytes, the digit label, the human-readable class name, and a cosine-normalized…

CIFAR-10

lance-format/cifar10-lance — A Lance-formatted version of CIFAR-10 covering 60,000 32×32 RGB images across ten balanced object classes. Each row carries inline PNG bytes, the integer label, the human-readable class name, and a cosine-normalized CLIP image embedding, all backed…

Fashion-MNIST

lance-format/fashion-mnist-lance — A Lance-formatted version of Fashion-MNIST covering 70,000 28×28 grayscale clothing images across ten balanced apparel classes. Each row carries inline PNG bytes, the integer label, the human-readable class name, and a cosine-normalized CLIP image…

Food-101

lance-format/food101-lance — A Lance-formatted version of Food-101, the fine-grained dish-classification benchmark of 101,000 photos spread evenly across 101 dish classes, sourced from ethz/food101. Each row carries the inline JPEG bytes, the integer label, the human-readable…

Oxford-IIIT Pet

lance-format/oxford-pets-lance — A Lance-formatted version of the Oxford-IIIT Pet dataset — 7,390 cat and dog photos across 37 breeds — sourced from pcuenq/oxford-pets. Each row carries the inline JPEG bytes, the breed name, a species flag distinguishing cats from dogs, and a…

Stanford Cars

lance-format/stanford-cars-lance — A Lance-formatted version of the Stanford Cars fine-grained benchmark — 8,144 photographs across 196 make/model/year classes — sourced from Multimodal-Fatima/StanfordCars_train. Each row carries the inline JPEG bytes, the integer class id, a…

ImageNet-1k Validation

lance-format/imagenet-1k-val-lance — A Lance-formatted version of the canonical 50,000-image ImageNet-1k (ILSVRC2012) validation split, sourced from benjamin-paine/imagenet-1k. Each row is one image with its integer class id, a string class name, and a cosine-normalized OpenCLIP image…

EuroSAT

lance-format/eurosat-lance — A Lance-formatted version of EuroSAT, the canonical Sentinel-2 RGB land-cover benchmark, sourced from blanchon/EuroSAT_RGB. Each row is a single 64×64 RGB tile with its integer class id, the human-readable class name, and a cosine-normalized…
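
Several cards above describe their embeddings as cosine-normalized. Assuming this means each vector is scaled to unit L2 length (an interpretation, not something the cards spell out), the practical consequence is that a plain dot product between two stored vectors equals their cosine similarity. The sketch below illustrates this with toy two-dimensional vectors. The function names are hypothetical and the numbers are not taken from any dataset.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (the assumed meaning of 'cosine-normalized')."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for embeddings.
a = [3.0, 4.0]
b = [4.0, 3.0]
unit_a, unit_b = l2_normalize(a), l2_normalize(b)

# After normalization, the dot product and cosine similarity agree.
assert math.isclose(dot(unit_a, unit_b), cosine_similarity(a, b))
```

Under that assumption, a query vector should be normalized the same way before searching, so that distances computed against the stored embeddings remain comparable.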

Object Detection & Segmentation

COCO 2017 Detection

lance-format/coco-detection-2017-lance — A Lance-formatted version of the COCO 2017 object detection benchmark, sourced from detection-datasets/coco. Each row is one image with its inline JPEG bytes, the full per-image list of bounding boxes, COCO 80-class category ids and names…

Pascal VOC 2012 Segmentation

lance-format/pascal-voc-2012-segmentation-lance — A Lance-formatted version of the Pascal VOC 2012 semantic segmentation split, sourced from nateraw/pascal-voc-2012. Each row pairs an inline JPEG image with the per-pixel PNG segmentation mask and a cosine-normalized OpenCLIP ViT-B-32 image…

ADE20K

lance-format/ade20k-lance — A Lance-formatted version of the full ADE20K scene parsing benchmark, sourced from 1aurent/ADE20K. Each row is one scene image with its inline JPEG bytes, a per-pixel semantic segmentation map encoded as PNG bytes, an optional instance map, scene…

KITTI 2D Detection

lance-format/kitti-2d-detection-lance — A Lance-formatted version of the KITTI 2D Object Detection benchmark, sourced from nateraw/kitti so no manual signup or download from cvlibs.net is required. Each row is a single driving frame with inline JPEG bytes, the full set of 2D and 3D…

Image Retrieval

COCO Captions 2017

lance-format/coco-captions-2017-lance — A Lance-formatted version of the COCO Captions 2017 corpus, redistributed via lmms-lab/COCO-Caption2017. Each row is one image with 5–7 human-written captions, a cosine-normalized CLIP image embedding, and a cosine-normalized CLIP text embedding of…

Flickr30k

lance-format/flickr30k-lance — A Lance-formatted version of Flickr30k, redistributed via lmms-lab/flickr30k. Each row is one image with 5 human-written captions, a cosine-normalized CLIP image embedding, and a cosine-normalized CLIP text embedding of the canonical caption — all…

LAION-1M

lance-format/laion-1m — A Lance-formatted slice of the LAION image-text corpus (~1M rows) with inline JPEG bytes, CLIP image embeddings (img_emb), full metadata, and a pre-built ANN index — all available directly from the Hub at…

Visual Question Answering

ChartQA

lance-format/chartqa-lance — A Lance-formatted version of ChartQA, a benchmark for question answering over scientific and business charts that demands a mix of logical and visual reasoning, redistributed via lmms-lab/ChartQA. Each row carries the chart image as inline JPEG…

DocVQA

lance-format/docvqa-lance — A Lance-formatted version of DocVQA, a benchmark for visual question answering over document images such as industry and government scans, multi-page reports, forms, and receipts, redistributed via lmms-lab/DocVQA (DocVQA config). Each row carries…

TextVQA

lance-format/textvqa-lance — A Lance-formatted version of TextVQA — visual question answering where the question requires reading text in the image (street signs, product labels, screen captures) — sourced from lmms-lab/textvqa. Each row carries the image bytes, the question…

VQAv2

lance-format/vqav2-lance — A Lance-formatted version of VQAv2 — open-ended visual question answering on COCO images — sourced from lmms-lab/VQAv2. Each row is one (image, question, 10 annotator answers) triple with paired CLIP image and question embeddings drawn from the…

GQA testdev-balanced

lance-format/gqa-testdev-balanced-lance — A Lance-formatted version of the canonical GQA testdev_balanced slice — 12,578 compositional VQA questions joined against the matching 398 images — sourced from lmms-lab/GQA. The original redistribution ships instructions and images as separate…

Text QA

SQuAD v2

lance-format/squad-v2-lance — A Lance-formatted version of SQuAD v2 — the Stanford Question Answering Dataset with both answerable and deliberately unanswerable questions over Wikipedia passages — with MiniLM question embeddings stored inline and ready for retrieval at…

TriviaQA

lance-format/trivia-qa-lance — A Lance-formatted version of TriviaQA (rc.nocontext config) — a large reading-comprehension dataset of trivia questions paired with a canonical answer, accepted aliases, and entity-type metadata — with MiniLM question embeddings stored inline and…

HotpotQA distractor

lance-format/hotpotqa-distractor-lance — A Lance-formatted version of HotpotQA using the distractor config — multi-hop reading-comprehension questions where each answer requires combining facts from two Wikipedia paragraphs, with 10 candidate paragraphs per question (gold + 8…

Natural Questions Validation

lance-format/natural-questions-val-lance — A Lance-formatted version of the Natural Questions validation split — 7,830 real Google search queries paired with the full Wikipedia article a human used to answer them, plus 1–5 annotator labels per question. MiniLM question embeddings are stored…

MS MARCO v2.1

lance-format/ms-marco-v2.1-lance — A Lance-formatted version of MS MARCO v2.1 — Microsoft’s machine-reading-comprehension benchmark built from anonymized Bing query logs. Each row is one user query, the up-to-10 candidate passages Bing retrieved for it with relevance flags, and the…

Text Corpora

FineWeb-Edu

lance-format/fineweb-edu — A Lance-formatted version of FineWeb-Edu — over 1.5 billion educational web passages with cleaned text, source metadata, language detection signals, and 384-dim text embeddings — available directly from the Hub at…

Speech

LibriSpeech clean

lance-format/librispeech-clean-lance — A Lance-formatted version of the LibriSpeech ASR clean configuration, sourced from openslr/librispeech_asr. Each row is one utterance with inline FLAC audio bytes, the reference transcript, a sentence-transformers embedding of that transcript, and…

Video

OpenVid-1M

lance-format/openvid-lance — A Lance-formatted version of the OpenVid-1M corpus — 937,957 high-quality clips with inline MP4 bytes, 1024-dim video embeddings, captions, and rich per-clip quality signals — available directly from the Hub at…

Robotics

LeRobot PushT

lance-format/lerobot-pusht-lance — A Lance-formatted version of lerobot/pusht — the canonical PushT benchmark from the Diffusion Policy paper — packaged using the same three-table layout as lance-format/lerobot-xvla-soft-fold so consumers can flip between datasets without changing…

LeRobot X-VLA Soft-Fold

lance-format/lerobot-xvla-soft-fold — A Lance-formatted version of lerobot/xvla-soft-fold — a multi-camera robotics dataset from the X-VLA project — packaged as three Lance tables for efficient frame-level training, episode-level trajectory loading, and direct access to the original…

Share your own dataset

Got a multimodal dataset you want to publish? Convert it to Lance and push it to the Hub! Anyone who opens it gets vector search, full-text search, and filtered scans on the data out of the box, without recreating the embeddings or indexes on their end.

Upload Lance datasets to the Hugging Face Hub

A step-by-step walkthrough on the LanceDB blog covering CLI setup, packaging your dataset, pushing to your namespace, and writing a dataset card.

Or browse the latest trending Lance datasets on Hugging Face.
