COCO 2017 Detection

View on Hugging Face

Source dataset card and downloadable files for lance-format/coco-detection-2017-lance.

A Lance-formatted version of the COCO 2017 object detection benchmark, sourced from detection-datasets/coco. Each row is one image with its inline JPEG bytes, the full per-image list of bounding boxes, COCO 80-class category ids and names, per-object areas, an OpenCLIP image embedding, and pre-built indices — all available directly from the Hub at hf://datasets/lance-format/coco-detection-2017-lance/data.

Key features

Inline JPEG bytes in the image column — no sidecar files, no image folders.
Per-object annotations as parallel list columns — bboxes, categories, category_names, and areas are aligned position-for-position, so iterating boxes alongside their labels is a single row read.
Pre-aggregated annotation summaries — num_objects (int) and categories_present (deduped string list) precompute the predicates curation queries hit most.
CLIP image embeddings (image_emb, OpenCLIP ViT-B/32, 512-d, cosine-normalized) with a bundled IVF_PQ index for visual retrieval.

Splits

| Split | Rows |
| --- | --- |
| `train.lance` | 117,000+ |
| `val.lance` | 4,950+ |

Total annotated boxes: ~860k train / ~37k val.
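
As a rough worked example of what these counts imply for scene complexity, the snippet below divides the approximate box totals by the approximate row counts. Both inputs are the rounded figures quoted above, so the result is an estimate rather than an exact statistic.

```python
# Approximate counts copied from the Splits section; both are rounded figures.
splits = {
    "train": {"images": 117_000, "boxes": 860_000},
    "val": {"images": 4_950, "boxes": 37_000},
}

# Average number of annotated boxes per image in each split.
boxes_per_image = {
    name: counts["boxes"] / counts["images"] for name, counts in splits.items()
}
for name, avg in boxes_per_image.items():
    print(f"{name}: about {avg:.1f} boxes per image")
```

An average in this range is one reason a filter such as num_objects BETWEEN 3 AND 12 still keeps a large share of typical scenes.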

Schema

| Column | Type | Notes |
| --- | --- | --- |
| `id` | `int64` | Row index within split |
| `image` | `large_binary` | Inline JPEG bytes |
| `image_id` | `int64` | COCO image id (natural join key) |
| `width`, `height` | `int32` | Image dimensions in pixels |
| `bboxes` | `list<list<float32, 4>>` | Each box is `[x_min, y_min, x_max, y_max]` in absolute pixel coordinates |
| `categories` | `list<int32>` | COCO 80-class id (0–79), aligned with `bboxes` |
| `category_names` | `list<string>` | Human-readable class name per object (e.g. `person`, `dog`) |
| `areas` | `list<float32>` | Bounding-box area in pixels², aligned with `bboxes` |
| `num_objects` | `int32` | Number of annotated objects in the image |
| `categories_present` | `list<string>` | Deduped class names; feeds the `LABEL_LIST` index |
| `image_emb` | `fixed_size_list<float32, 512>` | OpenCLIP ViT-B/32 image embedding (cosine-normalized) |
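
A minimal sketch of how the parallel list columns line up: the snippet walks one row shaped like the schema above and converts each box from `[x_min, y_min, x_max, y_max]` to COCO's native `[x, y, width, height]` layout. The row is synthetic (made-up values, not read from the dataset), and `xyxy_to_xywh` and `iter_objects` are hypothetical helper names.

```python
def xyxy_to_xywh(box):
    """Convert [x_min, y_min, x_max, y_max] to [x, y, width, height]."""
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]


def iter_objects(row):
    """Yield one dict per annotated object by zipping the aligned list columns."""
    columns = (row["bboxes"], row["categories"], row["category_names"], row["areas"])
    # The card describes these lists as aligned position-for-position; fail loudly if not.
    if len({len(col) for col in columns}) != 1:
        raise ValueError("annotation columns have different lengths")
    for box, cat_id, name, area in zip(*columns):
        yield {"name": name, "category": cat_id, "xywh": xyxy_to_xywh(box), "area": area}


# Synthetic row with illustrative values only.
row = {
    "bboxes": [[10.0, 20.0, 110.0, 220.0], [50.0, 60.0, 90.0, 100.0]],
    "categories": [0, 16],
    "category_names": ["person", "dog"],
    "areas": [20000.0, 1600.0],
}

objects = list(iter_objects(row))
for obj in objects:
    print(obj)
```

The length check is cheap insurance: if a downstream transform ever filters one list without the others, the mismatch surfaces at read time instead of silently pairing boxes with the wrong labels.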

Pre-built indices

IVF_PQ on image_emb — vector similarity search (cosine)
BTREE on image_id — fast lookup by COCO image id
BTREE on num_objects — range filters on image complexity
LABEL_LIST on categories_present — supports array_has_any / array_has_all for class-presence filtering
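
To make the class-presence predicates concrete, here is a plain-Python sketch of the set semantics usually associated with array_has_any and array_has_all, plus one way the two summary columns could be derived from per-object names. It is an illustration, not Lance's implementation: the function names mirror the SQL functions only for readability, `summarize` is a hypothetical helper, and the element order of the real categories_present column is not specified by this card.

```python
def array_has_any(values, wanted):
    """True if at least one wanted item appears in values."""
    return not set(wanted).isdisjoint(values)


def array_has_all(values, wanted):
    """True if every wanted item appears in values."""
    return set(wanted).issubset(values)


def summarize(category_names):
    """Derive num_objects and a deduped class list (first-seen order) from per-object names."""
    return {
        "num_objects": len(category_names),
        "categories_present": list(dict.fromkeys(category_names)),
    }


# Synthetic per-object labels for three images; values are illustrative only.
rows = [
    summarize(["person", "frisbee", "person"]),
    summarize(["dog", "person"]),
    summarize(["cat"]),
]

both = [r for r in rows if array_has_all(r["categories_present"], ["person", "frisbee"])]
either = [r for r in rows if array_has_any(r["categories_present"], ["dog", "cat"])]
print(len(both), len(either))
```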

Why Lance?

Blazing Fast Random Access: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation.
Native Multimodal Support: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search.
Native Index Support: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them.
Efficient Data Evolution: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time.
Versatile Querying: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes.
Data Versioning: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history.

Load with `datasets.load_dataset`

You can load Lance datasets through the standard Hugging Face datasets interface. This suits pipelines that already speak Dataset / IterableDataset, or cases where you want a quick streaming sample without installing anything Lance-specific.

import datasets

hf_ds = datasets.load_dataset("lance-format/coco-detection-2017-lance", split="val", streaming=True)
for row in hf_ds.take(3):
    print(row["image_id"], row["num_objects"], row["categories_present"][:5])

Load with LanceDB

LanceDB is the embedded retrieval library built on top of the Lance format (docs), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/coco-detection-2017-lance/data")
tbl = db.open_table("val")
print(len(tbl))

Load with Lance

pylance is the Python binding for the Lance format and works directly with the format’s lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, the list of pre-built indices.

import lance

ds = lance.dataset("hf://datasets/lance-format/coco-detection-2017-lance/data/val.lance")
print(ds.count_rows(), ds.schema.names)
print(ds.list_indices())

Tip — for production use, download locally first. Streaming from the Hub works for exploration, but heavy random access, ANN search, and any mutation are far faster against a local copy:
hf download lance-format/coco-detection-2017-lance --repo-type dataset --local-dir ./coco-detection-2017-lance
Then point Lance or LanceDB at ./coco-detection-2017-lance/data.
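
One way to keep scripts working both before and after the download is to resolve the location once and fall back to the Hub when no local copy exists. The sketch below assumes the directory layout produced by the hf download command above; `resolve_data_uri` is a hypothetical helper, not part of Lance or LanceDB.

```python
import tempfile
from pathlib import Path

HUB_URI = "hf://datasets/lance-format/coco-detection-2017-lance/data"


def resolve_data_uri(local_root):
    """Return the local data directory if it exists, otherwise the Hub URI."""
    local_data = Path(local_root) / "data"
    return str(local_data) if local_data.is_dir() else HUB_URI


# Demonstrate both branches with a throwaway directory instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    missing = resolve_data_uri(Path(tmp) / "not-downloaded")
    (Path(tmp) / "coco-detection-2017-lance" / "data").mkdir(parents=True)
    present = resolve_data_uri(Path(tmp) / "coco-detection-2017-lance")

print(missing)
print(present)
```

The returned string can be passed to lancedb.connect(...) in place of a hard-coded path.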

Search

The bundled IVF_PQ index on image_emb makes approximate-nearest-neighbor visual retrieval a single call. In production you would encode a query image through the same OpenCLIP ViT-B/32 model used at ingest and pass the resulting 512-d vector to tbl.search(...). The example below uses the embedding stored on row 42 as a runnable stand-in, so the snippet works without loading any model.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/coco-detection-2017-lance/data")
tbl = db.open_table("val")

seed = (
    tbl.search()
    .select(["image_emb", "image_id", "categories_present"])
    .limit(1)
    .offset(42)
    .to_list()[0]
)

hits = (
    tbl.search(seed["image_emb"])
    .metric("cosine")
    .select(["image_id", "categories_present", "num_objects"])
    .limit(10)
    .to_list()
)
print("query categories:", seed["categories_present"])
for r in hits:
    print(f"  image_id={r['image_id']:>10}  n={r['num_objects']:>3}  cats={r['categories_present'][:5]}")

Because the seed vector is itself stored in the table, the first hit will typically be the source image at a cosine distance near zero, which makes a useful sanity check. Tune nprobes and refine_factor to trade recall against latency for your workload.
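
When judging whether an approximate index returns sensible neighbours, it can help to compare against exact search on a small sample. The sketch below is a brute-force baseline in plain Python over random unit vectors. The 16-d vectors stand in for the 512-d image_emb column, and `l2_normalize` and `top_k` are hypothetical helpers. For unit-length vectors the dot product equals cosine similarity, which is why the seed vector ranks itself first.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, vectors, k):
    """Exact nearest neighbours by dot product (cosine similarity for unit vectors)."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:k]]


rng = random.Random(0)
# 200 random 16-d unit vectors stand in for stored image embeddings.
vectors = [l2_normalize([rng.gauss(0, 1) for _ in range(16)]) for _ in range(200)]

# Use a stored vector as the query, as in the row-42 example above.
hits = top_k(vectors[42], vectors, k=5)
for idx, score in hits:
    print(f"row={idx:>3}  cosine={score:.4f}")
```

Comparing the ids returned by this exact baseline with those from the indexed search on the same sample gives a quick recall estimate before tuning index parameters.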

Curate

Curation for a detection workflow usually means picking images that contain a specific class combination, possibly bounded by scene complexity. The LABEL_LIST index on categories_present makes class-presence predicates trivial, and Lance evaluates them inside the same scan as range filters on num_objects or width/height. The bounded .limit(500) keeps the result small and inspectable, and the image column is left out of the projection so the candidate scan is dominated by annotation metadata, not JPEG bytes.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/coco-detection-2017-lance/data")
tbl = db.open_table("val")

candidates = (
    tbl.search()
    .where(
        "array_has_all(categories_present, ['person', 'frisbee']) "
        "AND num_objects BETWEEN 3 AND 12",
        prefilter=True,
    )
    .select(["image_id", "categories_present", "num_objects", "width", "height"])
    .limit(500)
    .to_list()
)
print(f"{len(candidates)} candidates")
if candidates:  # the filter may match nothing, so guard the index access
    print("first image_id:", candidates[0]["image_id"])

The result is a plain list of dictionaries, ready to inspect, persist as a manifest of image_ids, or feed into the Evolve and Train workflows below. Swapping array_has_all for array_has_any widens recall to images containing any of the listed classes; replacing the structural predicate with num_objects >= 10 selects busy scenes for crowd-detection ablations.
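
Persisting the selection as a manifest can be as simple as writing the image ids to JSON. The sketch below uses only the standard library. The candidate rows are synthetic stand-ins for the query result, and the file name and the `write_manifest` / `read_manifest` helpers are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_manifest(candidates, path):
    """Write the sorted, deduplicated image ids of the candidate rows to a JSON file."""
    image_ids = sorted({int(row["image_id"]) for row in candidates})
    Path(path).write_text(json.dumps({"image_ids": image_ids}, indent=2))
    return image_ids


def read_manifest(path):
    """Load the image ids back from a manifest file."""
    return json.loads(Path(path).read_text())["image_ids"]


# Synthetic rows shaped like the query result; values are illustrative only.
candidates = [
    {"image_id": 397133, "num_objects": 5},
    {"image_id": 37777, "num_objects": 3},
    {"image_id": 397133, "num_objects": 5},  # duplicate on purpose
]

with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "person_frisbee_manifest.json"
    written = write_manifest(candidates, manifest_path)
    restored = read_manifest(manifest_path)

print(restored)
```

Sorting and deduplicating before writing keeps the manifest stable across runs, so it diffs cleanly under version control.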

Evolve

Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds a has_person flag, an aspect_ratio, a max_box_area that surfaces the largest annotated object area per image, and a crowded flag for images with at least 10 objects. All four can then be used directly in where clauses without recomputing the predicate on every query.

Note: Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use hf download to pull the full split first.

import lancedb

db = lancedb.connect("./coco-detection-2017-lance/data")  # local copy required for writes
tbl = db.open_table("val")

tbl.add_columns({
    "has_person": "array_has_any(categories_present, ['person'])",
    "aspect_ratio": "CAST(width AS DOUBLE) / CAST(height AS DOUBLE)",
    "max_box_area": "array_max(areas)",
    "crowded": "num_objects >= 10",
})
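
To preview what those four expressions should produce before committing them, the following plain-Python sketch mirrors each one for a single row. It is an approximation under stated assumptions: the row is synthetic, `derive_columns` is a hypothetical helper, and returning None for an empty areas list is a guess at how the SQL engine treats empty lists rather than documented behaviour.

```python
def derive_columns(row):
    """Python mirror of the four SQL expressions, evaluated for one row."""
    areas = row["areas"]
    return {
        "has_person": "person" in row["categories_present"],
        "aspect_ratio": row["width"] / row["height"],
        # Assumption: an image with no annotations yields a null maximum.
        "max_box_area": max(areas) if areas else None,
        "crowded": row["num_objects"] >= 10,
    }


# Synthetic rows with illustrative values only.
annotated = {
    "width": 640,
    "height": 480,
    "areas": [1200.0, 350.5],
    "num_objects": 2,
    "categories_present": ["person", "dog"],
}
empty = {
    "width": 500,
    "height": 500,
    "areas": [],
    "num_objects": 0,
    "categories_present": [],
}

derived = derive_columns(annotated)
derived_empty = derive_columns(empty)
print(derived)
print(derived_empty)
```

Spot-checking a handful of rows this way after the columns are added is a cheap guard against a typo in an expression.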

If the values you want to attach already live in another table (offline predictions from a baseline detector, per-image difficulty scores, or a second-pass embedding), merge them in by joining on image_id:

import pyarrow as pa

predictions = pa.table({
    "image_id": pa.array([397133, 37777, 252219], type=pa.int64()),
    "baseline_map": pa.array([0.31, 0.48, 0.22]),
})
# Join key passed positionally; images without a matching prediction are expected to get nulls.
tbl.merge(predictions, "image_id")

The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a second detector over the image bytes), Lance provides a batch-UDF API — see the Lance data evolution docs.
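
The merge on image_id behaves like a left join with the dataset on the left. The sketch below reproduces that behaviour on plain dictionaries, assuming unmatched dataset rows receive nulls and predictions without a matching image are dropped. The rows are synthetic and `left_join` is a hypothetical helper, not a Lance API.

```python
def left_join(rows, extra, key):
    """Attach columns from `extra` to `rows` by `key`; unmatched rows get None."""
    lookup = {r[key]: r for r in extra}
    new_cols = sorted({col for r in extra for col in r if col != key})
    joined = []
    for row in rows:
        match = lookup.get(row[key], {})
        joined.append({**row, **{col: match.get(col) for col in new_cols}})
    return joined


# Synthetic dataset rows and predictions; values are illustrative only.
rows = [
    {"image_id": 397133, "num_objects": 5},
    {"image_id": 139, "num_objects": 9},
]
predictions = [
    {"image_id": 397133, "baseline_map": 0.31},
    {"image_id": 37777, "baseline_map": 0.48},  # no matching row on the left
]

merged = left_join(rows, predictions, key="image_id")
for row in merged:
    print(row)
```

In practice this means a partial prediction table is safe to merge, but downstream filters on the new column should account for nulls.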

Train

Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through Permutation.identity(tbl).select_columns([...]), which plugs straight into the standard torch.utils.data.DataLoader so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For a detector training run, project the JPEG bytes alongside the parallel annotation columns the loss consumes — boxes, category ids, and (optionally) areas. Columns added in the Evolve section above cost nothing per batch until they are explicitly projected.

import lancedb
from lancedb.permutation import Permutation
from torch.utils.data import DataLoader

db = lancedb.connect("hf://datasets/lance-format/coco-detection-2017-lance/data")
tbl = db.open_table("train")

train_ds = Permutation.identity(tbl).select_columns(
    ["image", "bboxes", "categories", "areas"]
)
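def collate_ragged(batch):
    # Detection targets are ragged, so keep the batch as a list of row dicts.
    return batch

# A named function (unlike a lambda) stays picklable when workers are spawned.
loader = DataLoader(train_ds, batch_size=16, shuffle=True, num_workers=4,
                    collate_fn=collate_ragged)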

for batch in loader:
    # batch is a list of dicts: decode each JPEG, stack the bboxes / categories
    # into the target dictionary your detector expects, forward, loss...
    ...
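
The loop body above leaves target assembly open. As one possible shape for that step, the sketch below turns a batch of row dicts into (image bytes, target) pairs and drops degenerate boxes. It is written in plain Python so it runs without PyTorch. The "boxes" / "labels" keys follow the convention used by torchvision's detection models, `to_targets` and `min_size` are hypothetical, and the batch is synthetic.

```python
def to_targets(batch, min_size=1.0):
    """Turn row dicts into (image_bytes, target) pairs, dropping degenerate boxes."""
    samples = []
    for row in batch:
        boxes, labels = [], []
        for box, label in zip(row["bboxes"], row["categories"]):
            x_min, y_min, x_max, y_max = box
            # Keep only boxes with a usable width and height.
            if x_max - x_min >= min_size and y_max - y_min >= min_size:
                boxes.append([x_min, y_min, x_max, y_max])
                labels.append(label)
        samples.append((row["image"], {"boxes": boxes, "labels": labels}))
    return samples


# Synthetic batch; the byte strings are placeholders, not real JPEG data.
batch = [
    {"image": b"jpeg-0", "bboxes": [[0.0, 0.0, 50.0, 40.0], [5.0, 5.0, 5.0, 30.0]],
     "categories": [0, 16]},
    {"image": b"jpeg-1", "bboxes": [], "categories": []},
]

samples = to_targets(batch)
for image_bytes, target in samples:
    print(len(image_bytes), target)
```

If the targets feed a torchvision detector, note that those models reserve label 0 for background, so the 0–79 ids here would likely need to be shifted by one. JPEG decoding and tensor conversion would slot in where the image bytes are passed through.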

Switching feature sets is a configuration change: passing ["image_emb", "categories_present"] to select_columns(...) on the next run skips JPEG decoding entirely and reads only the cached 512-d vectors plus the deduped class list, which is the right shape for training a lightweight multi-label classifier or a class-presence probe on top of frozen features.
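
For the multi-label setup described above, the class list has to become a fixed-length target vector. A minimal sketch, assuming the vocabulary is built from the class names seen in the training rows: `build_vocab` and `multi_hot` are hypothetical helpers, and the rows are synthetic.

```python
def build_vocab(rows):
    """Map each class name seen in the rows to a stable column index."""
    names = sorted({name for row in rows for name in row["categories_present"]})
    return {name: index for index, name in enumerate(names)}


def multi_hot(names, vocab):
    """Encode a list of class names as a 0/1 vector over the vocabulary."""
    vector = [0.0] * len(vocab)
    for name in names:
        vector[vocab[name]] = 1.0
    return vector


# Synthetic rows; only the column used here is included.
rows = [
    {"categories_present": ["person", "frisbee"]},
    {"categories_present": ["dog"]},
    {"categories_present": ["person", "dog"]},
]

vocab = build_vocab(rows)
targets = [multi_hot(row["categories_present"], vocab) for row in rows]
print(vocab)
print(targets)
```

Sorting the names before indexing keeps the column order identical across runs, which matters when checkpoints are reloaded against a rebuilt vocabulary.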

Versioning

Every mutation to a Lance dataset, whether it adds a column, merges labels, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/coco-detection-2017-lance/data")
tbl = db.open_table("val")

print("Current version:", tbl.version)
print("History:", tbl.list_versions())
print("Tags:", tbl.tags.list())

Once you have a local copy, tag a version for reproducibility:

local_db = lancedb.connect("./coco-detection-2017-lance/data")
local_tbl = local_db.open_table("val")
local_tbl.tags.create("detector-baseline-v1", local_tbl.version)

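A tagged version can be checked out by name, and any version by its number. Version numbers resolve against either the Hub copy or a local one; a tag resolves only where it exists, so the tag created above is read from the local copy:

tbl_v1 = local_db.open_table("val")
tbl_v1.checkout("detector-baseline-v1")  # pin to the tagged snapshot

tbl_v5 = local_db.open_table("val")
tbl_v5.checkout(5)  # pin to a specific version number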

Pinning supports two workflows. An evaluation harness locked to detector-baseline-v1 keeps scoring against the exact same boxes and category ids while the dataset evolves in parallel; newly merged predictions or evolved columns do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same images and annotations, so changes in mAP reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking.

Materialize a subset

Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through .to_batches() into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory.

import lancedb

remote_db = lancedb.connect("hf://datasets/lance-format/coco-detection-2017-lance/data")
remote_tbl = remote_db.open_table("train")

batches = (
    remote_tbl.search()
    .where("array_has_any(categories_present, ['dog', 'cat']) AND num_objects >= 2")
    .select(["image_id", "image", "bboxes", "categories", "category_names",
             "areas", "num_objects", "categories_present", "image_emb"])
    .to_batches()
)

local_db = lancedb.connect("./coco-pets-subset")
local_db.create_table("train", batches)

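The resulting ./coco-pets-subset is a first-class LanceDB database. The snippets in the Evolve, Train, and Versioning sections above work against it once hf://datasets/lance-format/coco-detection-2017-lance/data (or the local download path) is swapped for ./coco-pets-subset and the table name is set to train. The projection above omits width and height, so add them to .select(...) if you plan to run the aspect_ratio expression from Evolve.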

Source & license

Converted from detection-datasets/coco. COCO annotations are released under CC BY 4.0; the underlying images are subject to Flickr terms of service. See the COCO Terms of Use before redistribution.

Citation

@inproceedings{lin2014microsoft,
  title={Microsoft COCO: Common objects in context},
  author={Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll{\'a}r, Piotr and Zitnick, C Lawrence},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2014}
}
