In certain cases, you may want to retrieve documents that are semantically similar to a given query while also prioritizing specific keywords. This is an example of hybrid search, a query method that combines multiple search techniques. For detailed examples, see the Python notebook or the TypeScript example.

1. Setup

Import the necessary libraries and dependencies for working with LanceDB, OpenAI embeddings, and reranking.
import os
import lancedb
import openai
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
from lancedb.rerankers import RRFReranker  # used in step 7

2. Connect to LanceDB

Establish a connection to your LanceDB instance. The call differs between open source (OSS) and Enterprise setups.

OSS
uri = "data/sample-lancedb"
db = lancedb.connect(uri)

Enterprise

For LanceDB Enterprise, set the db:// URI, region, and host override to point at your private cloud endpoint:
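# Placeholder values, not from the original page: substitute your own.
uri = "db://your-database"
api_key = os.environ.get("LANCEDB_API_KEY")
region = "us-east-1"
host_override = os.environ.get("LANCEDB_HOST_OVERRIDE")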

db = lancedb.connect(
    uri=uri,
    api_key=api_key,
    region=region,
    host_override=host_override
)

3. Configure Embedding Model

Set up an embedding model that will convert text into vector representations for semantic search.
embeddings = get_registry().get("sentence-transformers").create()

4. Create Table & Schema

Define the data structure for your documents, including both the text content and its vector representation.
class Documents(LanceModel):
    text: str = embeddings.SourceField()
    vector: Vector(embeddings.ndims()) = embeddings.VectorField()

table_name = "hybrid_search_example"
table = db.create_table(table_name, schema=Documents, mode="overwrite")

5. Add Data

Insert sample documents into your table, which will be used for both semantic and keyword search.
data = [
    {"text": "rebel spaceships striking from a hidden base"},
    {"text": "have won their first victory against the evil Galactic Empire"},
    {"text": "during the battle rebel spies managed to steal secret plans"},
    {"text": "to the Empire's ultimate weapon the Death Star"},
]
table.add(data=data)

6. Build Full Text Index

Create a full-text search index on the text column to enable keyword-based search capabilities.
table.create_fts_index("text")
# wait_for_index is a user-defined polling helper; it is not defined in the snippets on this page.
wait_for_index(table, "text_idx")
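The `wait_for_index` call above relies on a helper that this page does not define. A minimal sketch of one, assuming the table object exposes `list_indices()` and that each returned entry has a `name` attribute; the polling interval and timeout defaults are illustrative choices, not values from the original page:

```python
import time


def wait_for_index(table, index_name, poll_interval=10.0, timeout=600.0):
    """Poll until an index called `index_name` is listed on `table`.

    Returns True once the index appears and raises TimeoutError if it
    has not appeared by the time `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Assumption: list_indices() yields objects with a `.name` attribute.
        if any(idx.name == index_name for idx in table.list_indices()):
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError(f"index {index_name!r} not ready after {timeout}s")
        time.sleep(poll_interval)
```

Waiting like this likely matters most when index builds run asynchronously, as on a remote deployment; on a local table the index may already be listed by the time `create_fts_index` returns.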

7. Set Reranker [Optional]

Initialize the reranker that will combine and rank results from both semantic and keyword search. By default, LanceDB uses the RRF reranker, but you can choose other rerankers such as Cohere, CrossEncoder, or others listed in the integrations section.
reranker = RRFReranker()

8. Hybrid Search

Perform a hybrid search query that combines semantic similarity with keyword matching, using the specified reranker to merge and rank the results.
results = (
    table.search(
        "flower moon",
        query_type="hybrid",
        vector_column_name="vector",
        fts_columns="text",
    )
    .rerank(reranker)
    .limit(10)
    .to_pandas()
)

print("Hybrid search results:")
print(results)

9. Hybrid Search - Explicit Vector and Text Query pattern

You can also pass the vector and text query explicitly. This is useful if you’re not using the embedding API or if you’re using a separate embedder service.
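# Illustrative 5-dimensional vector; a real query vector must have the same
# dimensionality as the table's vector column (embeddings.ndims() in step 4).
vector_query = [0.1, 0.2, 0.3, 0.4, 0.5]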
text_query = "flower moon"
(
    table.search(query_type="hybrid")
    .vector(vector_query)
    .text(text_query)
    .limit(5)
    .to_pandas()
)

Query controls

Hybrid queries inherit the same builder API as vector and FTS queries, so the same knobs for filtering, distance bounds, and row identity apply. These compose with .rerank(...) and the explicit .vector() / .text() form shown above.

Returning row IDs

Pass with_row_id(True) (Python) or withRowId() (TypeScript) to include the internal _rowid column in the results. This is useful for joining hybrid results back to a primary table, or for deduping across multiple queries:
results = (
    table.search("flower moon", query_type="hybrid")
    .with_row_id(True)
    .limit(10)
    .to_pandas()
)
# results now contains a `_rowid` column alongside `_relevance_score`
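As a rough illustration of the deduping use case mentioned above, the sketch below merges rows from several queries and keeps the first row seen for each `_rowid`. It assumes the results are available as lists of dicts (for example via `.to_list()` instead of `.to_pandas()`); the `dedupe_by_rowid` name and the sample rows are hypothetical:

```python
def dedupe_by_rowid(*result_sets):
    """Merge rows from several result sets, keeping the first row per _rowid."""
    seen = set()
    merged = []
    for rows in result_sets:
        for row in rows:
            if row["_rowid"] not in seen:
                seen.add(row["_rowid"])
                merged.append(row)
    return merged


# Hypothetical rows standing in for the output of two separate queries.
first = [{"_rowid": 0, "text": "a"}, {"_rowid": 2, "text": "c"}]
second = [{"_rowid": 2, "text": "c"}, {"_rowid": 3, "text": "d"}]
merged = dedupe_by_rowid(first, second)
```

Earlier result sets take priority here, so pass the query you trust most first.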

Bounding vector distance

distance_range(lower, upper) (Python) and distanceRange(lower, upper) (TypeScript) constrain the vector half of the hybrid query to the half-open interval [lower, upper). This is helpful when you want to cap how far semantic candidates can drift from the query vector before reranking:
results = (
    table.search("flower moon", query_type="hybrid")
    .distance_range(lower_bound=0.0, upper_bound=0.4)
    .limit(10)
    .to_pandas()
)
Either bound can be omitted to leave that side unbounded.
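The half-open interval semantics described above can be spelled out in plain Python. This is only a model of the stated behaviour, not LanceDB's implementation, and `in_distance_range` is a hypothetical name:

```python
def in_distance_range(distance, lower_bound=None, upper_bound=None):
    """Return True if `distance` lies in [lower_bound, upper_bound).

    A bound of None leaves that side of the interval unbounded.
    """
    if lower_bound is not None and distance < lower_bound:
        return False
    if upper_bound is not None and distance >= upper_bound:
        return False
    return True


# Hypothetical candidate distances from the vector half of a query.
distances = [0.0, 0.15, 0.4, 0.75]
kept = [d for d in distances if in_distance_range(d, 0.0, 0.4)]
```

Note that a candidate at exactly the upper bound (0.4 here) is excluded, while one at exactly the lower bound is kept.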

Prefilter vs. postfilter

When the query carries a metadata filter via where(...), you can choose whether the filter runs before or after the vector and FTS sub-queries. Prefiltering (the default) applies where to the candidate set before scoring, which is usually what you want — it shrinks the working set and benefits from any scalar indexes on the filter columns. Postfiltering runs the filter on the already-ranked top-k from each sub-query; this can be faster when the filter is non-selective or unindexed, but it may return fewer than limit rows because some of the top-k may be filtered out.
# Prefilter (default): filter applied before scoring
table.search("flower moon", query_type="hybrid") \
    .where("category = 'film'", prefilter=True) \
    .limit(10) \
    .to_pandas()

# Postfilter: filter applied after the sub-queries return top-k
table.search("flower moon", query_type="hybrid") \
    .where("category = 'film'", prefilter=False) \
    .limit(10) \
    .to_pandas()
The choice gets baked into both sub-queries, so the vector and FTS halves see the filter applied the same way. Use explain_plan on a hybrid query to see whether the filter pushed into the scan or ran as a separate FilterExec step.
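To see why postfiltering can return fewer than `limit` rows, here is a toy model of a single sub-query over in-memory rows. It is a sketch of the ordering of filter and top-k only, not of how LanceDB executes queries; the rows, scores, and function names are all hypothetical:

```python
rows = [
    {"id": 0, "category": "film", "score": 0.2},
    {"id": 1, "category": "book", "score": 0.9},
    {"id": 2, "category": "book", "score": 0.8},
    {"id": 3, "category": "film", "score": 0.5},
    {"id": 4, "category": "film", "score": 0.4},
]


def top_k(candidates, k):
    """Return the k highest-scoring rows."""
    return sorted(candidates, key=lambda r: r["score"], reverse=True)[:k]


def prefilter_search(rows, predicate, k):
    """Filter first, then score: top-k is taken from matching rows only."""
    return top_k([r for r in rows if predicate(r)], k)


def postfilter_search(rows, predicate, k):
    """Score first, then filter: rows in the top-k may be discarded."""
    return [r for r in top_k(rows, k) if predicate(r)]


def is_film(row):
    return row["category"] == "film"


pre = prefilter_search(rows, is_film, k=3)
post = postfilter_search(rows, is_film, k=3)
```

With these rows, the prefiltered search fills all three slots with films, while the postfiltered search keeps only the one film that made the unfiltered top three.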

More on Reranking

You can perform hybrid search in LanceDB by combining the results of semantic and full-text search via a reranking algorithm of your choice. LanceDB comes with built-in rerankers, and you can also implement your own custom reranker. By default, LanceDB uses RRFReranker(), which combines and reranks the semantic and full-text results using reciprocal rank fusion scores. You can customize its hyperparameters as needed. The rerank() call accepts the following arguments:
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| normalize | str | "score" | The method to normalize the scores. Can be rank or score. If rank, the scores are converted to ranks and then normalized. If score, the scores are normalized directly. |
| reranker | Reranker | RRF() | The reranker to use. If not specified, the default reranker is used. |
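For intuition about the reciprocal rank fusion mentioned above, the following is a minimal plain-Python sketch of the general technique, not LanceDB's RRFReranker code. It assumes ranks start at 1 and uses k = 60, a commonly used constant; the function name and document IDs are hypothetical:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs.

    Each document scores the sum of 1 / (k + rank) over every list it
    appears in, with rank starting at 1. Returns (doc_id, score) pairs
    sorted from highest to lowest fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ranked hits from the vector and full-text sub-queries.
vector_hits = ["a", "b", "c"]
fts_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, fts_hits])
```

Documents that appear in both lists ("a" and "b") end up ahead of those found by only one sub-query, which is the behaviour that makes rank fusion a reasonable default for hybrid search.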