Modern machine learning models can be trained to convert raw data into embeddings: vectors of floating-point numbers. The position of an embedding in vector space captures the semantics of the data, so vectors that are close to each other are considered similar. LanceDB provides an embedding function registry in OSS as well as its Enterprise versions (see below) that automatically generates vector embeddings during data ingestion. Automatic query-time embedding generation is available in LanceDB OSS, with SDK-specific query ergonomics. The API abstracts embedding generation, allowing you to focus on your application logic.
Embedding Registry
You can get a supported embedding function from the registry and use it in your table schema. Once configured, the embedding function automatically generates embeddings when you insert data into the table. Query-time behavior depends on the SDK: Python and TypeScript can query with text directly, while Rust examples typically compute query embeddings explicitly before vector search.
Using an embedding function
Create an embedding function before you attach it to table or schema metadata. Python and TypeScript fetch provider implementations from the embedding registry, while Rust constructs the provider embedding function directly and registers it on the connection before using it in an `EmbeddingDefinition`.
Provider configuration is SDK-specific, so copy the option names from the provider page for the SDK you use.
For example, the OpenAI model is selected with `name` in Python, `model` in TypeScript, and the `model` argument to `OpenAIEmbeddingFunction::new_with_model` in Rust.
| Concept | Python | TypeScript | Rust |
|---|---|---|---|
| Model | name="text-embedding-3-small" | { model: "text-embedding-3-small" } | new_with_model(api_key, "text-embedding-3-small") |
| Retry count | max_retries=7 | Provider/client-specific | Provider/client-specific |
| API key | api_key="...", environment variables, or $var: | apiKey: "...", environment variables, or $var: | Constructor argument or environment variable |
| Device | Provider-specific, for example device="cuda" | Provider-specific | Provider-specific |
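The get-from-registry, then configure-by-model-name pattern described above can be sketched in plain Python. This is a minimal mock, not the real LanceDB API: the `Registry` and `FakeOpenAIEmbedding` classes and the 8-dimensional hash-based vectors are illustrative stand-ins.

```python
import hashlib

class FakeOpenAIEmbedding:
    """Stand-in provider: deterministic pseudo-embeddings from a hash."""

    def __init__(self, model="text-embedding-3-small", dims=8):
        self.model = model
        self.dims = dims

    def ndims(self):
        return self.dims

    def compute(self, text):
        # Derive a deterministic vector from the text's SHA-256 digest.
        digest = hashlib.sha256(text.encode()).digest()
        return [b / 255.0 for b in digest[: self.dims]]

class Registry:
    """Maps provider IDs (e.g. "openai") to embedding-function factories."""

    def __init__(self):
        self._providers = {}

    def register(self, provider_id, factory):
        self._providers[provider_id] = factory

    def get(self, provider_id):
        return self._providers[provider_id]

registry = Registry()
registry.register("openai", FakeOpenAIEmbedding)

# Fetch a provider from the registry and select the model by name,
# mirroring the SDK-specific model option shown in the table above.
func = registry.get("openai")(model="text-embedding-3-small")

# Ingest-time: each source text is embedded into a fixed-size vector
# stored alongside it, one source/vector pair per row.
rows = [{"text": t, "vector": func.compute(t)} for t in ["hello", "world"]]
print(len(rows[0]["vector"]))  # vector dimensionality
```

The point of the registry indirection is that ingestion code only names a provider ID and model; the concrete embedding implementation is resolved at runtime.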
You can use `$var:` placeholders in embedding-function config. This is useful for provider secrets and environment-specific settings in Python and TypeScript.
- Python uses `registry.set_var(...)`.
- TypeScript uses `registry.setVar(...)`.
- You can provide a fallback with `$var:name:default`.
- Sensitive values such as API keys should be passed through registry variables instead of hardcoding them in config.
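The resolution rules above can be sketched as follows. The `resolve_var` helper is hypothetical, shown only to illustrate the `$var:name` and `$var:name:default` forms; the real registries implement this internally.

```python
def resolve_var(value, variables):
    """Resolve "$var:name" or "$var:name:default" against a variable store.

    Plain values pass through unchanged; a missing variable with no
    default raises an error.
    """
    if not isinstance(value, str) or not value.startswith("$var:"):
        return value  # not a placeholder
    _, _, rest = value.partition("$var:")
    name, sep, default = rest.partition(":")
    if name in variables:
        return variables[name]
    if sep:  # a ":default" part was present
        return default
    raise KeyError(f"registry variable {name!r} is not set")

# Analogous to registry.set_var(...) in Python / registry.setVar(...) in TS.
variables = {"openai_key": "sk-demo"}

print(resolve_var("$var:openai_key", variables))         # sk-demo
print(resolve_var("$var:region:us-east-1", variables))   # us-east-1
print(resolve_var("text-embedding-3-small", variables))  # passthrough
```

Because the placeholder is resolved at use time rather than stored, the secret itself never lands in the table's persisted config.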
Multiple embedding columns
A single table can include more than one embedding definition when you want to store multiple semantic views of the same data, or to generate embeddings from different source columns. In practice, each embedding definition maps one source column to one vector column, and the table schema can contain multiple such pairs. The exact setup differs by SDK, but the underlying pattern is the same: define a distinct source/vector pair for each embedding function you want applied during ingest.
Embedding model providers
LanceDB supports most popular embedding providers.
Text embeddings
| Provider | Model ID | Default Model |
|---|---|---|
| OpenAI | openai | text-embedding-ada-002 |
| Sentence Transformers | sentence-transformers | all-MiniLM-L6-v2 |
| Hugging Face | huggingface | colbert-ir/colbertv2.0 |
| Cohere | cohere | embed-english-v3.0 |
| … | … | … |
Multimodal embedding
| Provider | Model ID | Supported Inputs |
|---|---|---|
| OpenCLIP | open-clip | Text, Images |
| ImageBind | imagebind | Text, Images, Audio, Video |
| … | … | … |
Embeddings in LanceDB Enterprise
In LanceDB Enterprise, embedding generation during data ingestion is client-side, and the resulting vectors are stored on the remote table. The Enterprise server does not currently generate embeddings from query text on its own; any automatic query-time embedding happens on the client side.
How string queries are interpreted
For the Python remote client, `table.search("hello")` can take two different paths:
- If the selected vector column has embedding metadata (i.e., the table schema stores the source-column, vector-column, and embedding-function mapping created from fields like `SourceField()` and `VectorField()` during table creation), then the embeddings are computed in the Python client process. The client uses the same local LanceDB embedding registry used by OSS tables to reconstruct the embedding function from schema metadata, compute the query vector in the client process, and send that vector to Enterprise for search.
- If the table does not have embedding metadata for that search, `table.search("hello")` in `auto` mode is treated as an FTS query instead.
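The two paths above can be sketched as a small dispatch function. This is a simplified model of the remote client's behavior, not its actual implementation; `dispatch_search` and `fake_embed` are illustrative names.

```python
def dispatch_search(query, vector_column_metadata, embed):
    """Decide how a string query is executed, mirroring the two paths above.

    vector_column_metadata: the embedding mapping stored in the table
    schema (source column, vector column, embedding function), or None.
    embed: the client-side embedding function reconstructed from that
    metadata.
    """
    if vector_column_metadata is not None:
        # Path 1: embed on the client, then send the vector for ANN search.
        query_vector = embed(query)
        return ("vector_search", query_vector)
    # Path 2: no embedding metadata -> treated as a full-text search.
    return ("fts", query)

# Hypothetical embedding function for illustration.
fake_embed = lambda text: [float(len(text)), 0.0, 1.0]

metadata = {"source": "text", "vector": "vector", "function": "openai"}
print(dispatch_search("hello", metadata, fake_embed))  # vector search path
print(dispatch_search("hello", None, fake_embed))      # FTS path
```

The key observation is that the server only ever sees either a precomputed vector or a plain text query; the embedding decision is made entirely on the client from schema metadata.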
Custom Embedding Functions
You can always implement your own embedding function:
- Python/TypeScript: subclass `TextEmbeddingFunction` (text) or `EmbeddingFunction` (multimodal).
- Rust: implement the `EmbeddingFunction` trait.
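The subclassing pattern can be sketched in plain Python. The `TextEmbeddingFunctionBase` class here is a mock standing in for the SDK's text-embedding base class, and the character-frequency model is a toy; the real subclassing contract is defined by each SDK.

```python
from abc import ABC, abstractmethod

class TextEmbeddingFunctionBase(ABC):
    """Mock of a text-embedding interface (illustrative stand-in)."""

    @abstractmethod
    def ndims(self):
        """Dimensionality of the vectors this function produces."""

    @abstractmethod
    def generate_embeddings(self, texts):
        """Embed a batch of strings into lists of floats."""

class CharFrequencyEmbedding(TextEmbeddingFunctionBase):
    """Toy custom function: embeds text as normalized a..z frequencies."""

    def ndims(self):
        return 26

    def generate_embeddings(self, texts):
        out = []
        for text in texts:
            counts = [0.0] * 26
            for ch in text.lower():
                if "a" <= ch <= "z":
                    counts[ord(ch) - ord("a")] += 1.0
            total = sum(counts) or 1.0
            out.append([c / total for c in counts])
        return out

func = CharFrequencyEmbedding()
vectors = func.generate_embeddings(["hello", "world"])
print(len(vectors), len(vectors[0]))  # 2 26
```

A custom function only needs to report a fixed dimensionality and turn batches of input into vectors of that size; everything else (storage, indexing, query dispatch) is handled by the table machinery.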