feat: DuckDB extension for zvec collections (MVP) #136

Open
akuligowski9 wants to merge 1 commit into alibaba:main from akuligowski9:feat/duckdb-extension

Summary

  • Implements a DuckDB extension (Phase 1 MVP) per "[Feature]: Add duckdb extension" (#134), using a collection-bridge approach: zvec owns storage, and the DuckDB functions act as a SQL bridge to the Collection API
  • Adds 4 SQL functions: zvec_create, zvec_insert, zvec_search (table function), zvec_fetch (table function)
  • Includes thread-safe CollectionRegistry, full zvec↔DuckDB type mapping, JSON schema parser, and SQL tests

Design Decisions (seeking feedback)

  • Collection-bridge vs native DuckDB index: Chose the bridge approach for simplicity: zvec manages storage and the DuckDB functions query it. This avoids deep integration with DuckDB's storage layer, but it means collections are external to DuckDB's catalog.
  • Vector representation: Vectors are passed as DuckDB FLOAT[] arrays and converted to zvec's raw binary format internally.
  • JSON-based schema & insert: Schema creation and document insertion use JSON strings for flexibility. Alternative: structured DuckDB parameters.
  • Global singleton registry: Open collections are cached in a mutex-protected singleton keyed by path. This enables concurrent read access but serializes writes.
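
To make the registry decision concrete for reviewers, here is a minimal sketch of a mutex-protected singleton cache keyed by path. It is not the code in this PR: `Collection` is a hypothetical stand-in for the zvec handle type, and `GetOrOpen`, `Size`, and the `open` callback are names invented for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for an open zvec collection handle; the real
// type and its API are not shown in this PR description.
struct Collection {
    std::string path;
};

// Mutex-protected singleton cache of open collections, keyed by path.
class CollectionRegistry {
public:
    static CollectionRegistry &Instance() {
        static CollectionRegistry registry;  // initialization is thread-safe since C++11
        return registry;
    }

    // Returns the cached handle for `path`; otherwise calls `open` once,
    // caches the result, and returns it. `open` is an assumed callback.
    std::shared_ptr<Collection> GetOrOpen(
        const std::string &path,
        const std::function<std::shared_ptr<Collection>(const std::string &)> &open) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = cache_.find(path);
        if (it != cache_.end()) {
            return it->second;
        }
        auto handle = open(path);
        cache_.emplace(path, handle);
        return handle;
    }

    // Number of collections currently cached.
    std::size_t Size() {
        std::lock_guard<std::mutex> guard(mutex_);
        return cache_.size();
    }

private:
    CollectionRegistry() = default;

    std::mutex mutex_;
    std::unordered_map<std::string, std::shared_ptr<Collection>> cache_;
};
```

In this sketch the lock covers only the lookup and the first open, so callers use the returned `shared_ptr` outside the lock. How writes are serialized beyond that point is not specified here and would depend on the actual Collection API.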

Functions

| Function | Type | SQL Signature |
| --- | --- | --- |
| zvec_create | Scalar | (path VARCHAR, schema_json VARCHAR) → VARCHAR |
| zvec_insert | Scalar | (path VARCHAR, pk VARCHAR, doc_json VARCHAR) → VARCHAR |
| zvec_search | Table | (path VARCHAR, field VARCHAR, vector FLOAT[], topk INT) → (pk, score, ...fields) |
| zvec_fetch | Table | (path VARCHAR, pk VARCHAR) → (pk, ...fields) |

Example Usage

SELECT zvec_create('/tmp/my_col', '{"name": "articles", "fields": [
  {"name": "title", "type": "STRING"},
  {"name": "embedding", "type": "VECTOR_FP32", "dimension": 128,
   "index": {"type": "HNSW", "metric": "COSINE"}}
]}');

SELECT zvec_insert('/tmp/my_col', 'doc1',
  '{"title": "hello world", "embedding": [0.1, 0.2, ...]}');

SELECT * FROM zvec_search('/tmp/my_col', 'embedding',
  [0.1, 0.2, ...]::FLOAT[], 10);
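
The FLOAT[] argument passed to zvec_search is, per the design notes, converted to zvec's raw binary format internally. The sketch below shows one way such a conversion could look, assuming the raw format is a dense float32 array in host byte order; the PR description does not spell out the layout, and `PackFp32` and `UnpackFp32` are hypothetical helper names.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Packs float32 values into a contiguous byte buffer in host byte order.
// Assumption: the raw vector format is a dense float32 array.
std::string PackFp32(const std::vector<float> &values) {
    std::string bytes(values.size() * sizeof(float), '\0');
    if (!values.empty()) {
        std::memcpy(&bytes[0], values.data(), bytes.size());
    }
    return bytes;
}

// Inverse of PackFp32; trailing bytes that do not form a whole float are ignored.
std::vector<float> UnpackFp32(const std::string &bytes) {
    std::vector<float> values(bytes.size() / sizeof(float));
    if (!values.empty()) {
        std::memcpy(values.data(), bytes.data(), values.size() * sizeof(float));
    }
    return values;
}
```

A real implementation would presumably also check the element count against the field's declared dimension (128 in the example schema) before handing the buffer to zvec.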

Test plan

  • Verify make builds the extension without errors against a DuckDB submodule
  • Verify the extension loads in DuckDB CLI
  • SQL test: create collection → insert docs → search → verify score ordering
  • SQL test: create → insert → fetch by PK → verify field values
  • Test with larger collections (100+ docs)
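
For the score-ordering check in the plan above, a test harness could use a small helper like the following. This is only a sketch: the helper names are hypothetical, and because the description does not say whether a better match means a higher or a lower score for a given metric, both directions are covered.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// True when scores never increase from one row to the next.
bool IsNonIncreasing(const std::vector<float> &scores) {
    return std::is_sorted(scores.begin(), scores.end(), std::greater<float>());
}

// True when scores never decrease from one row to the next.
bool IsNonDecreasing(const std::vector<float> &scores) {
    return std::is_sorted(scores.begin(), scores.end());
}

// True when the result set is ordered consistently in either direction.
bool IsConsistentlyOrdered(const std::vector<float> &scores) {
    return IsNonIncreasing(scores) || IsNonDecreasing(scores);
}
```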

Closes #134

🤖 Generated with Claude Code

CLAassistant commented Feb 16, 2026

CLA assistant check
All committers have signed the CLA.

Implements a collection-bridge DuckDB extension with 4 SQL functions:
- zvec_create(path, schema_json): create collections from JSON schema
- zvec_insert(path, pk, doc_json): insert documents with JSON fields
- zvec_search(path, field, vector, topk): vector similarity search
- zvec_fetch(path, pk): fetch documents by primary key

Includes thread-safe CollectionRegistry, full type mapping between
zvec and DuckDB types, JSON schema parser, and SQL tests.

Closes alibaba#134

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>