feat(gemini): add GeminiDenseEmbedder text embedding provider #751
chethanuk wants to merge 1 commit into volcengine:main from
Conversation
@qin-ctx Please review and merge this
|
qin-ptr left a comment
Review Summary
This PR adds Google Gemini as a text embedding provider with excellent test coverage (107 tests) and solid error handling. However, there are three blocking issues that need to be addressed:

1. Design mismatch: Issue #566 explicitly requests multimodal retrieval support, but this PR only implements text-only embedding (`supports_multimodal = False`). The PR description should clarify this is a phased implementation with a roadmap for multimodal support, or change the issue reference to indicate partial completion.
2. Documentation inconsistency: a code comment claims "multimodal" but the implementation is text-only.
3. CI failure: the lint job failed due to formatting issues in `test_gemini_embedder.py`.

Once these are resolved, the PR will be ready to merge. The implementation quality is strong, with comprehensive testing, graceful error handling, and proper SDK retry integration.
🤖 I am a bot owned by @qin-ctx.
```
- OpenAI: Dense only
- Volcengine: Dense, Sparse, Hybrid
- Jina AI: Dense only
- Google Gemini: Dense only (multimodal)
```
[Bug] (blocking)
Comment says "Google Gemini: Dense only (multimodal)" but the actual implementation has GeminiDenseEmbedder.supports_multimodal = False. This is contradictory.
Should either:
- Change to "Google Gemini: Dense only (text-only; multimodal planned)"
- Or remove the "(multimodal)" annotation entirely until multimodal support is actually implemented
```
# text-embedding-004: 768 fixed-dim legacy model, does not support MRL truncation
# Future gemini-embedding-*: default 3072 via _default_dimension() fallback
# Future text-embedding-*: default 768 via _default_dimension() prefix rule
supports_multimodal: bool = False  # text-only; multimodal planned separately
```
[Design] (blocking)
This PR claims to close issue #566, which explicitly requests "first-class Gemini Embedding 2 support for multimodal retrieval". The issue emphasizes:
- "it should not reduce everything to plain text before embedding"
- "multimodal inputs handled as multimodal, not flattened by default"
However, this implementation sets supports_multimodal = False and only handles text. This is a fundamental mismatch between the issue requirement and the PR implementation.
Recommendation: Either:
- Update the PR description to clarify this is phase 1 (text-only) with a roadmap for phase 2 (multimodal), and change the issue reference to "Partial implementation of [Feature]: Add first-class Gemini Embedding 2 support for multimodal retrieval #566", keeping the issue open
- Or implement multimodal support in this PR as originally requested by the issue
The current "Closes: #566" statement is misleading and will cause users to expect multimodal functionality that doesn't exist.
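To make the reviewer's concern concrete, here is a hypothetical sketch (class and function names are illustrative, not the PR's actual API) of how a caller could gate on the `supports_multimodal` flag, so a text-only provider rejects non-text input instead of silently flattening it:

```python
# Hypothetical sketch: gate on the capability flag discussed above so
# text-only providers fail fast on multimodal input. Names are illustrative.
class DenseEmbedderBase:
    supports_multimodal: bool = False

class GeminiDenseEmbedder(DenseEmbedderBase):
    supports_multimodal = False  # text-only in this PR; multimodal is planned

def check_parts(embedder: DenseEmbedderBase, parts: list) -> None:
    # Reject non-string parts for providers that cannot embed them natively.
    if any(not isinstance(p, str) for p in parts) and not embedder.supports_multimodal:
        raise ValueError("embedder is text-only; refusing to flatten multimodal input")

check_parts(GeminiDenseEmbedder(), ["plain text"])  # accepted
```

With a check like this, a user who feeds image bytes to the text-only Gemini embedder gets an explicit error rather than the silent flattening the issue warns against.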
```
@@ -0,0 +1,479 @@
# Copyright (c) 2026 Beijing Volcano Engine Technology Co., Ltd.
```
[Bug] (blocking)
CI lint job failed because this file needs formatting:

```
Would reformat: tests/unit/test_gemini_embedder.py
```

Please run `black tests/unit/test_gemini_embedder.py` and commit the formatted version.
```
if backend is not None and provider is None:
    data["provider"] = backend
for key in ("query_param", "document_param"):
for key in ("input_type",):
```
[Suggestion] (non-blocking)
The `for key in ("input_type",):` loop normalizes `input_type` to lowercase, but the new Gemini provider uses `task_type` instead of `input_type`. Consider also normalizing `task_type` here, or document that `task_type` must be uppercase (which is currently enforced in the Gemini validator at line 176).
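A minimal sketch of this suggestion, assuming a `data` dict shaped like the diff above (the function body is illustrative, not the project's actual implementation): the existing hook lowercases `input_type`, and a second loop uppercases the Gemini task fields so lowercase config values pass the strict validator.

```python
# Hypothetical sketch of the suggested normalization. Field names follow the
# diff; the exact implementation in the project may differ.
def sync_provider_backend(data: dict) -> dict:
    backend = data.get("backend")
    provider = data.get("provider")
    if backend is not None and provider is None:
        data["provider"] = backend
    for key in ("input_type",):  # existing behaviour: lowercase
        if isinstance(data.get(key), str):
            data[key] = data[key].lower()
    # Suggested addition: uppercase Gemini task fields so config values like
    # "retrieval_document" pass the uppercase-only validator.
    for key in ("task_type", "query_param", "document_param"):
        if isinstance(data.get(key), str):
            data[key] = data[key].upper()
    return data

print(sync_provider_backend({"backend": "gemini", "task_type": "retrieval_document"}))
# {'backend': 'gemini', 'provider': 'gemini', 'task_type': 'RETRIEVAL_DOCUMENT'}
```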
```
task_type: Optional[str] = None,
title: Optional[str] = None,
) -> EmbedResult:
if not text or not text.strip():
```
[Suggestion] (non-blocking)
Returning a zero vector for empty text is a reasonable fallback, but it happens silently. Consider logging a warning so callers know their empty input was handled specially:

```python
if not text or not text.strip():
    logger.warning("Empty text passed to embed(), returning zero vector")
    return EmbedResult(dense_vector=[0.0] * self._dimension)
```

This helps with debugging when unexpected empty strings appear in production.
…fig normalization, empty-text warning

- Fix "(multimodal)" annotation to "(text-only; multimodal planned)" to match supports_multimodal=False (B1)
- Run black formatter on test_gemini_embedder.py (B3)
- Add auto-uppercase normalization for task_type/query_param/document_param in sync_provider_backend so lowercase config values pass validation (NB1)
- Add logger.warning on empty text in embed() for production debuggability (NB2)
- Add test_gemini_task_type_case_insensitive test (NB1)
- Fix ruff import sorting in __init__.py
Thanks for the thorough review! All issues addressed in commit bcfd612:

- @qin-ptr — blocking fixes
- @qin-ptr — non-blocking (adopted both)
- @ZaynJarvis — #702 pattern + #718

Verification
1586f65 to 893a1da
@qin-ctx Can we merge? Main keeps changing now that tests are working.
…ngine#702 pattern

- GeminiDenseEmbedder: accept query_param/document_param, use is_query in embed() and embed_batch() to select task_type at call time
- EmbeddingConfig: add Gemini provider, factory, validation, dimension
- No get_query_embedder/get_document_embedder/_get_contextual_embedder (removed in volcengine#702; embed(is_query=True/False) is the pattern)
- Tests use embed(text, is_query=True/False) pattern throughout
- Rebased onto current upstream/main
98f7ae4 to fba32cf
@ZaynJarvis Fixed per #702:

pls fix workflow
```
"QUESTION_ANSWERING, FACT_VERIFICATION, CODE_RETRIEVAL_QUERY. "
"For non-symmetric mode set query_param/document_param instead."
),
)
```
I don't think `task_type` is needed; `query_param` and `document_param` are supposed to contain the task_type.
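A hypothetical sketch of this point (names are illustrative, not the PR's actual classes): if `query_param` and `document_param` each carry a Gemini task type, the embedder can resolve the effective task type from `is_query` at call time, making a separate symmetric `task_type` field redundant.

```python
# Illustrative only: resolve the Gemini task type from query_param /
# document_param at call time, per the reviewer's suggestion.
from dataclasses import dataclass
from typing import Optional

@dataclass
class GeminiTaskRouter:
    query_param: str = "RETRIEVAL_QUERY"
    document_param: str = "RETRIEVAL_DOCUMENT"

    def resolve(self, is_query: bool, override: Optional[str] = None) -> str:
        # A per-call override still wins; otherwise pick by role.
        if override is not None:
            return override
        return self.query_param if is_query else self.document_param

router = GeminiTaskRouter()
print(router.resolve(is_query=True))   # RETRIEVAL_QUERY
print(router.resolve(is_query=False))  # RETRIEVAL_DOCUMENT
```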

Description
Phase 1 of #566: Adds `GeminiDenseEmbedder`, a production-ready, opt-in, text-only dense embedding provider backed by Google's `google-genai` SDK. Volcengine remains the default; Gemini is enabled via `provider: "gemini"` in `ov.conf`. No existing behaviour changes.

Phased Implementation
Key capabilities
- `supports_multimodal = False`; multimodal is Phase 2
- `get_query_embedder()` / `get_document_embedder()` route `query_param` / `document_param` to Gemini task types via factory (follows "feat(embedding): combine document embedder and query embedder to avoi…" #702 pattern)
- Per-call `task_type` + `title` — override at call time; all 8 Gemini task types
- `task_type`, `query_param`, `document_param` auto-uppercased in config normalization
- `HttpRetryOptions(attempts=3, exp_base=2, max_delay=30s)` with import guard for SDK < 0.8
- `async_embed_batch()` dispatches 100-text chunks in parallel via `anyio`

```mermaid
flowchart LR
    subgraph Config["ov.conf / EmbeddingModelConfig"]
        CFG["provider: gemini\napi_key: sk-…\ntask_type: RETRIEVAL_DOCUMENT\ndimension: 1536"]
    end
    subgraph Embedder["GeminiDenseEmbedder(DenseEmbedderBase)"]
        direction TB
        BC["_build_config(task_type?, title?)"]
        E["embed(text, task_type?, title?)"]
        EB["embed_batch(texts, titles?)"]
        AEB["async_embed_batch(texts)"]
        E --> BC
        EB --> BC
        AEB --> BC
    end
    subgraph SDK["google-genai SDK"]
        direction TB
        HTTP["HttpOptions\n(retry: 3× exp backoff)"]
        SYNC["models.embed_content()"]
        ASYNC["aio.models.embed_content()\n(semaphore-bounded)"]
        HTTP --> SYNC
        HTTP --> ASYNC
    end
    Config --> Embedder
    BC --> SYNC
    AEB --> ASYNC
```
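The chunked batch dispatch described in the capabilities can be sketched as follows. This is a dependency-free illustration using `asyncio.gather` and a fake per-chunk embedder; the PR itself uses `anyio` and the `google-genai` SDK, so treat the names here as stand-ins.

```python
# Illustrative sketch: split texts into 100-item chunks and embed the
# chunks concurrently, then flatten results back into one vector list.
import asyncio

CHUNK_SIZE = 100

def chunked(texts, size=CHUNK_SIZE):
    return [texts[i:i + size] for i in range(0, len(texts), size)]

async def fake_embed_chunk(chunk):
    await asyncio.sleep(0)         # stand-in for the real SDK call
    return [[0.0] for _ in chunk]  # one vector per input text

async def async_embed_batch(texts):
    results = await asyncio.gather(*(fake_embed_chunk(c) for c in chunked(texts)))
    return [vec for chunk_vecs in results for vec in chunk_vecs]

vectors = asyncio.run(async_embed_batch(["t"] * 250))
print(len(chunked(["t"] * 250)), len(vectors))  # 3 250
```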
Related Issue

Partial implementation of #566 (Phase 1: text-only dense embedding; Phase 2: multimodal — issue stays open)
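For reference, a sketch of what the Gemini block in `ov.conf` might look like, built from the fields shown in the PR's config diagram. The top-level `embedding:` key and YAML-style syntax are assumptions; `examples/ov.conf.example` in the PR is the authoritative version.

```yaml
# Illustrative only: field names taken from the PR description; the exact
# ov.conf syntax may differ.
embedding:
  provider: "gemini"
  api_key: "sk-…"                  # Google API key
  task_type: "RETRIEVAL_DOCUMENT"  # auto-uppercased during normalization
  dimension: 1536
```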
Type of Change
Changes Made
- `openviking/models/embedder/gemini_embedders.py`: `GeminiDenseEmbedder(DenseEmbedderBase)` with `_build_config()`, per-call task_type/title, SDK retry, `__repr__`, empty-text guard with warning log
- `openviking_cli/utils/config/embedding_config.py`: registers `"gemini"` provider; validates and auto-uppercases `task_type`/`query_param`/`document_param`; factory routes context for non-symmetric mode
- `openviking/models/embedder/__init__.py`: exports `GeminiDenseEmbedder`; annotation: "text-only; multimodal planned"
- `examples/ov.conf.example`: Gemini config snippet
- `pyproject.toml`: `google-genai>=0.8` dep; optional `gemini-async` extra for `anyio`
- `tests/unit/test_gemini_embedder.py`: 41 unit tests (mock-only, no API key needed)
- `tests/unit/test_embedding_config_gemini.py`: 18 config validation tests (incl. case-insensitive task_type)
- `tests/integration/`: 49 integration tests (auto-skip without `GOOGLE_API_KEY`)

Testing
- `test_gemini_embedder.py`
- `test_embedding_config_gemini.py`
- `test_gemini_e2e.py`
- `test_gemini_embedding_it.py`
- `test_gemini_openviking_it.py`

Checklist