feat: add NIP-50 support #160
I've been testing this with @dskvr and have some feedback.
My relay has ~20M events, so this is a good test of the indexing functionality. We ran into some trouble with indexing (it stalled out after ~8M events), so @dskvr added some additional improvements in
I started indexing the db with this config:
Indexing started out running great, but when I came back this morning my logs were being spammed with:
Where the counter never increments. It just keeps sending this same log over and over. I tried
It's possible this is a red-herring log, where because of my
My
On the bright side, query performance is great. Querying
I think this PR is on the right track; it just needs some tweaking around rebuilding the index on large datasets. Just my 2 sats. |
|
This is very impressive, thank you! Sounds like more testing is necessary, but yes this looks broadly like it's on the right track. |
|
@leesalminen Are you still running this branch? If so, have you hit any issues other than the debugging output bug?
@hoytech The only issue I am aware of is that when an indexing operation is interrupted, the debugging output is incorrect on resume. Also, there needs to be a method to destroy the index. |
@leesalminen Bugs you reported in |
Adds ISearchProvider interface, NoopSearchProvider stub, SearchProvider factory (makeSearchProvider()), and KindMatcher for kind-range filtering. RelayServer grows a unique_ptr<ISearchProvider> searchProvider field plus searchIndexerThread/searchIndexerRunning members. RelayWebsocket advertises NIP-50 in supported_nips when the provider is healthy. QueryScheduler gains the searchProvider field to pass through to DBQuery. Initialization order in cmd_relay.cpp ensures searchProvider is set before websocket threads start, avoiding any data race.
Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
Full LMDB-backed full-text search implementation. Tokenizer.h splits text into normalized lowercase tokens (2-48 chars). LmdbSearchProvider.h implements BM25 scoring over SearchIndex (DUPSORT inverted index with packed levId:48/tf:16 postings) and SearchDocMeta (per-doc len+kind metadata). Supports configurable candidate ranking strategies and recency boost. Uses lmdb::from_sv<uint64_t>/to_sv<uint64_t> throughout for alignment-safe LMDB value reads/writes (MERGE-03: eliminates all reinterpret_cast UB). Includes dup-detection guard to prevent re-indexing already-indexed events and batch-process logging.
Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
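For readers following along, here is a rough sketch of the two low-level pieces this commit message describes: splitting text into lowercase tokens of 2-48 characters, and packing a posting as levId:48/tf:16 into a single 64-bit value. This is not the code in Tokenizer.h or LmdbSearchProvider.h. The function names, the ASCII-only splitting rule, dropping (rather than truncating) out-of-range tokens, placing levId in the high bits, and saturating tf at 65535 are all assumptions made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Length window taken from the "2-48 chars" note in the commit message.
constexpr std::size_t kMinTokenLen = 2;
constexpr std::size_t kMaxTokenLen = 48;

// Split on every byte that is not ASCII alphanumeric and lowercase the rest.
// Assumption: tokens outside the length window are dropped, not truncated.
// In the default "C" locale, non-ASCII (UTF-8) bytes act as separators here.
std::vector<std::string> tokenize(std::string_view text) {
    std::vector<std::string> out;
    std::string cur;
    auto flush = [&] {
        if (cur.size() >= kMinTokenLen && cur.size() <= kMaxTokenLen) out.push_back(cur);
        cur.clear();
    };
    for (unsigned char c : text) {
        if (std::isalnum(c)) cur.push_back(static_cast<char>(std::tolower(c)));
        else flush();
    }
    flush();
    return out;
}

constexpr std::uint64_t kTfMask = 0xFFFF;                // low 16 bits: term frequency
constexpr std::uint64_t kLevIdMask = (1ULL << 48) - 1;   // 48-bit levId

// Assumption: levId occupies the high 48 bits and tf saturates at 65535.
std::uint64_t packPosting(std::uint64_t levId, std::uint64_t tf) {
    if (tf > kTfMask) tf = kTfMask;
    return ((levId & kLevIdMask) << 16) | tf;
}

std::uint64_t postingLevId(std::uint64_t posting) { return posting >> 16; }
std::uint64_t postingTf(std::uint64_t posting) { return posting & kTfMask; }
```

With these definitions, `tokenize("Bitcoin & Lightning-Network a")` yields `bitcoin`, `lightning`, `network` (the one-character token is dropped), and `packPosting(levId, tf)` round-trips through `postingLevId` and `postingTf` for any levId below 2^48.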
SearchRunner.h integrates ISearchProvider with DBQuery for NIP-50 query execution. RelayWriter passes searchProvider to writeEvents() and includes a search indexing loop after write commit. RelayCron registers search index cleanup hooks for event expiration. cmd_relay.cpp initializes the search provider via makeSearchProvider() before websocket threads start, starts/joins the searchIndexer background thread.
Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
Adds three search maintenance commands (auto-discovered by golpe):
- search-reindex: catch-up indexer with checkpoint support, manual levId override, and batch-progress logging
- search-set-state: manually set lastIndexedLevId and indexVersion in SearchState table
- search-index-stats: report index size (token count, doc count, table sizes)
cmd_delete.cpp gains search index cleanup to remove indexed events on manual deletion.
Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
golpe.yaml: add SearchIndex (DUPSORT inverted index), SearchDocMeta
(per-doc BM25 metadata, MDB_INTEGERKEY), and SearchState (index progress
tracking) tables.
src/apps/relay/golpe.yaml: add 13 relay__search__* config keys covering
enabled flag, backend, indexedKinds, BM25 ranking weights, candidate
ranking strategy, and query limits. Preserves all upstream additions:
relay__auth__enabled, relay__auth__serviceUrl, relay__maxTagsPerFilter,
relay__filterValidation__* block.
strfry.conf: add search { } config block after filterValidation block.
DBQuery.h: integrate SearchRunner and ISearchProvider; constructor accepts
optional searchProvider parameter; process() dispatches to searchRunner
when filter has search field and provider is healthy.
ActiveMonitors.h: add hasSearch() guard to skip search-only filters in
the monitor fast path.
Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
filters.h: add std::optional<std::string> search field and search key parsing to NostrFilter constructor; add hasSearch() predicate. Preserves upstream's try-catch wrapping for filter parse errors (MERGE-05), and relay__maxTagsPerFilter config key replacing hardcoded limit.
events.h/events.cpp: merge writeEvents() signature to accept both logLevel (uint64_t, replaces upstream's bool logDeletions) and ISearchProvider *searchProvider. Preserves upstream's a-tag deletion handling for kind-5 parameterized replaceable events (parseATag call, replaceDeletion index). Adds searchProvider->deleteEvent() in the deletion loop for search index consistency.
RelayReqWorker.cpp: pass searchProvider to QueryScheduler so DBQuery receives it at construction time for search-aware query dispatch.
Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
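To make the dispatch rule from these commit messages concrete (use the search path only when the filter has a search field and the provider is healthy), here is a small sketch. `Filter`, `Provider`, `Route`, and `chooseRoute` are hypothetical stand-ins, not the real NostrFilter, ISearchProvider, or DBQuery types, and falling back to a normal index scan when the provider is missing or unhealthy is an assumption about the intended behavior.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for NostrFilter: only the new optional search field.
struct Filter {
    std::optional<std::string> search;
    bool hasSearch() const { return search.has_value(); }
};

// Hypothetical stand-in for ISearchProvider: only a health flag.
struct Provider {
    bool ok = true;
    bool healthy() const { return ok; }
};

enum class Route { IndexScan, Search };

// Take the search path only when the filter asks for it and a healthy
// provider is available; otherwise use the regular index scan (assumed).
Route chooseRoute(const Filter &filter, const Provider *provider) {
    if (filter.hasSearch() && provider != nullptr && provider->healthy()) {
        return Route::Search;
    }
    return Route::IndexScan;
}
```

Passing the provider as a nullable pointer mirrors the "optional searchProvider parameter" wording above, so a build or config without search never touches the search path.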
Adds bench/scenarios/ (small.yml, medium.yml) for 10k and 1M event benchmark scenarios, and bench/scripts/ (prepare.sh, run.sh, report.py, sysinfo.sh) for reproducible benchmark runs with search enabled/disabled. bench/SCENARIOS.md documents the methodology.
Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
|
Correction to my prior comment: the squashed branch from earlier today silently regressed several pieces of upstream evolution (the rebase used a diff-apply approach that overwrote upstream's later changes in non-conflicting files). The PR has now been re-rebased using cherry-pick onto current
Verification on the current branch:
Apologies for the noise; the diff against |
Overview
This PR implements NIP-50 (Search Capability) for strfry, enabling full-text search across Nostr events using BM25 ranking. The implementation includes:
Architecture
Core Components
- Search Provider Interface (`src/search/SearchProvider.h`)
- LMDB Search Backend (`src/search/LmdbSearchProvider.h`)
- Background Indexer (in `LmdbSearchProvider::runCatchupIndexer()`), with progress tracked in `SearchState.lastIndexedLevId`
- Search Runner (`src/search/SearchRunner.h`)
Database Schema
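New LMDB tables (defined in `golpe.yaml`):
- SearchIndex: DUPSORT inverted index
- SearchDocMeta: per-doc BM25 metadata (MDB_INTEGERKEY)
- SearchState: index progress tracking
Configuration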
Key settings in `strfry.conf` (`relay.search`):
Supported `candidateRanking` orders (desc for each component):
- `terms-tf-recency` (default)
- `terms-recency-tf`
- `tf-terms-recency`
- `tf-recency-terms`
- `recency-terms-tf`
- `recency-tf-terms`
Configuration Parameters
- `enabled`: Master switch for search functionality
- `backend`: Search provider implementation ("lmdb" or "noop")
- `indexedKinds`: Pattern of kinds to index (numbers/ranges/*/exclusions)
- `maxQueryTerms`: Maximum query terms parsed
- `maxPostingsPerToken`: Max postings per token key (upper bound during fetch; pruning TBD)
- `maxCandidateDocs`: Maximum candidates for scoring
- `overfetchFactor`: Candidate over-fetch before post-filtering
- `recencyBoostPercent`: Recency tie-breaker percent (0–100; 1 = 1%)
- `candidateRankMode`: `order` or `weighted`
- `candidateRanking`: Order used when mode=`order` (list above)
- `rankWeightTerms` / `rankWeightTf` / `rankWeightRecency`: Weights for mode=`weighted`
Usage
Enabling Search
Build strfry:
    make -j$(nproc)
Update `strfry.conf`:
Start strfry:
Indexing behavior:
Search Queries
Clients can issue NIP-50 search queries using the `search` filter field:
    { "kinds": [1], "search": "bitcoin lightning network", "limit": 100 }
Search features:
Monitoring
Background indexer logs:
Query metrics include search-specific timings when `relay.logging.dbScanPerf = true` (`scan=Search`).
Performance Characteristics
Indexing Performance
Query Performance
- `maxCandidateDocs` and result set size
Tuning guidelines:
- Lower `maxCandidateDocs` for faster queries with slightly lower recall
- Raise `overfetchFactor` to improve recall for multi-token queries
Benchmark Suite
I put something together for benchmarks but didn't finish it. I will likely remove it before marking this ready for review.
A comprehensive benchmark suite is included under `bench/`:
Running Benchmarks
Prepare a test database:
This generates cryptographically valid Nostr events using `nak` and ingests them into a fresh database.
Run the benchmark:
    bench/scripts/run.sh -s scenarios/small.yml --out bench/results/raw/small-$(date +%s)
Generate reports:
Benchmark Metrics
Testing
Manual Testing
Index a test database:
Issue search queries via WebSocket:
Verify results are returned in relevance order
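When checking relevance order by hand, it can help to have the scoring formula nearby. The sketch below is the textbook Okapi BM25 contribution of one query term to one document, not the scoring code from LmdbSearchProvider.h. The function name and the `k1 = 1.2`, `b = 0.75` defaults are assumptions (common defaults, not values stated in this PR), and the PR's candidate ranking and recency boost are not modeled.

```cpp
#include <cmath>
#include <cstddef>

// Textbook Okapi BM25 score of a single query term for a single document.
//   tf           term frequency in the document
//   docLen       document length in tokens
//   avgDocLen    average document length across the index
//   numDocs      total number of indexed documents
//   docsWithTerm number of documents containing the term
// k1 and b are common defaults, assumed here rather than taken from the PR.
double bm25Term(double tf, double docLen, double avgDocLen,
                std::size_t numDocs, std::size_t docsWithTerm,
                double k1 = 1.2, double b = 0.75) {
    const double N = static_cast<double>(numDocs);
    const double n = static_cast<double>(docsWithTerm);
    // Inverse document frequency: rarer terms weigh more.
    const double idf = std::log(1.0 + (N - n + 0.5) / (n + 0.5));
    // Length normalization: long documents need more hits for the same score.
    const double norm = tf + k1 * (1.0 - b + b * docLen / avgDocLen);
    return idf * tf * (k1 + 1.0) / norm;
}
```

A document's score for a multi-term query is the sum of `bm25Term` over the query terms. All else being equal, a higher tf, a rarer term, or a shorter document gives a higher score, which is the ordering to look for in the returned results.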
Integration Points
- `DBQuery.h`: Search queries execute alongside traditional index scans
- `ActiveMonitors.h`: Search filters excluded from live subscription indexes (one-shot queries)
- `QueryScheduler.h`: Search provider injected into query execution path
- `cmd_relay.cpp`: Background indexer lifecycle management
Migration Notes
Existing Databases
For existing strfry installations:
    cd golpe && ./build.sh && cd .. && make
The indexer will automatically catch up on all existing events. Monitor logs for progress.
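The catch-up behavior can be pictured as a checkpointed batch loop: index everything above the last recorded levId, and move the checkpoint forward after each batch so an interrupted run resumes where it stopped. The sketch below uses an in-memory `std::map` instead of LMDB. `EventTable`, `IndexState`, and `catchUp` are hypothetical names; only the `lastIndexedLevId` field name comes from this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Hypothetical in-memory stand-ins for the event table and SearchState.
using EventTable = std::map<std::uint64_t, std::string>;  // levId -> content
struct IndexState {
    std::uint64_t lastIndexedLevId = 0;
};

// Index every event with levId > state.lastIndexedLevId in batches and
// advance the checkpoint after each batch. Returns the number of events
// indexed by this call. shouldStop is polled between batches, so the
// checkpoint always points at the end of a fully indexed batch.
std::size_t catchUp(const EventTable &events, IndexState &state, std::size_t batchSize,
                    const std::function<void(std::uint64_t, const std::string &)> &indexOne,
                    const std::function<bool()> &shouldStop) {
    if (batchSize == 0) batchSize = 1;  // avoid a loop that never advances
    std::size_t total = 0;
    auto it = events.upper_bound(state.lastIndexedLevId);
    while (it != events.end() && !shouldStop()) {
        std::size_t n = 0;
        std::uint64_t last = state.lastIndexedLevId;
        for (; it != events.end() && n < batchSize; ++it, ++n) {
            indexOne(it->first, it->second);
            last = it->first;
        }
        // With LMDB, this update would be committed in the same transaction
        // as the batch's postings (an assumption about the real indexer).
        state.lastIndexedLevId = last;
        total += n;
    }
    return total;
}
```

Because the loop starts from `upper_bound(lastIndexedLevId)`, a second call after an interruption continues with the first unindexed event, and any progress counter derived from the return value only counts work actually done in that run.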
Rollback
To disable search without data loss:
Set `relay.search.enabled = false` in config
The search tables remain in the database but are not used. They can be manually removed using the `mdb` command-line tools if desired.
Known Limitations
- Only the `content` field of events is indexed (does not index tags or metadata)
- `maxCandidateDocs` for optimal performance
Future Enhancements
Potential improvements for future iterations:
Related Issues