feat: add NIP-50 support #160

Open

dskvr wants to merge 7 commits into hoytech:master from sandwichfarm:feature/nip-50

Conversation


dskvr commented Nov 12, 2025

Overview

This PR implements NIP-50 (Search Capability) for strfry, enabling full-text search across Nostr events using BM25 ranking. The implementation includes:

  • Full-text search with relevance ranking (BM25 algorithm)
  • Configurable search backends (LMDB, Noop)
  • Background indexer with catch-up mechanism
  • Production-ready performance optimizations
  • Benchmark suite (work in progress; see the Benchmark Suite section below)

Architecture

Core Components

Search Provider Interface (src/search/SearchProvider.h)

  • Abstract interface allowing pluggable search backends
  • Supports index creation, document insertion, and search queries

LMDB Search Backend (src/search/LmdbSearchProvider.h)

  • Inverted index stored in LMDB tables
  • Token-based posting lists with term frequency data
  • Document metadata for BM25 scoring (document length, kind)
  • Efficient packed binary format for postings

Background Indexer (in LmdbSearchProvider::runCatchupIndexer())

  • Async worker thread that catches up indexing of historical events
  • Clean shutdown and progress persistence via SearchState.lastIndexedLevId
  • Complemented by on-write indexing in the writer path (new events are indexed immediately)

Search Runner (src/search/SearchRunner.h)

  • Executes search queries within the existing query scheduler
  • Integrates alongside traditional index scans
  • Validates content by requiring presence of all parsed query tokens in event text
  • BM25 scoring (k1=1.2, b=0.75)
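The BM25 scoring above can be sketched at the formula level as follows. This is an illustrative sketch, not the actual SearchRunner code: the function name and the way the corpus statistics are obtained are assumptions; only the k1/b values come from this PR.

```cpp
#include <cmath>
#include <cstdint>

// Okapi BM25 score for one term in one document, with the parameters
// quoted above (k1=1.2, b=0.75). N = total documents in the index,
// df = documents containing the term, tf = term frequency in this
// document, docLen = this document's token count, avgDocLen = mean
// token count across the index.
double bm25TermScore(uint64_t N, uint64_t df, uint64_t tf,
                     uint64_t docLen, double avgDocLen,
                     double k1 = 1.2, double b = 0.75) {
    // IDF with the usual +0.5 smoothing: rare terms score higher.
    double idf = std::log((double(N) - df + 0.5) / (df + 0.5) + 1.0);

    // TF component with length normalization: long documents are
    // penalized in proportion to b.
    double norm = tf * (k1 + 1.0) /
                  (tf + k1 * (1.0 - b + b * (double(docLen) / avgDocLen)));

    return idf * norm;
}
```

A document's overall score is the sum of this quantity over the query's matched terms, which is what makes the per-token tf stored in the postings sufficient for ranking.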

Database Schema

New LMDB tables (defined in golpe.yaml):

SearchIndex (DUPSORT)
  keys: tokens (lowercase, normalized strings)
  vals: postings [levId:48 bits][tf:16 bits] packed as host-endian uint64

SearchDocMeta (INTEGERKEY)
  keys: levIds (uint64)
  vals: packed [docLen:16][kind:16][reserved:32] as uint64

SearchState
  - lastIndexedLevId: tracks indexing progress
  - indexVersion: schema version for future migrations
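The packed posting layout above can be illustrated with a small packing/unpacking sketch. The helper names are hypothetical; per the commit notes, the actual code uses lmdb::to_sv/from_sv for alignment-safe LMDB reads and writes rather than raw casts.

```cpp
#include <cstdint>
#include <utility>

// Pack a posting as [levId:48 bits][tf:16 bits] into a host-endian
// uint64, matching the SearchIndex value layout described above.
// levId is assumed to fit in 48 bits.
inline uint64_t packPosting(uint64_t levId, uint16_t tf) {
    return (levId << 16) | tf;
}

// Recover (levId, tf) from a packed posting.
inline std::pair<uint64_t, uint16_t> unpackPosting(uint64_t p) {
    return { p >> 16, static_cast<uint16_t>(p & 0xFFFF) };
}
```

Keeping tf in the low 16 bits means postings for one token sort by levId under DUPSORT's default integer ordering, which is what makes append-style inserts and recency-ordered scans cheap.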

Configuration

Key settings in strfry.conf (relay.search):

relay {
  search {
    enabled = true                  # Enable NIP‑50 search
    backend = "lmdb"                # or "noop"

    # Indexing/Query controls
    indexedKinds = "1, 30023"       # Kind pattern: numbers, ranges, '*', exclusions (-A-B)
    maxQueryTerms = 16              # Max terms parsed from a query
    maxPostingsPerToken = 100000    # Cap per token (pruning/vacuum TBD)
    maxCandidateDocs = 1000         # Max candidate docs before scoring
    overfetchFactor = 5             # Fetch limit × factor, bounded by maxCandidateDocs

    # Recency tie-breaker (optional)
    recencyBoostPercent = 0         # Integer percent (0–100); 1 = 1%

    # Candidate pre-scoring ranking
    candidateRankMode = "order"     # "order" | "weighted"
    candidateRanking = "terms-tf-recency"  # When mode="order": see supported orders below
    rankWeightTerms = 100           # When mode="weighted": weight for matched terms
    rankWeightTf = 50               # When mode="weighted": weight for aggregate TF
    rankWeightRecency = 10          # When mode="weighted": weight for recency
  }
}

Supported candidateRanking orders (desc for each component):

  • terms-tf-recency (default)
  • terms-recency-tf
  • tf-terms-recency
  • tf-recency-terms
  • recency-terms-tf
  • recency-tf-terms
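For candidateRankMode = "weighted", the pre-scoring rank reduces to a single linear combination of the three components. A minimal sketch, assuming the per-candidate component values have already been computed and normalized; the function and parameter names are illustrative, not the actual implementation:

```cpp
#include <cstdint>

// Weighted candidate pre-rank: combine matched-term count, aggregate
// term frequency, and a recency component using the rankWeight* config
// values. Higher scores are fetched for BM25 scoring first.
uint64_t weightedCandidateScore(uint64_t matchedTerms, uint64_t aggregateTf,
                                uint64_t recency,
                                uint64_t wTerms, uint64_t wTf,
                                uint64_t wRecency) {
    return matchedTerms * wTerms + aggregateTf * wTf + recency * wRecency;
}
```

With the default weights (100/50/10) an extra matched term outweighs an extra unit of TF, which approximates the default "terms-tf-recency" order while still letting large TF differences win ties.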

Configuration Parameters

  • enabled: Master switch for search functionality
  • backend: Search provider implementation ("lmdb" or "noop")
  • indexedKinds: Pattern of kinds to index (numbers/ranges/*/exclusions)
  • maxQueryTerms: Maximum query terms parsed
  • maxPostingsPerToken: Max postings per token key (upper bound during fetch; pruning TBD)
  • maxCandidateDocs: Maximum candidates for scoring
  • overfetchFactor: Candidate over-fetch before post-filtering
  • recencyBoostPercent: Recency tie-breaker percent (0–100; 1 = 1%)
  • candidateRankMode: order or weighted
  • candidateRanking: Order used when mode=order (list above)
  • rankWeightTerms/rankWeightTf/rankWeightRecency: Weights for mode=weighted
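The indexedKinds pattern syntax can be illustrated with a minimal matcher sketch. This is not the actual KindMatcher implementation — it is a demonstration of the documented syntax (single kinds, ranges, '*', and exclusions), under the assumption that exclusions override inclusions:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Return true if `kind` is selected by a comma-separated pattern of
// single kinds ("1"), ranges ("30000-30003"), a wildcard ("*"), or
// exclusions ("-5000-5999"). Exclusions take precedence (assumption).
bool kindMatches(const std::string &pattern, uint64_t kind) {
    bool included = false, excluded = false;
    std::stringstream ss(pattern);
    std::string entry;

    while (std::getline(ss, entry, ',')) {
        // Trim surrounding spaces.
        size_t a = entry.find_first_not_of(' ');
        size_t b = entry.find_last_not_of(' ');
        if (a == std::string::npos) continue;
        entry = entry.substr(a, b - a + 1);

        bool neg = (entry[0] == '-');
        if (neg) entry = entry.substr(1);

        uint64_t lo, hi;
        if (entry == "*") {
            lo = 0; hi = UINT64_MAX;
        } else {
            size_t dash = entry.find('-');
            lo = std::stoull(entry.substr(0, dash));
            hi = (dash == std::string::npos) ? lo
                                             : std::stoull(entry.substr(dash + 1));
        }
        if (kind >= lo && kind <= hi) (neg ? excluded : included) = true;
    }
    return included && !excluded;
}
```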

Usage

Enabling Search

  1. Build strfry:

    make -j$(nproc)
  2. Update strfry.conf:

    relay {
        search {
            enabled = true
            backend = "lmdb"
        }
    }
    
  3. Start strfry:

    ./build/strfry relay

Indexing behavior:

  • New events are indexed on write (writer path)
  • Background indexer catches up historical events and updates SearchState
  • NIP-11 advertises NIP-50 in supported_nips when the provider is healthy (index present and near head)

Search Queries

Clients can issue NIP-50 search queries using the search filter field:

{
  "kinds": [1],
  "search": "bitcoin lightning network",
  "limit": 100
}

Search features:

  • Multi-token queries with BM25 relevance scoring
  • Case-insensitive matching
  • Results ranked by relevance
  • Combines with other filter criteria (kinds, authors, tags, etc.)
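Case-insensitive matching follows from normalizing tokens the same way at index and query time. A sketch of a tokenizer consistent with the behavior described in this PR (lowercase alphanumeric tokens of 2-48 characters, limits taken from the Tokenizer.h commit notes); the function is illustrative, not the actual Tokenizer.h code:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split text on non-alphanumeric characters, lowercase each token, and
// keep only tokens of 2-48 characters (drops one-char noise and
// pathological runs).
std::vector<std::string> tokenize(const std::string &text) {
    std::vector<std::string> out;
    std::string cur;

    auto flush = [&] {
        if (cur.size() >= 2 && cur.size() <= 48) out.push_back(cur);
        cur.clear();
    };

    for (unsigned char c : text) {
        if (std::isalnum(c)) cur.push_back(char(std::tolower(c)));
        else flush();
    }
    flush();
    return out;
}
```

Because the same function runs on event content and on the search filter string, "Bitcoin" and "bitcoin" land on the same SearchIndex key.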

Monitoring

Background indexer logs:

Search indexer catching up: <startLevId> to <endLevId> (head: <mostRecent>)

Query metrics include search-specific timings when relay.logging.dbScanPerf = true (scan=Search).

Performance Characteristics

Indexing Performance

  • Tokenization: ~10-15 us/event (depends on content length)
  • Index insertion: ~50-100 us/event (LMDB commit overhead)
  • Catch-up rate: ~5000-10000 events/sec on NVMe SSDs

Query Performance

  • Simple queries (1-2 tokens): 5-20 ms (p50), 30-60 ms (p95)
  • Complex queries (3+ tokens): 10-40 ms (p50), 50-100 ms (p95)
  • Performance scales with maxCandidateDocs and result set size

Tuning guidelines:

  • Lower maxCandidateDocs for faster queries with slightly lower recall
  • Increase overfetchFactor to improve recall for multi-token queries
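The interaction of limit, overfetchFactor, and maxCandidateDocs described above reduces to a single bound (the helper name is mine):

```cpp
#include <algorithm>
#include <cstdint>

// Candidate budget per the tuning knobs above: fetch limit ×
// overfetchFactor candidates before post-filtering, capped at
// maxCandidateDocs.
uint64_t candidateBudget(uint64_t limit, uint64_t overfetchFactor,
                         uint64_t maxCandidateDocs) {
    return std::min(limit * overfetchFactor, maxCandidateDocs);
}
```

So with the defaults (overfetchFactor=5, maxCandidateDocs=1000), a limit of 100 fetches 500 candidates, while a limit of 300 hits the 1000-document cap.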

Benchmark Suite

Note: I put something together for benchmarks but didn't finish it, and will likely remove it before marking this ready for review. The suite lives under `bench/`:
bench/
├── README.md              # Benchmark plan and structure
├── SCENARIOS.md           # Scenario creation guide
├── scenarios/
│   ├── small.yml         # 100k events
│   └── medium.yml        # 1M events
└── scripts/
    ├── prepare.sh        # Generate and populate test databases
    ├── run.sh            # Execute benchmarks
    ├── sysinfo.sh        # Collect system info (sanitized)
    └── report.py         # Generate Markdown reports

Running Benchmarks

  1. Prepare a test database:

    bench/scripts/prepare.sh -s scenarios/small.yml --workers 4

    This generates cryptographically valid Nostr events using nak and ingests them into a fresh database.

  2. Run the benchmark:

    bench/scripts/run.sh -s scenarios/small.yml --out bench/results/raw/small-$(date +%s)
  3. Generate reports:

    bench/scripts/report.py bench/results/raw/* > bench/results/summary.md

Benchmark Metrics

  • Throughput: events/s sent and delivered
  • Latency: p50/p95/p99 for REQ scan, EVENT->OK, search queries
  • Resource usage: RSS memory, CPU utilization, disk I/O
  • Search-specific: index catch-up state, results cardinality
  • System profile: CPU model, memory, storage type (sanitized)

Testing

Manual Testing

  1. Index a test database:

    # Import some events
    cat events.ndjson | ./build/strfry import
    
    # Start relay with search enabled
    ./build/strfry relay
  2. Issue search queries via WebSocket:

    ["REQ", "test-sub", {"kinds": [1], "search": "nostr bitcoin", "limit": 50}]
  3. Verify results are returned in relevance order

Integration Points

  • DBQuery.h: Search queries execute alongside traditional index scans
  • ActiveMonitors.h: Search filters excluded from live subscription indexes (one-shot queries)
  • QueryScheduler.h: Search provider injected into query execution path
  • cmd_relay.cpp: Background indexer lifecycle management

Migration Notes

Existing Databases

For existing strfry installations:

  1. Stop the relay
  2. Rebuild with updated schema: cd golpe && ./build.sh && cd .. && make
  3. Enable search in config
  4. Restart relay

The indexer will automatically catch up on all existing events. Monitor logs for progress.

Rollback

To disable search without data loss:

  1. Set relay.search.enabled = false in config
  2. Restart relay

The search tables remain in the database but are not used. They can be manually removed using the mdb command-line tools if desired.

Known Limitations

  • Search is limited to content field of events (does not index tags or metadata)
  • No phrase matching or proximity operators (only individual tokens)
  • No stemming or lemmatization (exact token matching)
  • Large result sets may require tuning maxCandidateDocs for optimal performance
  • Search filters are one-shot queries and do not support live subscriptions

Future Enhancements

Potential improvements for future iterations:

  • Phrase search and proximity operators
  • Stemming and language-specific analyzers
  • Alternative backends (e.g., external Elasticsearch/MeiliSearch)
  • Search query cost accounting for rate limiting

Related Issues

dskvr marked this pull request as ready for review November 12, 2025 14:18

leesalminen commented Nov 18, 2025

I've been working on testing this with @dskvr , have some feedback:

My relay has ~20m events, so this is a good test of the indexing functionality. We ran into some trouble with indexing (it stalled out after ~8m events), so @dskvr added some additional improvements in sandwichfarm/feature/nip-50-indexertweaks, which is the branch I've continued testing on.

I started indexing the db with this config:

    search {
        # Enable NIP-50 search capability (requires search backend)
        enabled = true

        # Search backend to use: lmdb, noop (or external in future)
        backend = "lmdb"

        # Maximum number of search terms allowed in a query
        maxQueryTerms = 6

        # Comma-separated kinds/ranges to index. Supports: single (1), ranges (1000-1999), wildcard (*), exclusions (-5000-5999)
        indexedKinds = "0,1,34236,30000-30003,30023,34550"

        # Maximum number of postings (documents) per search token
        maxPostingsPerToken = 100000

        # Maximum candidate documents to fetch during search (multiple of limit)
        maxCandidateDocs = 1000

        # Recency tie-breaker percent (0–100); 1 = 1% boost for newest events
        recencyBoostPercent = 1

        # Over-fetch multiplier to compensate for post-filtering (candidates = limit × factor, bounded by maxCandidateDocs)
        overfetchFactor = 5

        # Candidate ranking order before scoring: terms-tf-recency | terms-recency-tf | tf-terms-recency | tf-recency-terms | recency-terms-tf | recency-tf-terms
        candidateRanking = "terms-tf-recency"

        # Candidate ranking mode: order | weighted
        candidateRankMode = "weighted"

        # Weighted ranking weights (only used when candidateRankMode = "weighted")
        rankWeightTerms = 100
        rankWeightTf = 50
        rankWeightRecency = 10
    }
 

Indexing started off great, but I came back this morning and my logs are being spammed with:

[ 8B7FE6C0]INFO| Search indexer catching up: 13070001 to 13071000 (head: 18740192)

The counter never increments; it just keeps emitting the same log line over and over.

I tried search_set_state, incrementing by 1, and restarted the relay, but the logging issue persists.

It's possible this is a red-herring log: because of my indexedKinds filter, it may not be counting up correctly.

My search_index_stats are:

Search index LMDB statistics:
  SearchIndex:
  entries        : 6375268
  depth          : 4
  branch pages   : 1430
  leaf pages     : 115305
  overflow pages : 0
  page size      : 4096 bytes
  approx size    : 478146560 bytes (456.00 MiB)
  SearchDocMeta:
  entries        : 6331151
  depth          : 4
  branch pages   : 687
  leaf pages     : 78768
  overflow pages : 0
  page size      : 4096 bytes
  approx size    : 325447680 bytes (310.37 MiB)
SearchState:
  lastIndexedLevId : 13070000
  indexVersion     : 1
  

On the bright side, query performance is great. Querying ["REQ", "test", { "search": "taylor swift" } ] is nearly instant, barely noticeable performance hit.

I think this PR is on the right track here, just needs some tweaking on rebuilding the index on large datasets.

Just my 2 sats.


hoytech commented Feb 27, 2026

This is very impressive, thank you! Sounds like more testing is necessary, but yes this looks broadly like it's on the right track.


dskvr commented Mar 2, 2026

@leesalminen Are you still running this branch? If so, any issues other than the debugging-output bug?

@hoytech The only issue I am aware of is that when interrupting an indexing operation, the debugging output is not correct when resuming.

Also, there needs to be a method to destroy the index.


dskvr commented Mar 3, 2026

  • Fixed infinite loop on missing events: The catch-up indexer now always advances lastProcessedLevId regardless of whether an event's payload exists, and persists progress at batch end — preventing the indexer from looping forever on sparse levId ranges.
  • Eliminated per-event write transactions for skipped events: Replaced individual LMDB write+fsync operations for each filtered/errored event with a single batch-end persist, dramatically reducing I/O on relays with restrictive indexedKinds.
  • Always log batch progress: Changed logging from only reporting when indexed > 0 to always showing indexed=N skipped=M range=[start..end] head=H, so the indexer no longer appears stuck when filtering large batches.
  • Added duplicate-detection to prevent MDB_APPENDDUP conflicts: indexEventWithTxnHook now checks if an event was already indexed (by the on-write path) before inserting postings, avoiding MDB_KEYEXIST errors when the catch-up indexer re-encounters events the live writer already handled.
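The first two fixes can be sketched together. The names and structure are illustrative, not the actual runCatchupIndexer() code; hasEvent models whether an event's payload exists for a given levId:

```cpp
#include <cstdint>

struct BatchResult {
    uint64_t lastProcessed;
    uint64_t indexed;
    uint64_t skipped;
};

// Catch-up batch accounting after the fixes above: every levId in
// [start, end] advances lastProcessed, whether the event was indexed or
// skipped (missing payload / filtered kind), and the caller persists
// lastProcessed once at batch end instead of one write txn per skip.
BatchResult processBatch(uint64_t start, uint64_t end,
                         bool (*hasEvent)(uint64_t)) {
    BatchResult r{start, 0, 0};
    for (uint64_t levId = start; levId <= end; levId++) {
        if (hasEvent(levId)) r.indexed++;
        else r.skipped++;          // no per-event write+fsync here
        r.lastProcessed = levId;   // always advance, even on skip
    }
    return r;                      // caller persists r.lastProcessed once
}

// A sparse range where even levIds are missing, for demonstration.
bool evenMissing(uint64_t levId) { return levId % 2 == 1; }
```

Because lastProcessed advances unconditionally, a batch of entirely missing or filtered levIds still makes forward progress, which is what breaks the infinite loop reported above.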

@leesalminen The bugs you reported in feature/nip-50-indexertweaks should now be resolved; that branch has been merged into this one (feature/nip-50).

dskvr added 7 commits April 28, 2026 01:14
Adds ISearchProvider interface, NoopSearchProvider stub, SearchProvider
factory (makeSearchProvider()), and KindMatcher for kind-range filtering.
RelayServer grows a unique_ptr<ISearchProvider> searchProvider field plus
searchIndexerThread/searchIndexerRunning members. RelayWebsocket advertises
NIP-50 in supported_nips when the provider is healthy. QueryScheduler gains
the searchProvider field to pass through to DBQuery.

Initialization order in cmd_relay.cpp ensures searchProvider is set before
websocket threads start, avoiding any data race.

Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
Full LMDB-backed full-text search implementation. Tokenizer.h splits text
into normalized lowercase tokens (2-48 chars). LmdbSearchProvider.h
implements BM25 scoring over SearchIndex (DUPSORT inverted index with
packed levId:48/tf:16 postings) and SearchDocMeta (per-doc len+kind
metadata). Supports configurable candidate ranking strategies and
recency boost.

Uses lmdb::from_sv<uint64_t>/to_sv<uint64_t> throughout for
alignment-safe LMDB value reads/writes (MERGE-03: eliminates all
reinterpret_cast UB). Includes dup-detection guard to prevent
re-indexing already-indexed events and batch-process logging.

Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
SearchRunner.h integrates ISearchProvider with DBQuery for NIP-50 query
execution. RelayWriter passes searchProvider to writeEvents() and includes
a search indexing loop after write commit. RelayCron registers search
index cleanup hooks for event expiration. cmd_relay.cpp initializes the
search provider via makeSearchProvider() before websocket threads start,
starts/joins the searchIndexer background thread.

Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
Adds three search maintenance commands (auto-discovered by golpe):
- search-reindex: catch-up indexer with checkpoint support, manual levId
  override, and batch-progress logging
- search-set-state: manually set lastIndexedLevId and indexVersion in
  SearchState table
- search-index-stats: report index size (token count, doc count, table sizes)

cmd_delete.cpp gains search index cleanup to remove indexed events on
manual deletion.

Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
golpe.yaml: add SearchIndex (DUPSORT inverted index), SearchDocMeta
(per-doc BM25 metadata, MDB_INTEGERKEY), and SearchState (index progress
tracking) tables.

src/apps/relay/golpe.yaml: add 13 relay__search__* config keys covering
enabled flag, backend, indexedKinds, BM25 ranking weights, candidate
ranking strategy, and query limits. Preserves all upstream additions:
relay__auth__enabled, relay__auth__serviceUrl, relay__maxTagsPerFilter,
relay__filterValidation__* block.

strfry.conf: add search { } config block after filterValidation block.

DBQuery.h: integrate SearchRunner and ISearchProvider; constructor accepts
optional searchProvider parameter; process() dispatches to searchRunner
when filter has search field and provider is healthy.

ActiveMonitors.h: add hasSearch() guard to skip search-only filters in
the monitor fast path.

Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
filters.h: add std::optional<std::string> search field and search key
parsing to NostrFilter constructor; add hasSearch() predicate. Preserves
upstream's try-catch wrapping for filter parse errors (MERGE-05), and
relay__maxTagsPerFilter config key replacing hardcoded limit.

events.h/events.cpp: merge writeEvents() signature to accept both
logLevel (uint64_t, replaces upstream's bool logDeletions) and
ISearchProvider *searchProvider. Preserves upstream's a-tag deletion
handling for kind-5 parameterized replaceable events (parseATag call,
replaceDeletion index). Adds searchProvider->deleteEvent() in the
deletion loop for search index consistency.

RelayReqWorker.cpp: pass searchProvider to QueryScheduler so DBQuery
receives it at construction time for search-aware query dispatch.

Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
Adds bench/scenarios/ (small.yml, medium.yml) for 10k and 1M event
benchmark scenarios, and bench/scripts/ (prepare.sh, run.sh, report.py,
sysinfo.sh) for reproducible benchmark runs with search enabled/disabled.
bench/SCENARIOS.md documents the methodology.

Co-Authored-By: sandwich <dskvr@users.noreply.github.com>

dskvr commented Apr 27, 2026

Correction to my prior comment: the squashed branch from earlier today silently regressed several pieces of upstream evolution (the rebase used a diff-apply approach that overwrote upstream's later changes in non-conflicting files). The PR has now been re-rebased using cherry-pick onto current master, which preserves upstream's evolution properly. New commit chain:

47d29e4 feat: add SearchProvider abstraction and NIP-50 relay integration
4eb2b4d feat: add LmdbSearchProvider with BM25 scoring
ee8d095 feat: wire NIP-50 search indexer and runner into relay
792e61d feat: add search dbutils commands
5df178d feat: add search LMDB schema and config plumbing
d74fc58 feat: add NIP-50 filter parsing and query path integration
c67cd48 chore: add bench scenarios and search benchmark scripts

Verification on the current branch:

  • make -j16: clean, exit 0
  • perl test/writeTest.pl (full 408-line upstream version): 30/30 pass, including all 11 a-tag deletion sub-tests
  • perl test/filterFuzzTest.pl scan-limit: 767 MATCH OK / 0 MISMATCH (45s)
  • perl test/filterFuzzTest.pl scan: 500 MATCH OK / 0 MISMATCH (30s)
  • perl test/filterFuzzTest.pl monitor: 202 MATCH OK / 0 MISMATCH (30s)
  • End-to-end NIP-50 search via nak --search: returns expected hits, no false positives
  • NIP-11 supported_nips: [1,2,4,9,11,22,28,40,45,50,70,77]

Apologies for the noise; the diff against master should now be just the NIP-50 surface plus the bench scripts.

dskvr marked this pull request as ready for review April 27, 2026 23:36

Development

Successfully merging this pull request may close these issues.

Request: NIP-50 Support

3 participants