feat: add NIP-50 support #160

Open

dskvr wants to merge 7 commits into hoytech:master from sandwichfarm:feature/nip-50

Conversation


dskvr commented Nov 12, 2025

Overview

This PR implements NIP-50 (Search Capability) for strfry, enabling full-text search across Nostr events using BM25 ranking. The implementation includes:

  • Full-text search with relevance ranking (BM25 algorithm)
  • Configurable search backends (LMDB, Noop)
  • Background indexer with catch-up mechanism
  • Production-ready performance optimizations
  • Benchmark suite (work in progress; see the Benchmark Suite section below)

Architecture

Core Components

Search Provider Interface (src/search/SearchProvider.h)

  • Abstract interface allowing pluggable search backends
  • Supports index creation, document insertion, and search queries

LMDB Search Backend (src/search/LmdbSearchProvider.h)

  • Inverted index stored in LMDB tables
  • Token-based posting lists with term frequency data
  • Document metadata for BM25 scoring (document length, kind)
  • Efficient packed binary format for postings

Background Indexer (in LmdbSearchProvider::runCatchupIndexer())

  • Async worker thread that catches up indexing of historical events
  • Clean shutdown and progress persistence via SearchState.lastIndexedLevId
  • Complemented by on-write indexing in the writer path (new events are indexed immediately)

Search Runner (src/search/SearchRunner.h)

  • Executes search queries within the existing query scheduler
  • Integrates alongside traditional index scans
  • Validates content by requiring presence of all parsed query tokens in event text
  • BM25 scoring (k1=1.2, b=0.75)
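The BM25 scoring above can be sketched at the formula level as follows. This is an illustrative sketch, not the actual SearchRunner code: the function name and the way the corpus statistics are obtained are assumptions; only the k1/b values come from this PR.

```cpp
#include <cmath>
#include <cstdint>

// Okapi BM25 score for one term in one document, with the parameters
// quoted above (k1=1.2, b=0.75). N = total documents in the index,
// df = documents containing the term, tf = term frequency in this
// document, docLen = this document's token count, avgDocLen = mean
// token count across the index.
double bm25TermScore(uint64_t N, uint64_t df, uint64_t tf,
                     uint64_t docLen, double avgDocLen,
                     double k1 = 1.2, double b = 0.75) {
    // IDF with the usual +0.5 smoothing: rare terms score higher.
    double idf = std::log((double(N) - df + 0.5) / (df + 0.5) + 1.0);

    // TF component with length normalization: long documents are
    // penalized in proportion to b.
    double norm = tf * (k1 + 1.0) /
                  (tf + k1 * (1.0 - b + b * (double(docLen) / avgDocLen)));

    return idf * norm;
}
```

A document's overall score is the sum of this quantity over the query's matched terms, which is what makes the per-token tf stored in the postings sufficient for ranking.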

Database Schema

New LMDB tables (defined in golpe.yaml):

SearchIndex (DUPSORT)
  keys: tokens (lowercase, normalized strings)
  vals: postings [levId:48 bits][tf:16 bits] packed as host-endian uint64

SearchDocMeta (INTEGERKEY)
  keys: levIds (uint64)
  vals: packed [docLen:16][kind:16][reserved:32] as uint64

SearchState
  - lastIndexedLevId: tracks indexing progress
  - indexVersion: schema version for future migrations
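The packed posting layout above can be illustrated with a small packing/unpacking sketch. The helper names are hypothetical; per the commit notes, the actual code uses lmdb::to_sv/from_sv for alignment-safe LMDB reads and writes rather than raw casts.

```cpp
#include <cstdint>
#include <utility>

// Pack a posting as [levId:48 bits][tf:16 bits] into a host-endian
// uint64, matching the SearchIndex value layout described above.
// levId is assumed to fit in 48 bits.
inline uint64_t packPosting(uint64_t levId, uint16_t tf) {
    return (levId << 16) | tf;
}

// Recover (levId, tf) from a packed posting.
inline std::pair<uint64_t, uint16_t> unpackPosting(uint64_t p) {
    return { p >> 16, static_cast<uint16_t>(p & 0xFFFF) };
}
```

Keeping tf in the low 16 bits means postings for one token sort by levId under DUPSORT's default integer ordering, which is what makes append-style inserts and recency-ordered scans cheap.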

Configuration

Key settings in strfry.conf (relay.search):

relay {
  search {
    enabled = true                  # Enable NIP‑50 search
    backend = "lmdb"                # or "noop"

    # Indexing/Query controls
    indexedKinds = "1, 30023"       # Kind pattern: numbers, ranges, '*', exclusions (-A-B)
    maxQueryTerms = 16              # Max terms parsed from a query
    maxPostingsPerToken = 100000    # Cap per token (pruning/vacuum TBD)
    maxCandidateDocs = 1000         # Max candidate docs before scoring
    overfetchFactor = 5             # Fetch limit × factor, bounded by maxCandidateDocs

    # Recency tie-breaker (optional)
    recencyBoostPercent = 0         # Integer percent (0–100); 1 = 1%

    # Candidate pre-scoring ranking
    candidateRankMode = "order"     # "order" | "weighted"
    candidateRanking = "terms-tf-recency"  # When mode="order": see supported orders below
    rankWeightTerms = 100           # When mode="weighted": weight for matched terms
    rankWeightTf = 50               # When mode="weighted": weight for aggregate TF
    rankWeightRecency = 10          # When mode="weighted": weight for recency
  }
}

Supported candidateRanking orders (desc for each component):

  • terms-tf-recency (default)
  • terms-recency-tf
  • tf-terms-recency
  • tf-recency-terms
  • recency-terms-tf
  • recency-tf-terms
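For candidateRankMode = "weighted", the pre-scoring rank reduces to a single linear combination of the three components. A minimal sketch, assuming the per-candidate component values have already been computed and normalized; the function and parameter names are illustrative, not the actual implementation:

```cpp
#include <cstdint>

// Weighted candidate pre-rank: combine matched-term count, aggregate
// term frequency, and a recency component using the rankWeight* config
// values. Higher scores are fetched for BM25 scoring first.
uint64_t weightedCandidateScore(uint64_t matchedTerms, uint64_t aggregateTf,
                                uint64_t recency,
                                uint64_t wTerms, uint64_t wTf,
                                uint64_t wRecency) {
    return matchedTerms * wTerms + aggregateTf * wTf + recency * wRecency;
}
```

With the default weights (100/50/10) an extra matched term outweighs an extra unit of TF, which approximates the default "terms-tf-recency" order while still letting large TF differences win ties.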

Configuration Parameters

  • enabled: Master switch for search functionality
  • backend: Search provider implementation ("lmdb" or "noop")
  • indexedKinds: Pattern of kinds to index (numbers/ranges/*/exclusions)
  • maxQueryTerms: Maximum query terms parsed
  • maxPostingsPerToken: Max postings per token key (upper bound during fetch; pruning TBD)
  • maxCandidateDocs: Maximum candidates for scoring
  • overfetchFactor: Candidate over-fetch before post-filtering
  • recencyBoostPercent: Recency tie-breaker percent (0–100; 1 = 1%)
  • candidateRankMode: order or weighted
  • candidateRanking: Order used when mode=order (list above)
  • rankWeightTerms/rankWeightTf/rankWeightRecency: Weights for mode=weighted
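The indexedKinds pattern syntax can be illustrated with a minimal matcher sketch. This is not the actual KindMatcher implementation — it is a demonstration of the documented syntax (single kinds, ranges, '*', and exclusions), under the assumption that exclusions override inclusions:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Return true if `kind` is selected by a comma-separated pattern of
// single kinds ("1"), ranges ("30000-30003"), a wildcard ("*"), or
// exclusions ("-5000-5999"). Exclusions take precedence (assumption).
bool kindMatches(const std::string &pattern, uint64_t kind) {
    bool included = false, excluded = false;
    std::stringstream ss(pattern);
    std::string entry;

    while (std::getline(ss, entry, ',')) {
        // Trim surrounding spaces.
        size_t a = entry.find_first_not_of(' ');
        size_t b = entry.find_last_not_of(' ');
        if (a == std::string::npos) continue;
        entry = entry.substr(a, b - a + 1);

        bool neg = (entry[0] == '-');
        if (neg) entry = entry.substr(1);

        uint64_t lo, hi;
        if (entry == "*") {
            lo = 0; hi = UINT64_MAX;
        } else {
            size_t dash = entry.find('-');
            lo = std::stoull(entry.substr(0, dash));
            hi = (dash == std::string::npos) ? lo
                                             : std::stoull(entry.substr(dash + 1));
        }
        if (kind >= lo && kind <= hi) (neg ? excluded : included) = true;
    }
    return included && !excluded;
}
```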

Usage

Enabling Search

  1. Build strfry:

    make -j$(nproc)
  2. Update strfry.conf:

    relay {
        search {
            enabled = true
            backend = "lmdb"
        }
    }
    
  3. Start strfry:

    ./build/strfry relay

Indexing behavior:

  • New events are indexed on write (writer path)
  • Background indexer catches up historical events and updates SearchState
  • NIP-11 advertises NIP-50 in supported_nips when the provider is healthy (index present and near head)

Search Queries

Clients can issue NIP-50 search queries using the search filter field:

{
  "kinds": [1],
  "search": "bitcoin lightning network",
  "limit": 100
}

Search features:

  • Multi-token queries with BM25 relevance scoring
  • Case-insensitive matching
  • Results ranked by relevance
  • Combines with other filter criteria (kinds, authors, tags, etc.)
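Case-insensitive matching follows from normalizing tokens the same way at index and query time. A sketch of a tokenizer consistent with the behavior described in this PR (lowercase alphanumeric tokens of 2-48 characters, limits taken from the Tokenizer.h commit notes); the function is illustrative, not the actual Tokenizer.h code:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split text on non-alphanumeric characters, lowercase each token, and
// keep only tokens of 2-48 characters (drops one-char noise and
// pathological runs).
std::vector<std::string> tokenize(const std::string &text) {
    std::vector<std::string> out;
    std::string cur;

    auto flush = [&] {
        if (cur.size() >= 2 && cur.size() <= 48) out.push_back(cur);
        cur.clear();
    };

    for (unsigned char c : text) {
        if (std::isalnum(c)) cur.push_back(char(std::tolower(c)));
        else flush();
    }
    flush();
    return out;
}
```

Because the same function runs on event content and on the search filter string, "Bitcoin" and "bitcoin" land on the same SearchIndex key.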

Monitoring

Background indexer logs:

Search indexer catching up: <startLevId> to <endLevId> (head: <mostRecent>)

Query metrics include search-specific timings when relay.logging.dbScanPerf = true (scan=Search).

Performance Characteristics

Indexing Performance

  • Tokenization: ~10-15 us/event (depends on content length)
  • Index insertion: ~50-100 us/event (LMDB commit overhead)
  • Catch-up rate: ~5000-10000 events/sec on NVMe SSDs

Query Performance

  • Simple queries (1-2 tokens): 5-20 ms (p50), 30-60 ms (p95)
  • Complex queries (3+ tokens): 10-40 ms (p50), 50-100 ms (p95)
  • Performance scales with maxCandidateDocs and result set size

Tuning guidelines:

  • Lower maxCandidateDocs for faster queries with slightly lower recall
  • Increase overfetchFactor to improve recall for multi-token queries
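The interaction of limit, overfetchFactor, and maxCandidateDocs described above reduces to a single bound (the helper name is mine):

```cpp
#include <algorithm>
#include <cstdint>

// Candidate budget per the tuning knobs above: fetch limit ×
// overfetchFactor candidates before post-filtering, capped at
// maxCandidateDocs.
uint64_t candidateBudget(uint64_t limit, uint64_t overfetchFactor,
                         uint64_t maxCandidateDocs) {
    return std::min(limit * overfetchFactor, maxCandidateDocs);
}
```

So with the defaults (overfetchFactor=5, maxCandidateDocs=1000), a limit of 100 fetches 500 candidates, while a limit of 300 hits the 1000-document cap.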

Benchmark Suite

Note: I put something together for benchmarks but didn't finish it, and will likely remove it before marking this ready for review. The suite lives under `bench/`:
bench/
├── README.md              # Benchmark plan and structure
├── SCENARIOS.md           # Scenario creation guide
├── scenarios/
│   ├── small.yml         # 100k events
│   └── medium.yml        # 1M events
└── scripts/
    ├── prepare.sh        # Generate and populate test databases
    ├── run.sh            # Execute benchmarks
    ├── sysinfo.sh        # Collect system info (sanitized)
    └── report.py         # Generate Markdown reports

Running Benchmarks

  1. Prepare a test database:

    bench/scripts/prepare.sh -s scenarios/small.yml --workers 4

    This generates cryptographically valid Nostr events using nak and ingests them into a fresh database.

  2. Run the benchmark:

    bench/scripts/run.sh -s scenarios/small.yml --out bench/results/raw/small-$(date +%s)
  3. Generate reports:

    bench/scripts/report.py bench/results/raw/* > bench/results/summary.md

Benchmark Metrics

  • Throughput: events/s sent and delivered
  • Latency: p50/p95/p99 for REQ scan, EVENT->OK, search queries
  • Resource usage: RSS memory, CPU utilization, disk I/O
  • Search-specific: index catch-up state, results cardinality
  • System profile: CPU model, memory, storage type (sanitized)

Testing

Manual Testing

  1. Index a test database:

    # Import some events
    cat events.ndjson | ./build/strfry import
    
    # Start relay with search enabled
    ./build/strfry relay
  2. Issue search queries via WebSocket:

    ["REQ", "test-sub", {"kinds": [1], "search": "nostr bitcoin", "limit": 50}]
  3. Verify results are returned in relevance order

Integration Points

  • DBQuery.h: Search queries execute alongside traditional index scans
  • ActiveMonitors.h: Search filters excluded from live subscription indexes (one-shot queries)
  • QueryScheduler.h: Search provider injected into query execution path
  • cmd_relay.cpp: Background indexer lifecycle management

Migration Notes

Existing Databases

For existing strfry installations:

  1. Stop the relay
  2. Rebuild with updated schema: cd golpe && ./build.sh && cd .. && make
  3. Enable search in config
  4. Restart relay

The indexer will automatically catch up on all existing events. Monitor logs for progress.

Rollback

To disable search without data loss:

  1. Set relay.search.enabled = false in config
  2. Restart relay

The search tables remain in the database but are not used. They can be manually removed using the mdb command-line tools if desired.

Known Limitations

  • Search is limited to content field of events (does not index tags or metadata)
  • No phrase matching or proximity operators (only individual tokens)
  • No stemming or lemmatization (exact token matching)
  • Large result sets may require tuning maxCandidateDocs for optimal performance
  • Search filters are one-shot queries and do not support live subscriptions

Future Enhancements

Potential improvements for future iterations:

  • Phrase search and proximity operators
  • Stemming and language-specific analyzers
  • Alternative backends (e.g., external Elasticsearch/MeiliSearch)
  • Search query cost accounting for rate limiting

Related Issues

dskvr marked this pull request as ready for review November 12, 2025 14:18

leesalminen commented Nov 18, 2025

I've been working on testing this with @dskvr , have some feedback:

My relay has ~20m events, so this is a good test of the indexing functionality. We ran into some trouble with indexing (it stalled out after ~8m events), so @dskvr added some additional improvements in sandwichfarm/feature/nip-50-indexertweaks, which is the branch I've continued testing on.

I started indexing the db with this config:

    search {
        # Enable NIP-50 search capability (requires search backend)
        enabled = true

        # Search backend to use: lmdb, noop (or external in future)
        backend = "lmdb"

        # Maximum number of search terms allowed in a query
        maxQueryTerms = 6

        # Comma-separated kinds/ranges to index. Supports: single (1), ranges (1000-1999), wildcard (*), exclusions (-5000-5999)
        indexedKinds = "0,1,34236,30000-30003,30023,34550"

        # Maximum number of postings (documents) per search token
        maxPostingsPerToken = 100000

        # Maximum candidate documents to fetch during search (multiple of limit)
        maxCandidateDocs = 1000

        # Recency tie-breaker percent (0–100); 1 = 1% boost for newest events
        recencyBoostPercent = 1

        # Over-fetch multiplier to compensate for post-filtering (candidates = limit × factor, bounded by maxCandidateDocs)
        overfetchFactor = 5

        # Candidate ranking order before scoring: terms-tf-recency | terms-recency-tf | tf-terms-recency | tf-recency-terms | recency-terms-tf | recency-tf-terms
        candidateRanking = "terms-tf-recency"

        # Candidate ranking mode: order | weighted
        candidateRankMode = "weighted"

        # Weighted ranking weights (only used when candidateRankMode = "weighted")
        rankWeightTerms = 100
        rankWeightTf = 50
        rankWeightRecency = 10
    }
 

Indexing started off great, but I came back this morning and my logs are being spammed with:

[ 8B7FE6C0]INFO| Search indexer catching up: 13070001 to 13071000 (head: 18740192)

The counter never increments; it just keeps emitting the same log line over and over.

I tried search_set_state, incrementing by 1, and restarted the relay, but the logging issue persists.

It's possible this is a red-herring log: because of my indexedKinds filter, it may not be counting up correctly.

My search_index_stats are:

Search index LMDB statistics:
  SearchIndex:
  entries        : 6375268
  depth          : 4
  branch pages   : 1430
  leaf pages     : 115305
  overflow pages : 0
  page size      : 4096 bytes
  approx size    : 478146560 bytes (456.00 MiB)
  SearchDocMeta:
  entries        : 6331151
  depth          : 4
  branch pages   : 687
  leaf pages     : 78768
  overflow pages : 0
  page size      : 4096 bytes
  approx size    : 325447680 bytes (310.37 MiB)
SearchState:
  lastIndexedLevId : 13070000
  indexVersion     : 1
  

On the bright side, query performance is great. Querying ["REQ", "test", { "search": "taylor swift" } ] is nearly instant, barely noticeable performance hit.

I think this PR is on the right track here, just needs some tweaking on rebuilding the index on large datasets.

Just my 2 sats.


hoytech commented Feb 27, 2026

This is very impressive, thank you! Sounds like more testing is necessary, but yes this looks broadly like it's on the right track.


dskvr commented Mar 2, 2026

@leesalminen Are you still running this branch? If so, any issues other than the debugging-output bug?

@hoytech The only issue I am aware of is that when interrupting an indexing operation, the debugging output is not correct when resuming.

Also, there needs to be a method to destroy the index.


dskvr commented Mar 3, 2026

  • Fixed infinite loop on missing events: The catch-up indexer now always advances lastProcessedLevId regardless of whether an event's payload exists, and persists progress at batch end — preventing the indexer from looping forever on sparse levId ranges.
  • Eliminated per-event write transactions for skipped events: Replaced individual LMDB write+fsync operations for each filtered/errored event with a single batch-end persist, dramatically reducing I/O on relays with restrictive indexedKinds.
  • Always log batch progress: Changed logging from only reporting when indexed > 0 to always showing indexed=N skipped=M range=[start..end] head=H, so the indexer no longer appears stuck when filtering large batches.
  • Added duplicate-detection to prevent MDB_APPENDDUP conflicts: indexEventWithTxnHook now checks if an event was already indexed (by the on-write path) before inserting postings, avoiding MDB_KEYEXIST errors when the catch-up indexer re-encounters events the live writer already handled.
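The first two fixes can be sketched together. The names and structure are illustrative, not the actual runCatchupIndexer() code; hasEvent models whether an event's payload exists for a given levId:

```cpp
#include <cstdint>

struct BatchResult {
    uint64_t lastProcessed;
    uint64_t indexed;
    uint64_t skipped;
};

// Catch-up batch accounting after the fixes above: every levId in
// [start, end] advances lastProcessed, whether the event was indexed or
// skipped (missing payload / filtered kind), and the caller persists
// lastProcessed once at batch end instead of one write txn per skip.
BatchResult processBatch(uint64_t start, uint64_t end,
                         bool (*hasEvent)(uint64_t)) {
    BatchResult r{start, 0, 0};
    for (uint64_t levId = start; levId <= end; levId++) {
        if (hasEvent(levId)) r.indexed++;
        else r.skipped++;          // no per-event write+fsync here
        r.lastProcessed = levId;   // always advance, even on skip
    }
    return r;                      // caller persists r.lastProcessed once
}

// A sparse range where even levIds are missing, for demonstration.
bool evenMissing(uint64_t levId) { return levId % 2 == 1; }
```

Because lastProcessed advances unconditionally, a batch of entirely missing or filtered levIds still makes forward progress, which is what breaks the infinite loop reported above.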

@leesalminen The bugs you reported in feature/nip-50-indexertweaks should now be resolved; that branch has been merged into this one (feature/nip-50).

dskvr added 7 commits April 28, 2026 01:14
Adds ISearchProvider interface, NoopSearchProvider stub, SearchProvider
factory (makeSearchProvider()), and KindMatcher for kind-range filtering.
RelayServer grows a unique_ptr<ISearchProvider> searchProvider field plus
searchIndexerThread/searchIndexerRunning members. RelayWebsocket advertises
NIP-50 in supported_nips when the provider is healthy. QueryScheduler gains
the searchProvider field to pass through to DBQuery.

Initialization order in cmd_relay.cpp ensures searchProvider is set before
websocket threads start, avoiding any data race.

Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
Full LMDB-backed full-text search implementation. Tokenizer.h splits text
into normalized lowercase tokens (2-48 chars). LmdbSearchProvider.h
implements BM25 scoring over SearchIndex (DUPSORT inverted index with
packed levId:48/tf:16 postings) and SearchDocMeta (per-doc len+kind
metadata). Supports configurable candidate ranking strategies and
recency boost.

Uses lmdb::from_sv<uint64_t>/to_sv<uint64_t> throughout for
alignment-safe LMDB value reads/writes (MERGE-03: eliminates all
reinterpret_cast UB). Includes dup-detection guard to prevent
re-indexing already-indexed events and batch-process logging.

Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
SearchRunner.h integrates ISearchProvider with DBQuery for NIP-50 query
execution. RelayWriter passes searchProvider to writeEvents() and includes
a search indexing loop after write commit. RelayCron registers search
index cleanup hooks for event expiration. cmd_relay.cpp initializes the
search provider via makeSearchProvider() before websocket threads start,
starts/joins the searchIndexer background thread.

Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
Adds three search maintenance commands (auto-discovered by golpe):
- search-reindex: catch-up indexer with checkpoint support, manual levId
  override, and batch-progress logging
- search-set-state: manually set lastIndexedLevId and indexVersion in
  SearchState table
- search-index-stats: report index size (token count, doc count, table sizes)

cmd_delete.cpp gains search index cleanup to remove indexed events on
manual deletion.

Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
golpe.yaml: add SearchIndex (DUPSORT inverted index), SearchDocMeta
(per-doc BM25 metadata, MDB_INTEGERKEY), and SearchState (index progress
tracking) tables.

src/apps/relay/golpe.yaml: add 13 relay__search__* config keys covering
enabled flag, backend, indexedKinds, BM25 ranking weights, candidate
ranking strategy, and query limits. Preserves all upstream additions:
relay__auth__enabled, relay__auth__serviceUrl, relay__maxTagsPerFilter,
relay__filterValidation__* block.

strfry.conf: add search { } config block after filterValidation block.

DBQuery.h: integrate SearchRunner and ISearchProvider; constructor accepts
optional searchProvider parameter; process() dispatches to searchRunner
when filter has search field and provider is healthy.

ActiveMonitors.h: add hasSearch() guard to skip search-only filters in
the monitor fast path.

Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
filters.h: add std::optional<std::string> search field and search key
parsing to NostrFilter constructor; add hasSearch() predicate. Preserves
upstream's try-catch wrapping for filter parse errors (MERGE-05), and
relay__maxTagsPerFilter config key replacing hardcoded limit.

events.h/events.cpp: merge writeEvents() signature to accept both
logLevel (uint64_t, replaces upstream's bool logDeletions) and
ISearchProvider *searchProvider. Preserves upstream's a-tag deletion
handling for kind-5 parameterized replaceable events (parseATag call,
replaceDeletion index). Adds searchProvider->deleteEvent() in the
deletion loop for search index consistency.

RelayReqWorker.cpp: pass searchProvider to QueryScheduler so DBQuery
receives it at construction time for search-aware query dispatch.

Co-Authored-By: sandwich <dskvr@users.noreply.github.com>
Adds bench/scenarios/ (small.yml, medium.yml) for 10k and 1M event
benchmark scenarios, and bench/scripts/ (prepare.sh, run.sh, report.py,
sysinfo.sh) for reproducible benchmark runs with search enabled/disabled.
bench/SCENARIOS.md documents the methodology.

Co-Authored-By: sandwich <dskvr@users.noreply.github.com>

dskvr commented Apr 27, 2026

Correction to my prior comment: the squashed branch from earlier today silently regressed several pieces of upstream evolution (the rebase used a diff-apply approach that overwrote upstream's later changes in non-conflicting files). The PR has now been re-rebased using cherry-pick onto current master, which preserves upstream's evolution properly. New commit chain:

47d29e4 feat: add SearchProvider abstraction and NIP-50 relay integration
4eb2b4d feat: add LmdbSearchProvider with BM25 scoring
ee8d095 feat: wire NIP-50 search indexer and runner into relay
792e61d feat: add search dbutils commands
5df178d feat: add search LMDB schema and config plumbing
d74fc58 feat: add NIP-50 filter parsing and query path integration
c67cd48 chore: add bench scenarios and search benchmark scripts

Verification on the current branch:

  • make -j16: clean, exit 0
  • perl test/writeTest.pl (full 408-line upstream version): 30/30 pass, including all 11 a-tag deletion sub-tests
  • perl test/filterFuzzTest.pl scan-limit: 767 MATCH OK / 0 MISMATCH (45s)
  • perl test/filterFuzzTest.pl scan: 500 MATCH OK / 0 MISMATCH (30s)
  • perl test/filterFuzzTest.pl monitor: 202 MATCH OK / 0 MISMATCH (30s)
  • End-to-end NIP-50 search via nak --search: returns expected hits, no false positives
  • NIP-11 supported_nips: [1,2,4,9,11,22,28,40,45,50,70,77]

Apologies for the noise; the diff against master should now be just the NIP-50 surface plus the bench scripts.

dskvr marked this pull request as ready for review April 27, 2026 23:36

Development

Successfully merging this pull request may close these issues.

Request: NIP-50 Support

3 participants