
Conversation

vcarl (Member) commented Jan 28, 2026

No description provided.


what-the-diff bot commented Jan 28, 2026

PR Summary

  • Addition of db-backup.sh script
    A new script that backs up the production database from the Kubernetes pod to a local destination. It also copies the WAL (write-ahead log) and SHM (shared-memory) files and runs a quick check to verify the integrity of the copy, adding an extra layer of protection for the data. (A minimal sketch of this flow follows the list.)

  • Addition of db-integrity.sh script
    A script that runs several checks against the local SQLite database and reports its overall health status, verifying foreign key relationships and counting the rows in each table.

  • Addition of db-rebuild.sh script
    A script for rebuilding a SQLite database that has become corrupted. It attempts recovery first and falls back to .dump if that fails, then compares row counts between the original and rebuilt databases and runs an integrity check on the result.
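
A minimal, hypothetical sketch of the backup flow described above, not the script itself; the pod name and database path are taken from the output later in this PR, and the local destination is made up for illustration:

POD=mod-bot-set-0
DB=/data/mod-bot.sqlite3
DEST=./backups/$(date +%F_%H%M%S)   # hypothetical local destination
mkdir -p "$DEST"

# Copy the database and its WAL/SHM sidecar files out of the pod.
kubectl cp "$POD:$DB" "$DEST/mod-bot.sqlite3"
kubectl cp "$POD:${DB}-wal" "$DEST/mod-bot.sqlite3-wal" 2>/dev/null || true
kubectl cp "$POD:${DB}-shm" "$DEST/mod-bot.sqlite3-shm" 2>/dev/null || true

# Quick sanity check on the local copy.
sqlite3 "$DEST/mod-bot.sqlite3" "PRAGMA quick_check;"

Note that kubectl cp needs tar available in the container, and copying a live WAL-mode database this way is only crash-consistent, which is presumably why the later commit switches to the .backup() API.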


github-actions bot commented Jan 28, 2026

Preview environment removed

The preview for this PR has been cleaned up.

vcarl and others added 2 commits January 28, 2026 12:08
Integrity checks now run remotely against the live pod via better-sqlite3
(no downtime). Backups use the .backup() API for consistent snapshots
without WAL/SHM copying. Recovery is a single pipeline that rebuilds
directly on the PVC volume, avoiding slow network transfers of the full DB.

- db-common.sh: shared constants and utilities
- db-integrity.sh: remote checks via kubectl exec + node -e (readonly)
- db-backup.sh: consistent backup via better-sqlite3 .backup() API
- db-recover.sh: full pipeline replacing db-rebuild.sh + db-deploy.sh
- Fixes wrong PVC name (was data-mod-bot-set-0, now mod-bot-pvc-mod-bot-set-0)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
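
As a rough illustration of the approach this commit describes (kubectl exec + node -e against the live pod, read-only checks, and a consistent snapshot via better-sqlite3's backup API), something like the following could work; the pod name and database path come from the output below, the /tmp path is an assumption, and the actual scripts may differ:

POD=mod-bot-set-0
DB=/data/mod-bot.sqlite3

# Read-only integrity checks against the live pod (no downtime).
kubectl exec "$POD" -- node -e "
  const Database = require('better-sqlite3');
  const db = new Database('$DB', { readonly: true });
  console.log('quick_check:', JSON.stringify(db.pragma('quick_check')));
  console.log('foreign_key_check:', JSON.stringify(db.pragma('foreign_key_check')));
  db.close();
"

# Consistent snapshot via the backup API, then copy it out of the pod.
kubectl exec "$POD" -- node -e "
  const Database = require('better-sqlite3');
  const db = new Database('$DB', { readonly: true });
  db.backup('/tmp/mod-bot-backup.sqlite3')
    .then(() => db.close())
    .catch((err) => { console.error(err); process.exit(1); });
"
kubectl cp "$POD:/tmp/mod-bot-backup.sqlite3" ./mod-bot-backup.sqlite3
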
vcarl (Member, Author) commented Jan 28, 2026

Example output:

integrity check

$ ./scripts/db-integrity.sh
Database Integrity Report (Remote)
Pod: mod-bot-set-0
Database: /data/mod-bot.sqlite3
Date: Wed Jan 28 13:30:00 EST 2026
1. PRAGMA quick_check
   Status: FAILED
   Details:
   *** in database main ***
Tree 33 page 33 right child: Bad ptr map entry key=46402 expected=(5,33) got=(5,46040)
Tree 33 page 33 right child: Bad ptr map entry key=46404 expected=(5,33) got=(5,46040)
Tree 33 page 46404 cell 27: Rowid 1208388 out of order
Tree 37 page 37 right child: Bad ptr map entry key=46417 expected=(5,37) got=(5,46040)
Tree 37 page 37 right child: Bad ptr map entry key=46403 expected=(5,37) got=(5,45873)
Tree 37 page 37 right child: Bad ptr map entry key=46436 expected=(5,37) got=(5,46040)
Tree 37 page 46436 cell 31: Rowid 1209119 out of order
Tree 37 page 37 right child: Bad ptr map entry key=46405 expected=(5,37) got=(5,46040)
Tree 13 page 46040 right child: 2nd reference to page 46436
Tree 13 page 46040 right child: 2nd reference to page 46417
Tree 13 page 46040 right child: 2nd reference to page 46405
Tree 13 page 46040 right child: 2nd reference to page 46404
Tree 13 page 46040 right child: 2nd reference to page 46402
Tree 14 page 45873 right child: 2nd reference to page 46403
   wrong # of entries in index user_threads_pk
   wrong # of entries in index idx_reported_user_guild
   wrong # of entries in index sqlite_autoindex_message_stats_1

2. PRAGMA foreign_key_check
   Status: PASSED (no violations)

3. Table Row Counts
   channel_info                  163
   escalation_records             19
   escalations                    10
   guild_subscriptions             0
   guilds                          8
   honeypot_config                 1
   kysely_migration               18
   kysely_migration_lock           1
   message_stats             1209455
   reactji_channeler_config        2
   reported_messages            2627
   sessions                       23
   tickets_config                  9
   user_threads                ERROR
   users                           6
   ---                           ---
   TOTAL                     1212342

4. Database Configuration
   Journal mode:   wal
   Page size:      4096
   Page count:     46452
   Freelist count: 0

5. Overall Health Status
   Status: ISSUES DETECTED
   Review the findings above. Consider running:
     ./scripts/db-recover.sh
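
For context on section 3 above, per-table counts (including the ERROR rows) can be gathered remotely with a loop over sqlite_master; this is an illustrative query, assuming better-sqlite3 is resolvable from the pod's working directory, and is not necessarily what db-integrity.sh does:

kubectl exec -i mod-bot-set-0 -- node - <<'EOF'
const Database = require('better-sqlite3');
const db = new Database('/data/mod-bot.sqlite3', { readonly: true });
const tables = db.prepare(
  "SELECT name FROM sqlite_master WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
).all();
for (const { name } of tables) {
  try {
    const { n } = db.prepare(`SELECT COUNT(*) AS n FROM "${name}"`).get();
    console.log(name, n);
  } catch (err) {
    console.log(name, 'ERROR'); // corruption in the table or its indexes surfaces here
  }
}
db.close();
EOF
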
recovery

$ ./scripts/db-recover.sh
Database Recovery Pipeline
Date: 2026-01-28_133435
This script will:
  - Create a recovery pod attached to the data volume
  - Scale down production (bot will be offline)
  - Rebuild the database on the volume
  - Scale production back up

Continue? [y/N] y

=== Step 1: Creating recovery pod (will stay Pending until PVC is free) ===
pod/db-recovery-temp created
Recovery pod created (Pending)

=== Step 2: Scaling down mod-bot-set ===
Current replicas: 1
statefulset.apps/mod-bot-set scaled
Waiting for pod to terminate...
pod/mod-bot-set-0 condition met
StatefulSet scaled down

=== Step 3: Waiting for recovery pod to become Ready ===
pod/db-recovery-temp condition met
Recovery pod is Ready

=== Step 4: Installing sqlite3 on recovery pod ===
OK: 10.6 MiB in 20 packages
sqlite3 installed

=== Step 5: Backing up corrupt files on volume ===
total 182M
-rw-------    1 root     root      181.5M Jan 28 18:34 mod-bot.sqlite3
-rw-------    1 root     root       32.0K Jan 28 18:34 mod-bot.sqlite3-shm
-rw-------    1 root     root           0 Jan 28 18:34 mod-bot.sqlite3-wal
Corrupt files backed up to /data/corrupt-bak-2026-01-28_133435

=== Step 6: Attempting WAL checkpoint (best-effort) ===
0|0|0
WAL checkpoint succeeded
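
# For reference (not part of the script output): "0|0|0" is the busy|log|checkpointed
# result of SQLite's WAL checkpoint, issued on the volume roughly as:
#   sqlite3 /data/mod-bot.sqlite3 "PRAGMA wal_checkpoint;"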

=== Step 7: Checking database integrity on volume ===
*** in database main ***
Tree 33 page 33 right child: Bad ptr map entry key=46402 expected=(5,33) got=(5,46040)
Tree 33 page 33 right child: Bad ptr map entry key=46404 expected=(5,33) got=(5,46040)
Tree 33 page 46404 cell 27: Rowid 1208388 out of order
Tree 37 page 37 right child: Bad ptr map entry key=46417 expected=(5,37) got=(5,46040)

=== Step 8: Rebuilding database on volume ===
Using .recover (preferred for corrupt databases)...
          defensive off
Rebuild via .recover succeeded
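
# (Illustrative, not part of the script output.) The rebuild step amounts to piping a
# recovery dump into a fresh database file on the same volume, e.g.:
#   sqlite3 /data/mod-bot.sqlite3 ".recover" | sqlite3 /data/mod-bot.rebuilt.sqlite3
# with ".dump" substituted for ".recover" as the fallback when recovery fails.
# (The rebuilt filename here is hypothetical.)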

=== Step 9: Verifying rebuilt database ===
  Integrity check: PASSED
  Foreign key check: PASSED

=== Step 10: Comparing row counts (corrupt vs rebuilt) ===
Table                             Corrupt    Rebuilt       Diff
-----                             -------    -------       ----
channel_info                          163        163          0
escalation_records                     19         19          0
escalations                            10         10          0
guild_subscriptions                     0          0          0
guilds                                  8          8          0
honeypot_config                         1          1          0
kysely_migration                       18         18          0
kysely_migration_lock                   1          1          0
message_stats                     1209455    1209455          0
reactji_channeler_config                2          2          0
reported_messages                    2627       2627          0
sessions                               23         23          0
tickets_config                          9          9          0
user_threads                          ERR       1419        N/A
users                                   6          6          0
-----                             -------    -------       ----
TOTAL                             1212342    1213761       1419

=== Step 11: Confirm deployment ===

  Rebuild method:  .recover
  Corrupt DB size: 181.5M
  Rebuilt DB size: 188.0M
  Corrupt backup:  /data/corrupt-bak-2026-01-28_133435

This will replace the production database with the rebuilt copy.
Deploy rebuilt database? [y/N] y

=== Step 12: Swapping rebuilt database into place ===
Database swapped

=== Step 13: Scaling up mod-bot-set ===
statefulset.apps/mod-bot-set scaled
StatefulSet scaling up to 1 replicas

=== Step 14: Waiting for pod readiness ===
pod/mod-bot-set-0 condition met
Pod mod-bot-set-0 is Ready

=== Step 15: Verifying deployment ===
Integrity check: PASSED
Foreign key check: PASSED
Total rows: 1213742

=== Step 16: Cleaning up recovery pod ===
pod "db-recovery-temp" deleted
Recovery pod deleted

=== Recovery Complete ===
The database has been rebuilt and deployed successfully.
Corrupt backup preserved at: /data/corrupt-bak-2026-01-28_133435 (on the volume)
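
For readers reconstructing the flow, the kubectl orchestration in the transcript above reduces to a handful of commands. The following is a condensed, hypothetical skeleton based only on this output (the image, pod spec, and swap step are assumptions), not the actual db-recover.sh:

# Recovery pod bound to the same PVC; it stays Pending until the PVC is released.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: db-recovery-temp
spec:
  containers:
    - name: recovery
      image: alpine:3                       # assumption, consistent with the apk output above
      command: ["tail", "-f", "/dev/null"]  # keep the pod alive for kubectl exec
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: mod-bot-pvc-mod-bot-set-0
EOF

# Take production offline so the recovery pod can bind the volume.
kubectl scale statefulset mod-bot-set --replicas=0
kubectl wait --for=delete pod/mod-bot-set-0 --timeout=120s
kubectl wait --for=condition=Ready pod/db-recovery-temp --timeout=120s

# Install sqlite3, rebuild on the volume (see the .recover sketch above), verify,
# then swap the rebuilt file into place after backing up the corrupt copy, e.g.:
#   kubectl exec db-recovery-temp -- apk add --no-cache sqlite
#   kubectl exec db-recovery-temp -- sh -c \
#     'mv /data/mod-bot.rebuilt.sqlite3 /data/mod-bot.sqlite3'

# Bring production back and clean up.
kubectl scale statefulset mod-bot-set --replicas=1
kubectl wait --for=condition=Ready pod/mod-bot-set-0 --timeout=300s
kubectl delete pod db-recovery-temp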

vcarl merged commit fb30ce4 into main Jan 28, 2026
5 checks passed
vcarl deleted the vc-db-recovery-fix branch January 28, 2026 19:37