
Stream backup archive directly to S3 to reduce ephemeral storage #4263

Draft
Copilot wants to merge 3 commits into master from copilot/reduce-ephemeral-storage-backup-job

Conversation

Contributor

Copilot AI commented Apr 10, 2026

Devin review: https://app.devin.ai/review/sillsdev/TheCombine/pull/4263

The daily backup job was consuming ~14 GB of node ephemeral storage because the tar.gz archive was written to local disk in full before being uploaded to S3, making peak usage roughly raw_files + archive_size.

Changes

  • aws_backup.py: Add push_stream(dest), which launches aws s3 cp - s3://... as a subprocess and returns the Popen handle so callers can stream data via stdin. Add delete(dest), which removes an S3 object (using check_results=False so a missing object does not raise an error).
  • combine_backup.py: Replace the local tarfile.open(file, "x:gz") + aws.push(file) sequence with tarfile.open(fileobj=upload_proc.stdin, mode="w:gz"), piping the archive directly to S3 as it is built; the archive is never materialized on disk. On any failure during tar creation (Exception or SystemExit), the upload process is killed and the partial S3 object is deleted, with a warning logged if the cleanup deletion itself fails. The same delete-and-warn logic applies if the upload process exits with a non-zero return code.
```python
# Before: write full archive to disk, then upload
with tarfile.open(backup_file, "x:gz") as tar:
    tar.add(...)
aws.push(backup_file, aws_file)

# After: stream archive directly to S3, with cleanup on failure
upload_proc = aws.push_stream(aws_file)
try:
    with tarfile.open(fileobj=upload_proc.stdin, mode="w:gz") as tar:
        tar.add(...)
except (Exception, SystemExit):
    upload_proc.kill()
    upload_proc.wait()
    aws.delete(aws_file)
    raise
```
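The two new AwsBackup helpers could look roughly like this. This is a sketch, not the PR's actual code: the method names push_stream and delete come from the PR description, but the bucket wiring, the _uri helper, and the subprocess details are assumptions.

```python
import subprocess


class AwsBackup:
    """Sketch of the S3 helper; constructor and bucket handling are assumed."""

    def __init__(self, bucket: str) -> None:
        self.bucket = bucket

    def _uri(self, dest: str) -> str:
        return f"s3://{self.bucket}/{dest}"

    def push_stream(self, dest: str) -> subprocess.Popen:
        # "aws s3 cp - <uri>" reads the object body from stdin, so the
        # caller can stream the archive into the upload as it is built.
        return subprocess.Popen(
            ["aws", "s3", "cp", "-", self._uri(dest)], stdin=subprocess.PIPE
        )

    def delete(self, dest: str) -> None:
        # check=False plays the role of the PR's check_results=False:
        # deleting an object that was never created is not an error.
        subprocess.run(["aws", "s3", "rm", self._uri(dest)], check=False)
```

The key design point is that push_stream returns the Popen handle rather than blocking, so the caller owns both the stdin pipe (to feed the tar stream) and the process lifecycle (to kill it and clean up on failure).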

Peak ephemeral storage drops from roughly max(db_dump, backend_files) + archive_size to just max(db_dump, backend_files).



Copilot AI changed the title from "[WIP] Reduce ephemeral storage used by backup job" to "Stream backup archive directly to S3 to reduce ephemeral storage" on Apr 10, 2026
Copilot AI requested a review from imnasnainaec April 10, 2026 21:41
@imnasnainaec
Collaborator

@copilot

The streaming approach (push_stream) starts the S3 upload before the tarball is fully created. If any step fails mid-stream, a partial backup gets silently uploaded to S3. Two possible approaches to fix this:

Add cleanup on failure: In the except/finally handling, if an error occurs during tar creation, delete the partial S3 object. You could call aws s3 rm on the S3 URI after the upload process completes (or is killed). This would require adding a delete/rm method to AwsBackup and wrapping the streaming section in a try/except that catches both Exception and SystemExit.

Kill the upload process on failure: Instead of letting the aws process complete the partial upload, kill it (upload_proc.kill()) in the error path. For multipart uploads, AWS automatically cleans up incomplete multipart uploads (or an S3 lifecycle policy can do this). However, for small uploads that complete as a single PUT, the object might still appear.
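For reference, the lifecycle rule mentioned above could look like the following. This is an illustrative configuration, not part of the PR; the rule ID is a placeholder, and it would be applied with aws s3api put-bucket-lifecycle-configuration.

```json
{
  "Rules": [
    {
      "ID": "abort-stale-multipart-uploads",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 1 }
    }
  ]
}
```

This only covers the multipart case; as noted above, a small upload that completes as a single PUT still needs an explicit delete.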

The most robust approach is a combination: kill the upload process on failure AND then attempt to delete the S3 object. The cleanup code should be in a finally/except block that catches SystemExit (raised by sys.exit in run_cmd) in addition to regular exceptions.

Relevant files: combine_backup.py (main function around lines 97-130), aws_backup.py (push_stream method, would need a delete method).

Contributor Author

Copilot AI commented Apr 23, 2026

@copilot

The streaming approach (push_stream) starts the S3 upload before the tarball is fully created. If any step fails mid-stream, a partial backup gets silently uploaded to S3. Two possible ap...

Implemented the combined approach in c4d01a2:

  • Added AwsBackup.delete() in aws_backup.py (uses check_results=False so a missing object doesn't raise).
  • Wrapped the streaming section in except (Exception, SystemExit). On any failure, the upload process is killed (upload_proc.kill() + wait()), then aws.delete(aws_file) removes any partial S3 object, with a warning logged if the deletion itself fails. The original exception is then re-raised.
  • Also calls aws.delete + logs a warning if the upload process exits with a non-zero code after the tar finishes.



Development

Successfully merging this pull request may close these issues.

Reduce ephemeral storage used by backup job

2 participants