Stream backup archive directly to S3 to reduce ephemeral storage #4263
Conversation
Agent-Logs-Url: https://github.com/sillsdev/TheCombine/sessions/aa2742e4-8a85-48c9-82a2-3d3d7ee75063
Co-authored-by: imnasnainaec <6411521+imnasnainaec@users.noreply.github.com>
The streaming approach (`push_stream`) starts the S3 upload before the tarball is fully created. If any step fails mid-stream, a partial backup is silently uploaded to S3. Two possible approaches to fix this:

1. Add cleanup on failure. In the except/finally handling, if an error occurs during tar creation, delete the partial S3 object. You could call `aws s3 rm` on the S3 URI after the upload process completes (or is killed). This would require adding a delete/rm method to `AwsBackup` and wrapping the streaming section in a try/except that catches both `Exception` and `SystemExit`.
2. Kill the upload process on failure. Instead of letting the `aws` process complete the partial upload, kill it (`upload_proc.kill()`) in the error path. For multipart uploads, an incomplete upload never becomes a visible object, and an S3 lifecycle policy can clean up the abandoned parts. However, for small uploads that complete as a single PUT, the object might still appear.

The most robust approach is a combination: kill the upload process on failure and then attempt to delete the S3 object (see the sketch below). The cleanup code should be in a finally/except block that catches `SystemExit` (raised by `sys.exit` in `run_cmd`) in addition to regular exceptions.

Relevant files: `combine_backup.py` (main function around lines 97-130), `aws_backup.py` (`push_stream` method; would need a `delete` method).
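For illustration, here is a minimal sketch of the combined approach on the caller side, roughly what `combine_backup.py` would do. The `aws`, `dest`, and `backup_dirs` names are placeholders, and the exact error handling in the committed code may differ.

```python
import logging
import sys
import tarfile


def stream_backup(aws, dest, backup_dirs):
    """Stream a gzipped tarball of backup_dirs to S3, cleaning up on failure."""
    # Assumes aws.push_stream(dest) returns a subprocess.Popen whose stdin is
    # piped to `aws s3 cp - s3://...`, and aws.delete(dest) runs `aws s3 rm`.
    upload_proc = aws.push_stream(dest)
    try:
        with tarfile.open(fileobj=upload_proc.stdin, mode="w:gz") as tar:
            for backup_dir in backup_dirs:
                tar.add(backup_dir)
    except (Exception, SystemExit):
        # Tar creation failed mid-stream: stop the upload, then try to remove
        # whatever partial object may already have landed in S3.
        upload_proc.kill()
        upload_proc.wait()
        try:
            aws.delete(dest)
        except Exception as err:
            logging.warning("Could not remove partial backup %s: %s", dest, err)
        raise
    # Close stdin so the aws process sees EOF, then verify the upload succeeded.
    upload_proc.stdin.close()
    if upload_proc.wait() != 0:
        aws.delete(dest)
        sys.exit("S3 upload failed.")
```

Catching `SystemExit` matters here because `run_cmd` uses `sys.exit` on failure, which would otherwise bypass an `except Exception` handler.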
Agent-Logs-Url: https://github.com/sillsdev/TheCombine/sessions/ea2b91ff-fe07-4dcc-bca0-8bc9ef35f46b
Co-authored-by: imnasnainaec <6411521+imnasnainaec@users.noreply.github.com>
Implemented the combined approach in c4d01a2.
Devin review: https://app.devin.ai/review/sillsdev/TheCombine/pull/4263
The daily backup job was consuming ~14 GB of node ephemeral storage because the tar.gz archive was written to local disk in full before being uploaded to S3, making peak usage roughly `raw_files + archive_size`.

Changes

- `aws_backup.py`: Add `push_stream(dest)`, which launches `aws s3 cp - s3://...` as a subprocess and returns the `Popen` handle for callers to stream data via `stdin`. Add `delete(dest)`, which removes an S3 object (uses `check_results=False` so a missing object does not raise an error). A sketch of these two helpers follows below.
- `combine_backup.py`: Replace the local `tarfile.open(file, "x:gz")` + `aws.push(file)` with `tarfile.open(fileobj=upload_proc.stdin, mode="w:gz")`, piping the archive directly to S3 as it is built. The archive is never materialized on disk. On any failure during tar creation (`Exception` or `SystemExit`), the upload process is killed and the partial S3 object is deleted. A warning is logged if the cleanup deletion itself fails. The same delete-and-warn logic applies if the upload process exits with a non-zero return code.

Peak ephemeral storage drops from roughly `max(db_dump, backend_files) + archive_size` to just `max(db_dump, backend_files)`.
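For illustration, a minimal sketch of what the two `AwsBackup` helpers could look like, assuming a simple bucket/key layout; the real signatures and `check_results` plumbing in `aws_backup.py` may differ.

```python
import subprocess


class AwsBackup:
    """Sketch of the two helpers described above (not the actual source)."""

    def __init__(self, bucket: str) -> None:
        self.bucket = bucket

    def push_stream(self, dest: str) -> subprocess.Popen:
        # `aws s3 cp - s3://...` reads the object body from stdin, so the
        # caller can pipe the tarball into the returned process as it is built.
        return subprocess.Popen(
            ["aws", "s3", "cp", "-", f"s3://{self.bucket}/{dest}"],
            stdin=subprocess.PIPE,
        )

    def delete(self, dest: str) -> None:
        # Remove a (possibly partial) object; check=False mirrors the PR's
        # check_results=False so a missing object is not treated as an error.
        subprocess.run(
            ["aws", "s3", "rm", f"s3://{self.bucket}/{dest}"],
            check=False,
        )
```

The caller closes the returned process's `stdin` once the tarball is complete and checks the return code, as in the earlier sketch.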