Compute worker - Improve status update and logs #2223

Open
Didayolo wants to merge 10 commits into develop from fix-worker-logs

Conversation

Didayolo (Member) commented Feb 28, 2026

Description

  1. Make _update_status robust to failures
  2. Log the traceback instead of uninformative messages
  3. Set the status to Failed on any exception, to avoid getting stuck in Running
  4. Add "best effort" logging with the push_logs method
  5. Remove the container on failure, to avoid orphan containers
  6. Some minor fixes
  7. Improve show_progress clarity and avoid overlogging
  8. In tasks.py, queue the compute worker task only after commit, to avoid fetching too soon and raising a 404 error

Overall, this change should fix the "stuck in Running" bug and allow more logs to reach the platform's front end. It may also fix the orphan containers issue.
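
The worker-side items in the description (robust status updates, traceback logging, Failed on any exception, best-effort log pushing, container removal on failure) can be illustrated with a minimal sketch. This is not the code from the PR: the class `SketchRun`, its method names, and the three injected callables are hypothetical stand-ins so that the example runs without a platform or a container runtime.

```python
import logging
import traceback

logger = logging.getLogger("compute_worker_sketch")


class SketchRun:
    """Hypothetical stand-in for the worker's run object (not the PR's code)."""

    def __init__(self, patch_status, send_logs, remove_container):
        # The three callables are injected so the sketch runs without a platform.
        self._patch_status = patch_status
        self._send_logs = send_logs
        self._remove_container = remove_container

    def update_status(self, status, details=""):
        """Report a status; log the traceback and return False instead of raising."""
        try:
            self._patch_status(status, details)
            return True
        except Exception:
            logger.error("Failed to update status to %s:\n%s", status, traceback.format_exc())
            return False

    def push_logs(self, text):
        """Best-effort log upload: a failure here must not mask the original error."""
        try:
            self._send_logs(text)
        except Exception:
            logger.warning("Could not push logs:\n%s", traceback.format_exc())

    def execute(self, job):
        """Run the job; any exception ends in Failed rather than a stuck Running."""
        self.update_status("Running")
        try:
            job()
        except Exception:
            tb = traceback.format_exc()
            logger.error("Run failed:\n%s", tb)
            self.push_logs(tb)
            self.update_status("Failed", details=tb.splitlines()[-1])
            try:
                self._remove_container()  # avoid leaving an orphan container
            except Exception:
                logger.warning("Container removal failed:\n%s", traceback.format_exc())
            return "Failed"
        self.update_status("Finished")
        return "Finished"
```

In this sketch every reporting step is wrapped separately, so a failing status update or log push cannot prevent the Failed transition or the container cleanup from being attempted.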

Issues this PR resolves

A checklist for hand testing

  • Check successful and failing submissions
  • Check logs inside container and on the platform

Checklist

  • Code review by me
  • Hand tested by me
  • I'm proud of my work
  • Code review by reviewer
  • Hand tested by reviewer
  • CircleCi tests are passing
  • Ready to merge

@Didayolo Didayolo mentioned this pull request Feb 28, 2026
16 tasks

Didayolo commented Feb 28, 2026

CircleCI error:

E           AssertionError: Locator expected to be visible
E           Actual value: None
E           Error: element(s) not found 
E           Call log:
E             - Expect "to_be_visible" with timeout 2000ms
E             - waiting for get_by_role("cell", name="Finished")

test_submission.py:46: AssertionError
=========================== short test summary info ============================
FAILED test_submission.py::test_basic[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name="Finished")
FAILED test_submission.py::test_v15[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name="Finished")
FAILED test_submission.py::test_irisV15_code[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name="Finished")
FAILED test_submission.py::test_irisV15_result[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name="Finished")
FAILED test_submission.py::test_v18[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name="Finished")
============== 5 failed, 6 passed, 2 skipped in 220.01s (0:03:40) ==============

Exited with code exit status 1

When I run the E2E test competitions and submissions manually, I do get the Finished state, though.

Some logs in the artefacts:

django-1          | 2026-02-28 05:06:29.206 | WARNING  | django.utils.log:log_response:246 - Not Found: /api/submissions/163/
django-1          | 2026-02-28 05:06:29.207 | WARNING  | django.utils.log:log_response:246 - Not Found: /api/submissions/163/
compute_worker-1  | 2026-02-28 05:06:29.208 | ERROR    | compute_worker:_update_submission:545 - Submission patch failed with status = 404, and response = 
compute_worker-1  | b'{"detail":"No Submission matches the given query."}'
compute_worker-1  | 2026-02-28 05:06:29.208 | ERROR    | compute_worker:_update_status:561 - Failed to update submission status to Failed: Failure updating submission data.
compute_worker-1  | Traceback (most recent call last):
compute_worker-1  | 
compute_worker-1  |   File "/app/compute_worker.py", line 683, in _get_bundle
compute_worker-1  |     with ZipFile(bundle_file, "r") as z:
compute_worker-1  |          │       └ '/codabench/uPK-421_sID-163__pqs91qp9/bundles/tmpz9en08b8'
compute_worker-1  |          └ <class 'zipfile.ZipFile'>
compute_worker-1  | 
compute_worker-1  |   File "/root/.local/share/uv/python/cpython-3.9.20-linux-x86_64-gnu/lib/python3.9/zipfile.py", line 1268, in __init__
compute_worker-1  |     self._RealGetContents()
compute_worker-1  |     │    └ <function ZipFile._RealGetContents at 0x7e44bca6daf0>
compute_worker-1  |<zipfile.ZipFile [closed]>
compute_worker-1  |   File "/root/.local/share/uv/python/cpython-3.9.20-linux-x86_64-gnu/lib/python3.9/zipfile.py", line 1335, in _RealGetContents
compute_worker-1  |     raise BadZipFile("File is not a zip file")
compute_worker-1  |<class 'zipfile.BadZipFile'>
compute_worker-1  | 2026-02-28 05:06:29.220 | INFO     | compute_worker:_update_submission:539 - Updating submission @ http://django:8000/api/submissions/164/ with data = {'status': 'Running', 'status_details': 'ingestion_hostname-local_worker', 'secret': '43efdd09-8cae-4c48-a67b-1528667d0123'}
django-1          | 2026-02-28 05:06:29.237 | WARNING  | django.utils.log:log_response:246 - Not Found: /api/submissions/164/
django-1          | 2026-02-28 05:06:29.237 | WARNING  | django.utils.log:log_response:246 - Not Found: /api/submissions/164/
compute_worker-1  | 2026-02-28 05:06:29.238 | ERROR    | compute_worker:_update_submission:545 - Submission patch failed with status = 404, and response = 
compute_worker-1  | b'{"detail":"No Submission matches the given query."}'
compute_worker-1  | 2026-02-28 05:06:29.238 | ERROR    | compute_worker:_update_status:561 - Failed to update submission status to Running: Failure updating submission data.
compute_worker-1  | Traceback (most recent call last):
compute_worker-1  | 
compute_worker-1  |   File "/.venv/bin/celery", line 10, in <module>
compute_worker-1  |     sys.exit(main())
compute_worker-1  |     │   │    └ <function main at 0x7e44bcf471f0>
compute_worker-1  |     │   └ <function Worker.__call__.<locals>.exit at 0x7e44bb2dda60>
compute_worker-1  |<module 'sys' (built-in)>
compute_worker-1  |   File "/.venv/lib/python3.9/site-packages/celery/__main__.py", line 15, in main
compute_worker-1  |     sys.exit(_main())
compute_worker-1  |     │   │    └ <function main at 0x7e44bbac69d0>
compute_worker-1  |     │   └ <function Worker.__call__.<locals>.exit at 0x7e44bb2dda60>
compute_worker-1  |<module 'sys' (built-in)>
compute_worker-1  |   File "/.venv/lib/python3.9/site-packages/celery/bin/celery.py", line 213, in main
compute_worker-1  |     return celery(auto_envvar_prefix="CELERY")
compute_worker-1  |<DYMGroup celery>
404 not Found
test-failed-1
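
The 404 responses in these logs are consistent with the worker requesting a submission before the transaction that creates it has committed, which is the case the tasks.py change in the description targets. The sketch below simulates that race with a toy in-memory store. `Database`, `worker_fetch`, and the two `create_submission_*` functions are hypothetical names for illustration. In Django the deferral is typically written with `django.db.transaction.on_commit`; whether the PR uses exactly that call is an assumption.

```python
class Database:
    """Toy store: rows become visible to other processes only after commit()."""

    def __init__(self):
        self.committed = {}
        self._pending = {}
        self._on_commit = []

    def insert(self, pk, row):
        self._pending[pk] = row

    def on_commit(self, callback):
        self._on_commit.append(callback)

    def commit(self):
        # Make pending rows visible first, then run the deferred callbacks.
        self.committed.update(self._pending)
        self._pending.clear()
        callbacks, self._on_commit = self._on_commit, []
        for callback in callbacks:
            callback()


def worker_fetch(db, pk):
    """Stands in for the worker's HTTP request; returns an HTTP-like status code."""
    return 200 if pk in db.committed else 404


def create_submission_eager(db, pk, results):
    db.insert(pk, {"status": "Submitted"})
    results.append(worker_fetch(db, pk))  # task starts before the commit
    db.commit()


def create_submission_deferred(db, pk, results):
    db.insert(pk, {"status": "Submitted"})
    db.on_commit(lambda: results.append(worker_fetch(db, pk)))  # task starts after the commit
    db.commit()
```

With the eager variant the simulated worker sees a 404, while the deferred variant sees a 200, because the row is already visible when the callback runs.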

@Didayolo Didayolo marked this pull request as draft March 3, 2026 12:59
@Didayolo Didayolo added the P1 High priority label Mar 3, 2026

Didayolo commented Mar 4, 2026

CircleCI:

=========================== short test summary info ============================
FAILED src/apps/competitions/tests/test_submissions.py::MultipleTasksPerPhaseTests::test_children_always_created_in_the_same_order - AssertionError: assert 0 == 2
FAILED src/apps/competitions/tests/test_submissions.py::MultipleTasksPerPhaseTests::test_making_submission_creates_parent_sub_and_additional_sub_per_task - AssertionError: assert 0 == 2
FAILED src/apps/competitions/tests/test_submissions.py::MultipleTasksPerPhaseTests::test_making_submission_to_phase_with_one_task_does_not_create_parents_or_children - AssertionError: assert 0 == 1
====== 3 failed, 229 passed, 2 skipped, 106 warnings in 231.68s (0:03:51) ======

But locally it seems to work fine (even with multiple tasks).
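
One possible explanation, not confirmed anywhere in this thread: Django's `TestCase` wraps each test in a transaction that is rolled back, so callbacks registered with `transaction.on_commit` never run unless the test uses `captureOnCommitCallbacks(execute=True)` (available in recent Django versions). If the failing tests count side effects of the deferred task, they would see 0 in the test suite while a real server, which does commit, behaves correctly. The sketch below imitates that mechanism with the standard library only. `RolledBackTransaction`, `capture_on_commit`, and `make_submission` are hypothetical names, not Django or Codabench APIs.

```python
import contextlib


class RolledBackTransaction:
    """Toy transaction that, like a test wrapper, is never committed."""

    def __init__(self):
        self.hooks = []

    def on_commit(self, callback):
        self.hooks.append(callback)  # queued, but nothing ever commits


@contextlib.contextmanager
def capture_on_commit(txn, execute=False):
    """Collect hooks registered inside the block; optionally run them on exit."""
    start = len(txn.hooks)
    captured = []
    yield captured
    captured.extend(txn.hooks[start:])
    if execute:
        for hook in captured:
            hook()


def make_submission(txn, children, n_tasks):
    """Hypothetical logic: child creation is deferred until commit time."""
    for i in range(n_tasks):
        txn.on_commit(lambda i=i: children.append(f"child-{i}"))
```

Calling `make_submission` alone leaves `children` empty, which mirrors an `assert 0 == 2` failure. Wrapping the call in `capture_on_commit(txn, execute=True)` runs the deferred callbacks, and both children appear.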

@Didayolo Didayolo marked this pull request as ready for review March 4, 2026 09:03

Didayolo commented Mar 4, 2026

@ObadaS @ihsaan-ullah This is ready for review.

@Didayolo Didayolo mentioned this pull request Mar 8, 2026

Labels

P1 High priority
