Nina autotune timeout #38

Merged
NinaCai merged 3 commits into main from nina-autotune-timeout on May 12, 2026
Conversation

Collaborator

@NinaCai NinaCai commented May 11, 2026

Change the autotune timeout to 1.5 hours.
Change the per-kernel-run timeout from 30s to 300s.

@NinaCai NinaCai requested a review from shangkunwang01 May 11, 2026 21:46

google-cla Bot commented May 11, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Comment thread MaxKernel/hitl_agent/server_utils/eval_server.py Outdated
  code_template: str
  search_space: dict[str, list]
- timeout: Optional[int] = 30
+ timeout: Optional[int] = 300
Collaborator

It is possible that the TPU server is still running when the eval server times out, if there are many combinations in your grid search. Should we make this timeout dynamic (= autotune_time_out / #combinations)?
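For reference, the dynamic timeout proposed here can be sketched as below. This is only an illustration of the formula in the comment, not code from eval_server.py: the helper names `num_combinations` and `dynamic_timeout`, the constant `AUTOTUNE_TIMEOUT_S`, and the example parameter names are all hypothetical, and the grid is assumed to be the full Cartesian product of `search_space`.

```python
import math

AUTOTUNE_TIMEOUT_S = 5400  # 1.5 hours, the overall budget named in the PR description


def num_combinations(search_space: dict[str, list]) -> int:
    """Size of the full grid: product of the number of options per parameter."""
    return math.prod(len(values) for values in search_space.values())


def dynamic_timeout(search_space: dict[str, list],
                    total_timeout_s: float = AUTOTUNE_TIMEOUT_S) -> float:
    """Per-combination timeout = total budget / number of combinations."""
    n = num_combinations(search_space)
    if n == 0:
        raise ValueError("search space has no combinations")
    return total_timeout_s / n


# Hypothetical search space: 3 * 2 * 3 = 18 combinations.
example_space = {
    "block_m": [64, 128, 256],
    "block_n": [64, 128],
    "num_stages": [2, 3, 4],
}
per_run = dynamic_timeout(example_space)  # 5400 / 18 = 300.0 seconds
```

With 18 combinations the dynamic value happens to equal the fixed 300s default; larger grids would get proportionally less time per run.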

Collaborator Author

I think the logic should be: when eval_server times out, just kill all processes in the grid search. This timeout should be roughly how long a single kernel run takes, and it doesn't need to be tied to the total timeout in eval_server.
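One way the behavior described here could look is sketched below, under stated assumptions: runs execute one at a time as subprocesses on a POSIX system, and `run_grid` and the example commands are hypothetical, not the eval_server or tpu_server implementation. Each run gets the fixed per-run timeout, and an independent overall deadline stops the remaining work.

```python
import os
import signal
import subprocess
import sys
import time


def run_grid(commands, per_run_timeout_s, total_timeout_s):
    """Run commands sequentially with a per-run timeout and a total deadline."""
    deadline = time.monotonic() + total_timeout_s
    results = []
    for cmd in commands:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            results.append("skipped")  # overall budget exhausted
            continue
        # start_new_session=True puts the child in its own process group,
        # so any helpers it spawns can be killed together with it.
        proc = subprocess.Popen(cmd, start_new_session=True)
        try:
            proc.wait(timeout=min(per_run_timeout_s, remaining))
            results.append("ok" if proc.returncode == 0 else "failed")
        except subprocess.TimeoutExpired:
            try:
                os.killpg(proc.pid, signal.SIGKILL)  # kill the whole group
            except ProcessLookupError:
                pass  # the group already exited
            proc.wait()  # reap the child
            results.append("timeout")
    return results


commands = [
    [sys.executable, "-c", "print('fast kernel')"],
    [sys.executable, "-c", "import time; time.sleep(30)"],  # stands in for a hung run
]
results = run_grid(commands, per_run_timeout_s=2, total_timeout_s=10)
print(results)
```

In this sketch the per-run timeout and the total deadline are independent knobs, which matches the point that the 300s value only needs to reflect how long one kernel run takes.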

Collaborator

@shangkunwang01 shangkunwang01 May 12, 2026

That would be best, but I doubt the current implementation can achieve this.
When eval_server times out, the tpu_server does not automatically shut down the grid search. That's why I want each grid-search run's timeout to be at most autotune_time_out / #combinations.

Collaborator Author

Right, this will be added in the next PR. Subprocess hanging is a bug that affects all eval types, not just autotune. A dynamic timeout would be worse when there are too many combinations and each process gets a very short timeout: none of those processes could then actually produce a result.
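The concern about large grids can be made concrete with a small calculation. The 60-second minimum kernel runtime below is a made-up figure for illustration only, and `can_finish` is a hypothetical helper; only the 1.5-hour total comes from this PR.

```python
AUTOTUNE_TIMEOUT_S = 5400        # 1.5 hours, from the PR description
ASSUMED_MIN_RUNTIME_S = 60       # hypothetical: shortest time any kernel run needs


def can_finish(n_combinations: int,
               min_runtime_s: float = ASSUMED_MIN_RUNTIME_S) -> bool:
    """True if an equal share of the total budget covers one kernel run."""
    per_run_timeout_s = AUTOTUNE_TIMEOUT_S / n_combinations
    return per_run_timeout_s >= min_runtime_s


small_grid_ok = can_finish(18)    # 5400 / 18 = 300s per run, enough
large_grid_ok = can_finish(100)   # 5400 / 100 = 54s per run, too short
```

Under that assumption, a 100-combination grid would time out every run, whereas a fixed per-run timeout lets the earliest runs complete and report results even if later ones are cut off by the overall limit.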

@NinaCai NinaCai requested a review from shangkunwang01 May 11, 2026 23:09
@shangkunwang01 shangkunwang01 dismissed their stale review May 12, 2026 01:33

No change needs to be made.

@NinaCai NinaCai merged commit 5b25e00 into main May 12, 2026
2 checks passed