bug: cloudtool websocket connect times out on startup #309

@flexus-teams

Description

Original Logs

20260413 04:38:40.568 ctool [INFO] run_cloudtool_service_real going down!
20260413 04:38:40.568 ctool [ERROR] 🛑 caught exception TimeoutError:
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/gql/transport/common/adapters/websockets.py", line 71, in connect
    self.websocket = await websockets.connect(self.url, **connect_args)
  File "/usr/local/lib/python3.13/site-packages/websockets/asyncio/client.py", line 470, in create_connection
    _, connection = await loop.create_connection(factory, **kwargs)
  File "/usr/local/lib/python3.13/asyncio/base_events.py", line 1146, in create_connection
    sock = await self._connect_sock(
  File "/usr/local/lib/python3.13/asyncio/selector_events.py", line 645, in sock_connect
    return await fut
TimeoutError

Error Summary

Multiple cloudtool-related pods reported the same websocket connect timeout. Affected services included cloudtool web, original, eds-setup, and remote-mcp-worker. The pods were otherwise in the Running state, and the backend reports only scheduling pressure and delayed startup, not an ongoing crash loop.

Stacktrace

/usr/local/lib/python3.13/site-packages/gql/transport/common/adapters/websockets.py:71 connect

Root Cause

  • File: flexus_client_kit/ckit_cloudtool.py:469-470
  • Function: run_cloudtool_service_real
  • Why: the service opens a websocket subscription to the backend and, under startup pressure, the connection attempt times out inside websockets.connect(...). The traceback ends in asyncio's sock_connect, so the timeout occurs while the TCP connection is still being established, before any websocket handshake. The surrounding code catches the failure at the top-level service loop and retries, so this is an operational connectivity/startup issue rather than an unhandled code crash.
  • Git blame: Oleg Klimov in acffd604 / 1c9b39b8
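
The issue does not show how the top-level loop spaces its retries. As a rough illustration only, a retry with exponential backoff and jitter around the connect step could look like the sketch below. The names `connect_with_backoff`, `connect`, `attempts`, `base_delay`, `max_delay`, and `sleep` are hypothetical and do not come from flexus_client_kit.

```python
import asyncio
import random


async def connect_with_backoff(connect, *, attempts=5, base_delay=1.0,
                               max_delay=30.0, sleep=asyncio.sleep):
    """Await `connect()` until it succeeds or `attempts` is exhausted.

    `connect` is a hypothetical zero-argument coroutine function that
    opens the websocket; it is an assumption, not the project's API.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await connect()
        except (TimeoutError, asyncio.TimeoutError, OSError):
            if attempt == attempts:
                # Out of attempts: let the caller's top-level loop see the error.
                raise
            # Exponential backoff capped at max_delay: 1s, 2s, 4s, ...
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter keeps pods that start together from retrying in lockstep.
            delay *= random.uniform(0.5, 1.0)
            await sleep(delay)
```

Since several pods hit the timeout during the same startup window, the jitter is the part most likely to matter here: it spreads reconnect attempts out instead of sending them to the backend all at once.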

Code Snippet

async with ws_client as ws:
    async for r in ws.subscribe(gql.gql(...)):
        ...

Affected

  • Pods: fservice-cloudtool-web, fservice-cloudtool-original, fservice-cloudtool-eds-setup, fservice-remote-mcp-worker
  • Namespace: flexus
  • Occurrences: multiple startup retries
