Description
After a failed join-cluster attempt, the worker node is offline. A second join attempt fails with "This node was already added to the cluster", but the Nodes page does not list the node, so it cannot be removed from the UI.
Steps to reproduce
A temporary name resolution error occurred during the join, and the join-cluster action aborted.
A possible procedure to reproduce it:
- Manually run the add-node action
Expected behavior
After add-node, the node must be visible in the Nodes page so that I can remove it for whatever reason.
Actual behavior
The Nodes page shows an "offline node" banner, but does not allow me to recover from the error.
In the worker node journal:
Mar 13 07:21:52 rl2 agent@cluster[34797]: task/cluster/b75455d4-304f-40c3-a739-5801d8be2aee: join-cluster/00validate_cluster is starting
Mar 13 07:21:53 rl2 agent@cluster[34797]: /usr/local/agent/pyenv/lib64/python3.11/site-packages/urllib3/connectionpool.py:1097: InsecureRequestWarning: Unverified HTTPS request is being made to host 'rl1.dp.neths>
Mar 13 07:21:53 rl2 agent@cluster[34797]: warnings.warn(
Mar 13 07:21:53 rl2 agent@cluster[34797]: task/cluster/b75455d4-304f-40c3-a739-5801d8be2aee: join-cluster/50update is starting
Mar 13 07:21:54 rl2 agent@cluster[34797]: Leader response is successful: the new node ID is node/2!
Mar 13 07:21:54 rl2 agent@cluster[34797]: leader_endpoint error: [Errno -2] Name or service not known DATA {'ip_address': '10.5.4.2', 'leader_core_version': '3.18.0', 'leader_endpoint': 'rl1.dp.nethserver.net:558>
Mar 13 07:21:54 rl2 agent@cluster[34797]: After the issue is solved, remove node 2 before running a new join attempt.
Mar 13 07:21:54 rl2 agent@cluster[34797]: Traceback (most recent call last):
Mar 13 07:21:54 rl2 agent@cluster[34797]: File "/var/lib/nethserver/cluster/actions/join-cluster/50update", line 110, in <module>
Mar 13 07:21:54 rl2 agent@cluster[34797]: socket.getaddrinfo(peer_hostname, peer_port, proto=socket.IPPROTO_UDP)[0][4][0]
Mar 13 07:21:54 rl2 agent@cluster[34797]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 13 07:21:54 rl2 agent@cluster[34797]: File "/usr/lib64/python3.11/socket.py", line 974, in getaddrinfo
Mar 13 07:21:54 rl2 agent@cluster[34797]: for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Mar 13 07:21:54 rl2 agent@cluster[34797]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 13 07:21:54 rl2 agent@cluster[34797]: socket.gaierror: [Errno -2] Name or service not known
Mar 13 07:21:54 rl2 agent@cluster[34797]: task/cluster/b75455d4-304f-40c3-a739-5801d8be2aee: action "join-cluster" status is "aborted" (1) at step 50update
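The traceback shows the abort coming from a single `socket.getaddrinfo()` call that raises `socket.gaierror`. As a rough illustration only, and not the actual code of the `50update` step, a transient resolver failure could be retried a few times before giving up. The function name `resolve_peer_address` and its `attempts`, `delay`, and `resolver` parameters are hypothetical and were chosen for this sketch.

```python
import socket
import time


def resolve_peer_address(hostname, port, attempts=3, delay=1.0,
                         resolver=socket.getaddrinfo):
    """Resolve hostname to an address, retrying on resolver errors.

    Hypothetical helper: name and parameters are assumptions made for
    this sketch, they are not taken from the join-cluster action.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Same lookup shape as in the traceback: take the first
            # result, then the host part of its sockaddr tuple.
            return resolver(hostname, port, proto=socket.IPPROTO_UDP)[0][4][0]
        except socket.gaierror as exc:
            last_error = exc
            if attempt < attempts:
                # Wait before the next attempt; a temporary DNS outage
                # may clear in the meantime.
                time.sleep(delay)
    # Every attempt failed: re-raise the last resolver error.
    raise last_error
```

A retry only narrows the window for this failure. If the lookup still fails after the leader has already assigned a node ID, the node ends up in the same half-joined state described above, so the missing entry in the Nodes page remains the core problem.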
Components
- Core 3.18.0