Offline node not listed and not removable #7918

@DavidePrincipi

Description

After a failed join-cluster attempt the worker node is offline. A second join attempt fails with "This node was already added to the cluster", but the Nodes page does not list the node and it cannot be removed from the UI.

Steps to reproduce

A temporary name resolution error occurred during the join, and the join-cluster action aborted.

This could be a procedure to reproduce it:

  • Manually run the add-node action

Expected behavior

After add-node, the node must be visible in the Nodes page, so that I can remove it for whatever reason.

Actual behavior

The Nodes page shows an "offline node" banner, but does not allow me to recover from the error.

In worker node journal:

Mar 13 07:21:52 rl2 agent@cluster[34797]: task/cluster/b75455d4-304f-40c3-a739-5801d8be2aee: join-cluster/00validate_cluster is starting
Mar 13 07:21:53 rl2 agent@cluster[34797]: /usr/local/agent/pyenv/lib64/python3.11/site-packages/urllib3/connectionpool.py:1097: InsecureRequestWarning: Unverified HTTPS request is being made to host 'rl1.dp.neths>
Mar 13 07:21:53 rl2 agent@cluster[34797]:   warnings.warn(
Mar 13 07:21:53 rl2 agent@cluster[34797]: task/cluster/b75455d4-304f-40c3-a739-5801d8be2aee: join-cluster/50update is starting
Mar 13 07:21:54 rl2 agent@cluster[34797]: Leader response is successful: the new node ID is node/2!
Mar 13 07:21:54 rl2 agent@cluster[34797]: leader_endpoint error: [Errno -2] Name or service not known DATA {'ip_address': '10.5.4.2', 'leader_core_version': '3.18.0', 'leader_endpoint': 'rl1.dp.nethserver.net:558>
Mar 13 07:21:54 rl2 agent@cluster[34797]: After the issue is solved, remove node 2 before running a new join attempt.
Mar 13 07:21:54 rl2 agent@cluster[34797]: Traceback (most recent call last):
Mar 13 07:21:54 rl2 agent@cluster[34797]:   File "/var/lib/nethserver/cluster/actions/join-cluster/50update", line 110, in <module>
Mar 13 07:21:54 rl2 agent@cluster[34797]:     socket.getaddrinfo(peer_hostname, peer_port, proto=socket.IPPROTO_UDP)[0][4][0]
Mar 13 07:21:54 rl2 agent@cluster[34797]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 13 07:21:54 rl2 agent@cluster[34797]:   File "/usr/lib64/python3.11/socket.py", line 974, in getaddrinfo
Mar 13 07:21:54 rl2 agent@cluster[34797]:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Mar 13 07:21:54 rl2 agent@cluster[34797]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 13 07:21:54 rl2 agent@cluster[34797]: socket.gaierror: [Errno -2] Name or service not known
Mar 13 07:21:54 rl2 agent@cluster[34797]: task/cluster/b75455d4-304f-40c3-a739-5801d8be2aee: action "join-cluster" status is "aborted" (1) at step 50update
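
The traceback shows the abort coming from an unguarded `socket.getaddrinfo` call in `join-cluster/50update`, after the leader had already assigned node ID 2. As a rough sketch only, and not the actual fix, a temporary resolution failure could be retried a few times before giving up. The function name `resolve_peer_address` and the `attempts`, `delay`, `resolver` and `sleep` parameters are hypothetical; only the `getaddrinfo` expression is taken from the traceback.

```python
import socket
import time


def resolve_peer_address(peer_hostname, peer_port, attempts=3, delay=2.0,
                         resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve peer_hostname to an IP address, retrying on resolver errors.

    Hypothetical helper: name, retry count and delay are assumptions,
    not part of the existing 50update step.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Same lookup as in the traceback: first result, sockaddr tuple, host field.
            return resolver(peer_hostname, peer_port, proto=socket.IPPROTO_UDP)[0][4][0]
        except socket.gaierror as ex:
            last_error = ex
            if attempt < attempts:
                # Wait before the next attempt; the error may be temporary.
                sleep(delay)
    # All attempts failed: re-raise the last resolver error to the caller.
    raise last_error
```

A retry like this would only make the join less likely to abort on a temporary DNS error. It does not address the problem reported here, which is that the half-added node cannot be seen or removed from the Nodes page.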

Components

  • Core 3.18.0
