Skip to content

Comments

CASSANDRA-21174: Prevent log-incompatible upgrades#4614

Open
aparna0522 wants to merge 2 commits intoapache:trunkfrom
aparna0522:anaik-163631557
Open

CASSANDRA-21174: Prevent log-incompatible upgrades#4614
aparna0522 wants to merge 2 commits intoapache:trunkfrom
aparna0522:anaik-163631557

Conversation

@aparna0522
Copy link
Contributor

Thanks for sending a pull request! Here are some tips if you're new here:

  • Ensure you have added or run the appropriate tests for your PR.
  • Be sure to keep the PR description updated to reflect all changes.
  • Write your PR title to summarize what this PR proposes.
  • If possible, provide a concise example to reproduce the issue for a faster review.
  • Read our contributor guidelines
  • If you're making a documentation change, see our guide to documentation contribution

Commit messages should follow the following format:

<One sentence description, usually Jira title or CHANGES.txt summary>

<Optional lengthier description (context on patch)>

patch by <Authors>; reviewed by <Reviewers> for CASSANDRA-#####

Co-authored-by: Name1 <email1>
Co-authored-by: Name2 <email2>

The Cassandra Jira

@aparna0522 aparna0522 marked this pull request as ready for review February 13, 2026 20:47
Copy link
Member

@krummas krummas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks good, but tests need a bit of work

the new tests can be unit tests instead, and we should have a test which tries to actually commit a register/startup with a lower version.

* WARNING: Only use this if you are certain the cluster metadata log contains no entries
* that require a newer serialization version. Misuse can lead to unreadable metadata.
*/
UNSAFE_SKIP_SERIALIZATION_VERSION_CHECK("cassandra.unsafe.skip_serialization_version_check", "false"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skip_metadata_serialization_version_check maybe?

try
{
// Register a ghost node with V0 to fake-force V0 serialization. In a real world cluster we will always be upgrading from a smaller version.
// Unregister to make directory empty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of unregistering we should use the new CassandraRevelantProperties.UNSAFE_SKIP_SERIALIZATION_VERSION_CHECK to be able to register the old-versioned instance

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we decided to remove this property, is it safe to disregard this?
I am curious to know if you have thoughts to do this in some other way other than unregistering?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test isn't actually passing after these updates, for a slightly convoluted reason.
It is failing because it attempts to register a new node with an IP address which was previously used by another node and this is currently incompatible with some C* configuration (specifically Accord).

The first node in the cluster is added when the test starts and has the broadcast_address: 127.0.0.1. We then unregister this to force the directory into an empty state, which is new in this patch - previous versions of the test did not do this, instead they resorted to some other hackery to achieve the same aim of lowering the min common serialization version. At this point in the test, we commit a Register which uses the same endpoint address (127.0.0.1 via NodeAddresses.current()) and run into the problem with Accord: CASSANDRA-21026 Reusing the address of a removed node is not possible with Accord enabled. This leads to uncaught exceptions being thrown which automatically trigger the test to fail.

Fortunately, there is a choice of simple solutions to this

  • simply use a different address for the registration of the upgraded node
  • disable Accord via node config when creating the cluster. See o.a.c.distributed.test.ring.DecommissionTest::testAddressReuseAfterDecommission, which faces the same issue.

Personally, I would lean toward the former approach as when CASSANDRA-21026 is resolved the reason for disabling in this test won't be obvious.

This is all due to pretty obscure implementation details, so I wouldn't expect any of this to be at all obvious but the modified test is consistently failing, which should prompt further investigation.

@Test
public void testRegisterRejectsLowerSerializationVersion() throws Throwable
{
try (Cluster cluster = builder().withNodes(1).createWithoutStarting())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like these new tests could be unit tests instead, see PrepareJoinTest for example

@aparna0522 aparna0522 force-pushed the anaik-163631557 branch 2 times, most recently from 0b4fbcf to 0e6ea5f Compare February 19, 2026 23:15
try
{
// Register a ghost node with V0 to fake-force V0 serialization. In a real world cluster we will always be upgrading from a smaller version.
// Unregister to make directory empty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test isn't actually passing after these updates, for a slightly convoluted reason.
It is failing because it attempts to register a new node with an IP address which was previously used by another node and this is currently incompatible with some C* configuration (specifically Accord).

The first node in the cluster is added when the test starts and has the broadcast_address: 127.0.0.1. We then unregister this to force the directory into an empty state, which is new in this patch - previous versions of the test did not do this, instead they resorted to some other hackery to achieve the same aim of lowering the min common serialization version. At this point in the test, we commit a Register which uses the same endpoint address (127.0.0.1 via NodeAddresses.current()) and run into the problem with Accord: CASSANDRA-21026 Reusing the address of a removed node is not possible with Accord enabled. This leads to uncaught exceptions being thrown which automatically trigger the test to fail.

Fortunately, there is a choice of simple solutions to this

  • simply use a different address for the registration of the upgraded node
  • disable Accord via node config when creating the cluster. See o.a.c.distributed.test.ring.DecommissionTest::testAddressReuseAfterDecommission, which faces the same issue.

Personally, I would lean toward the former approach as when CASSANDRA-21026 is resolved the reason for disabling in this test won't be obvious.

This is all due to pretty obscure implementation details, so I wouldn't expect any of this to be at all obvious but the modified test is consistently failing, which should prompt further investigation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants