CASSANDRA-21174: Prevent log-incompatible upgrades#4614
CASSANDRA-21174: Prevent log-incompatible upgrades#4614aparna0522 wants to merge 2 commits intoapache:trunkfrom
Conversation
fc7e48d to
b875077
Compare
krummas
left a comment
There was a problem hiding this comment.
code looks good, but tests need a bit of work
the new tests can be unit tests instead, and we should have a test which tries to actually commit a register/startup with a lower version.
| * WARNING: Only use this if you are certain the cluster metadata log contains no entries | ||
| * that require a newer serialization version. Misuse can lead to unreadable metadata. | ||
| */ | ||
| UNSAFE_SKIP_SERIALIZATION_VERSION_CHECK("cassandra.unsafe.skip_serialization_version_check", "false"), |
There was a problem hiding this comment.
skip_metadata_serialization_version_check maybe?
| try | ||
| { | ||
| // Register a ghost node with V0 to fake-force V0 serialization. In a real world cluster we will always be upgrading from a smaller version. | ||
| // Unregister to make directory empty |
There was a problem hiding this comment.
instead of unregistering we should use the new CassandraRevelantProperties.UNSAFE_SKIP_SERIALIZATION_VERSION_CHECK to be able to register the old-versioned instance
There was a problem hiding this comment.
Since we decided to remove this property, is it safe to disregard this?
I am curious to know if you have thoughts to do this in some other way other than unregistering?
There was a problem hiding this comment.
This test isn't actually passing after these updates, for a slightly convoluted reason.
It is failing because it attempts to register a new node with an IP address which was previously used by another node and this is currently incompatible with some C* configuration (specifically Accord).
The first node in the cluster is added when the test starts and has the broadcast_address: 127.0.0.1. We then unregister this to force the directory into an empty state, which is new in this patch - previous versions of the test did not do this, instead they resorted to some other hackery to achieve the same aim of lowering the min common serialization version. At this point in the test, we commit a Register which uses the same endpoint address (127.0.0.1 via NodeAddresses.current()) and run into the problem with Accord: CASSANDRA-21026 Reusing the address of a removed node is not possible with Accord enabled. This leads to uncaught exceptions being thrown which automatically trigger the test to fail.
Fortunately, there is a choice of simple solutions to this
- simply use a different address for the registration of the
upgradednode - disable Accord via node config when creating the cluster. See
o.a.c.distributed.test.ring.DecommissionTest::testAddressReuseAfterDecommission, which faces the same issue.
Personally, I would lean toward the former approach as when CASSANDRA-21026 is resolved the reason for disabling in this test won't be obvious.
This is all due to pretty obscure implementation details, so I wouldn't expect any of this to be at all obvious but the modified test is consistently failing, which should prompt further investigation.
| @Test | ||
| public void testRegisterRejectsLowerSerializationVersion() throws Throwable | ||
| { | ||
| try (Cluster cluster = builder().withNodes(1).createWithoutStarting()) |
There was a problem hiding this comment.
It looks like these new tests could be unit tests instead, see PrepareJoinTest for example
0b4fbcf to
0e6ea5f
Compare
0e6ea5f to
b651a23
Compare
| try | ||
| { | ||
| // Register a ghost node with V0 to fake-force V0 serialization. In a real world cluster we will always be upgrading from a smaller version. | ||
| // Unregister to make directory empty |
There was a problem hiding this comment.
This test isn't actually passing after these updates, for a slightly convoluted reason.
It is failing because it attempts to register a new node with an IP address which was previously used by another node and this is currently incompatible with some C* configuration (specifically Accord).
The first node in the cluster is added when the test starts and has the broadcast_address: 127.0.0.1. We then unregister this to force the directory into an empty state, which is new in this patch - previous versions of the test did not do this, instead they resorted to some other hackery to achieve the same aim of lowering the min common serialization version. At this point in the test, we commit a Register which uses the same endpoint address (127.0.0.1 via NodeAddresses.current()) and run into the problem with Accord: CASSANDRA-21026 Reusing the address of a removed node is not possible with Accord enabled. This leads to uncaught exceptions being thrown which automatically trigger the test to fail.
Fortunately, there is a choice of simple solutions to this
- simply use a different address for the registration of the
upgradednode - disable Accord via node config when creating the cluster. See
o.a.c.distributed.test.ring.DecommissionTest::testAddressReuseAfterDecommission, which faces the same issue.
Personally, I would lean toward the former approach as when CASSANDRA-21026 is resolved the reason for disabling in this test won't be obvious.
This is all due to pretty obscure implementation details, so I wouldn't expect any of this to be at all obvious but the modified test is consistently failing, which should prompt further investigation.
Thanks for sending a pull request! Here are some tips if you're new here:
Commit messages should follow the following format:
The Cassandra Jira