HDDS-10306. Speed up TestSnapshotBackgroundServices#9721
HDDS-10306. Speed up TestSnapshotBackgroundServices#9721hevinhsu wants to merge 9 commits intoapache:masterfrom
Conversation
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks @hevinhsu for the patch.
...n-test/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotBackgroundServices.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
This PR speeds up the TestSnapshotBackgroundServices integration test by reusing a single shared MiniOzoneHACluster across test cases and adding a helper API to restart OMs with updated configuration, avoiding full cluster re-creation.
Changes:
- Refactored
TestSnapshotBackgroundServicesto use a per-class shared HA cluster and per-test recovery/setup. - Added
MiniOzoneHAClusterImpl#getInactiveOM()and a helper to restart OMs with a config customizer. - Adjusted OM restart behavior to ensure an inactive OM is re-activated before restart.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
hadoop-ozone/mini-cluster/src/main/java/org/apache/hadoop/ozone/MiniOzoneHAClusterImpl.java |
Adds inactive OM access and a restart-with-config helper to support faster test reconfiguration. |
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotBackgroundServices.java |
Reworks the test lifecycle to reuse the HA cluster and selectively restart/recover OMs between test cases. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for (OzoneManager om : toRestart) { | ||
| if (!om.stop()) { | ||
| continue; | ||
| } | ||
| om.join(); | ||
| om.restart(); | ||
| GenericTestUtils.waitFor(om::isRunning, 1000, 30000); | ||
| } |
There was a problem hiding this comment.
restartOzoneManagersWithConfigCustomizer silently skips restarting an OM when om.stop() returns false (even though the OM was in the toRestart list). This can mask a failed stop and leave the cluster partially restarted with a mix of old/new config. Consider failing fast (throw) or at least logging and still attempting a restart / waiting for a consistent state.
| @BeforeEach | ||
| public void setupTest() throws IOException, InterruptedException, TimeoutException { | ||
| recoverCluster(); | ||
| stopFollowerOM(cluster.getOMLeader()); |
There was a problem hiding this comment.
setupTest passes cluster.getOMLeader() into stopFollowerOM(...), but getOMLeader() can legally return null transiently (eg during leader transitions) and would cause the helper to stop an arbitrary OM. Prefer using the result of cluster.waitForLeaderOM() (or assert non-null) to avoid test flakiness.
| stopFollowerOM(cluster.getOMLeader()); | |
| OzoneManager leader = cluster.waitForLeaderOM(); | |
| stopFollowerOM(leader); |
...n-test/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotBackgroundServices.java
Show resolved
Hide resolved
| assertNotNull(files); | ||
| int numberOfSstFiles = files.length; | ||
|
|
||
| assertEquals(cluster.getOMLeader(), newLeaderOM); |
There was a problem hiding this comment.
assertEquals(cluster.getOMLeader(), newLeaderOM) can be flaky because getOMLeader() explicitly returns null when no leader is ready or when multiple leaders appear ready. If the intent is to verify leadership, use cluster.waitForLeaderOM() (and compare), or assert newLeaderOM.isLeaderReady().
| assertEquals(cluster.getOMLeader(), newLeaderOM); | |
| assertEquals(cluster.waitForLeaderOM(), newLeaderOM); |
| } | ||
| om.join(); | ||
| om.restart(); | ||
| GenericTestUtils.waitFor(om::isRunning, 1000, 30000); |
There was a problem hiding this comment.
Access of element annotated with VisibleForTesting found in production code.
| GenericTestUtils.waitFor(om::isRunning, 1000, 30000); | |
| long startTime = System.currentTimeMillis(); | |
| while (!om.isRunning()) { | |
| if (System.currentTimeMillis() - startTime > 30000) { | |
| throw new TimeoutException("Timed out waiting for OzoneManager to start."); | |
| } | |
| Thread.sleep(1000); | |
| } |
What changes were proposed in this pull request?
Speed up
TestSnapshotBackgroundServicesby reusing a sharedMiniOzoneHAClusterinstead of creating a new cluster for each test case.Please describe your PR in detail:
TestSnapshotBackgroundServicesto use a sharedMiniOzoneHAClusterto reduce repeated cluster initialization cost.What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-10306
How was this patch tested?
https://github.com/hevinhsu/ozone/actions/runs/21741639312
Additionally, the following test was executed locally to verify the improvement:
before:
after: