Disable delete optimization and exit ref loop faster #6219
ddanielr wants to merge 5 commits into apache:2.1
Conversation
When delete table is called, the delete marker code checks whether any file references exist in other tables. However, only a single reference needs to exist for delete markers to be created, so a break was added to exit the for loop once a single entry is found. Also fixed log lines, removed a redundant `getLogger` call, and replaced a nested try block with a single try-with-resources.
Adds a property that allows the metadata table scan to be skipped for table deletes. This forces delete markers to always be created when deleting tables, instead of the manager deleting the volumes immediately.
@keith-turner & @ctubbsii I'm also leaning towards just removing this delete marker optimization entirely. I wanted to be a bit more careful about it in 2.1, but if you don't see a reason to keep this code around then I'll push a change to remove the scan and volume-deletion shortcut entirely.
Improve property description Co-authored-by: Dave Marion <dlmarion@apache.org>
I support removing this optimization in general. I do share the concerns about introducing bugs in 2.1. I looked at what tests we have for delete table to see if any verify that the files actually, eventually go away, and I could not find any. I looked at test method names that call TableOperations.delete, and for tests I was not sure about based on the name, I read the test itself. Maybe part of removing this in 2.1 is adding some more tests. We want to ensure the files exist before delete, and that after delete the table's files and dir eventually go away. That would cover the risk that delete table leaves files behind w/ this change. If delete does more than it's supposed to (like messing w/ another table), I suspect existing tests would catch that, as we create/delete a lot in the tests.
Adds tests to verify that rfiles are removed from HDFS and not just from the table metadata.
Added additional tests in 4ed3071
```java
var names = getUniqueNames(2);
final String sourceTable = names[0];
final String cloneTable = names[1];
```
```java
Wait.waitFor(
    () -> getServerContext().getAmple().getGcCandidates(Ample.DataLevel.USER).hasNext(),
    1_000);
getCluster().getClusterControl().start(ServerType.GARBAGE_COLLECTOR);
```
Before starting the GC, could check that the table dir exists again. Then we'd know that after the candidates were seen, the table dir still exists. This gives confirmation that the background op that created the candidates did not do anything w/ DFS.
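A minimal sketch of that re-check, reusing the `fs`, `tableDir`, and cluster-control helpers already in the test:

```java
// Re-check the directory right before starting the GC; if a later check still
// sees it, we know the delete op itself (not the GC) left DFS untouched.
assertTrue(fs.exists(tableDir), "Table dir should still exist before GC starts: " + tableDir);
getCluster().getClusterControl().start(ServerType.GARBAGE_COLLECTOR);
```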
```java
// Create a pre-split table so there will be multiple tablet directories.
client.tableOperations().create(tableName);
client.tableOperations().addSplits(tableName, new java.util.TreeSet<>(
    List.of(new org.apache.hadoop.io.Text("m"), new org.apache.hadoop.io.Text("t"))));
```
Would be better to create splits that align w/ the data being written, like rowXYZ. Then data should land in all splits.
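For example (hypothetical split points, assuming the test writes rows like row0000..row0099):

```java
// Hypothetical: pick split points inside the written row range so every
// tablet receives data, rather than splitting on "m" and "t".
client.tableOperations().addSplits(tableName, new java.util.TreeSet<>(
    List.of(new org.apache.hadoop.io.Text("row0033"), new org.apache.hadoop.io.Text("row0066"))));
```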
```java
final Path tableDir = returnTableHdfsDir(tableId);

assertTrue(fs.exists(tableDir), "Table dir must exist: " + tableDir);
assertTrue(hasRFiles(fs, tableDir), "RFiles must exist before delete: " + tableDir);
```
If writing data to all splits, this check could include a number, like `hasRFiles(fs, tableDir, 3)`.
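One possible shape for such a counting overload (the signature is an assumption, not the PR's actual helper):

```java
// Hypothetical overload: recursively walk the table dir and require at least
// `expected` rfiles (Accumulo data files use the .rf extension).
private static boolean hasRFiles(FileSystem fs, Path dir, int expected) throws IOException {
  int found = 0;
  RemoteIterator<LocatedFileStatus> files = fs.listFiles(dir, true);
  while (files.hasNext()) {
    if (files.next().getPath().getName().endsWith(".rf")) {
      found++;
    }
  }
  return found >= expected;
}
```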
| "Cloned HDFS directory must survive after source table is deleted: " + cloneDir); | ||
| assertTrue(hasRFiles(fs, cloneDir), | ||
| "Cloned RFiles must survive after source table is deleted: " + cloneDir); | ||
|
|
Could scan the clone after deleting the source and before deleting the clone. The scan could just count entries using a stream and ensure it sees the expected number.
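A hedged sketch of that scan-and-count, assuming a hypothetical `expectedEntryCount` for the rows the test wrote:

```java
// Scan the clone after the source is deleted and count entries with a stream,
// verifying the clone's data is still fully readable.
try (var scanner = client.createScanner(cloneTable, Authorizations.EMPTY)) {
  long entries = scanner.stream().count();
  assertEquals(expectedEntryCount, entries, "Clone should still return all entries");
}
```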
```java
boolean findSourceTableGcCandidates = false;
var gcCandidates =
    getCluster().getServerContext().getAmple().getGcCandidates(Ample.DataLevel.USER);
```
Since the GC process is running it may remove the candidates before the test sees them. Could lead to occasional test failures.
May be able to move the compact operation after the check for candidates. Then would not need to stop GC and the candidates should be there until sometime after the compaction.
```java
}

writeAndFlush(client, cloneTable, 100);
client.tableOperations().setProperty(cloneTable, Property.TABLE_FILE_MAX.getKey(), "1");
```
Should not need to set this. Forcing a table compaction will compact down to one file. That prop impacts system compactions.
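i.e., something like this should suffice on its own (a sketch using the public compaction API):

```java
// Force a full table compaction down to a single file; setWait(true) blocks
// until the compaction completes, so no TABLE_FILE_MAX override is needed.
client.tableOperations().compact(cloneTable, new CompactionConfig().setWait(true));
```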
This PR adds a property to skip scanning the metadata table for table deletes.
Delete Marker Creation Optimization
When the manager deletes a table, it performs an optimization step: it creates a batch scanner with 8 threads (not configurable) to scan all the other tables' file references in the metadata table and verify that no file references remain for the given table volume.
The manager then deletes the volumes directly, as opposed to writing delete markers and allowing the GC to handle the tablet file deletions.
This is a nice optimization to have when dealing with a small, static set of tables. However, when table creation is dynamic these scans can cause unnecessary delays and/or hanging fate processes, since all metadata tablets must be scanned in order to process a single table delete.
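In outline, the shortcut being disabled behaves roughly like this (helper names are hypothetical, not the actual manager code):

```java
// Rough shape of the manager's delete shortcut (hypothetical helpers):
long refCount = countFileReferences(context, tableId); // 8-thread scan of all metadata refs
if (refCount == 0) {
  deleteTableVolumesDirectly(context, tableId); // bypass delete markers and the GC
} else {
  createDeleteMarkers(context, tableId);        // let the GC remove the files
}
```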
Unnecessary File Ref Counting
The batch scanner only needs to produce a single shared file ref result (`refCount`) in order to trigger delete marker creation, as the code only checks whether `refCount` is equal to zero. However, the existing code needlessly counts all of the found refs first.
This is unnecessary, so a fast break was added to the batch scanner's iterator loop; see the sketch below.
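A minimal sketch of that early exit (the scanner setup is a hypothetical stand-in for the real metadata-ref scan):

```java
// Exit the loop on the first shared file ref seen; the caller only tests
// refCount == 0, so counting every remaining ref is wasted work.
long refCount = 0;
try (BatchScanner scanner = createRefScanner(context, tableId)) { // hypothetical helper
  for (Map.Entry<Key,Value> ref : scanner) {
    refCount++;
    break; // a single reference is enough to force delete-marker creation
  }
}
```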