WIP initial steps for distributing compaction coordination#6217
WIP initial steps for distributing compaction coordination#6217keith-turner wants to merge 53 commits intoapache:mainfrom
Conversation
There is still only one coordinator, but now the TGW and compactors both could talk to multiple coordinators.
The set of shutting down tservers was causing system fate operations to have to run on the primary manager because this was an in memory set. This caused fate to have different code paths to user vs system fate, this in turn caused problems when trying to distribute compaction coordination. To fix this problem moved the set from an in memory set to a set in zookeeper. The set is managed by fate operations which simplifies the existing code. Only fate operations add and remove from the set and fate keys are used to ensure only one fate operation runs at a time for a tserver instance. The previous in memory set had a lot of code to try to keep it in sync with reality, that is all gone now. There were many bugs with this code in the past. After this change is made fate can be simplified in a follow on commit to remove all specialization for the primary manager. Also the monitor can now directly access this set instead of making an RPC to the manager, will open a follow on issue for this.
After this change meta fate and user fate are both treated mostly the same in the managers. One difference is in assignment, the entire meta fate range is assigned to a single manager. User fate is spread across all managers. But both are assigned out by the primary manager using the same RPCs now. The primary manager used to directly start a meta fate instance. Was able to remove the extension of FateEnv from the manager class in this change, that caused a ripple of test changes. But now there are no longer two different implementations of FateEnv
Before this change a fate client was only available on the primary manager. Now fate clients are avaiable on all managers. The primary manager publishes fate assignment locations in zookeeper. These locations are used by managers to send notifications to other managers when they seed a fate operation.
|
After merging in the changes from #6232 a basic test of compaction coordination spread across multiple managers is now working. Was failing before. Still alot of loose ends and refactoring that is needed, but the basic functionality seems to be working now. |
Simplifies warn logging about inactive queues in the coordinator by using only information from the job queue to do the logging. Sets of compactors and information from the job queue was being used previously. Removes a usage of the running cache in the coordinator. Simplifying this logic to not use the running cache or sets of compactors will be helpful for apache#6217.
This logger was useful in 2.1. However in 4.0 its redundant with functionality in the monitor and is just extra code to maintain. Removing it also removes a usage of the running cache which may be helpful for apache#6217
Compaction and completion and failure were computing stats in the coordinator that needed the resource group and compactor address. This information was obtained from the running cache. Modified the RPCs to pass this information instead. This removes a usage of the running cache in coordinator which will be helpful for apache#6217.
This change is made in support of apache#6217. Successfully ran all ITs that were changed.
Modified code that used this RPC to get the same information directly from compactors instead. Making this change in support of apache#6217
Modified the command that calls this to instead reach out directly to the compactor. This change is made in support of apache#6217, it removes a useage of the running cahce and it simplifies the coordinators RPCs.
This logger was useful in 2.1. However in 4.0 its redundant with functionality in the monitor and is just extra code to maintain. Removing it also removes a usage of the running cache which may be helpful for #6217
Modified the command that calls this to instead reach out directly to the compactor. This change is made in support of #6217, it removes a useage of the running cahce and it simplifies the coordinators RPCs.
Modified code that used this RPC to get the same information directly from compactors instead. Making this change in support of #6217
Compaction job queues had a getAsync method that were not used. This was causing complexity w/ apache#6217, so removing them. Can add them back later if needed.
|
This change is very close to being done and ready for review. The current problem that I am running into is dealing with how the compaction coordinator deals with deleting tmp files from failed compactions. The current process is event driven and directly responds to failed compaction notifications. This will not work well when these notification are now happening across all manager processes. Also w/ the current code if the process dies before processing the notification then the tmp file may never be cleaned up. It may be better to create a new periodic process that compares compaction tmp files in dfs to running compactions in the metadata and fate table. However, concerned about this scanning all of DFS periodically. Was wondering about putting tmp files in a single dir for each table on each volume so they are easier to find. Also, this could be a periodic task the Accumulo GC runs instead of the primary manager. Probably need to address this in a separate PR before this can be completed. |
Removed an in memory set of tables ids in the dead compaction detectors that contained table ids that may have compaction tmp files that needed cleanup. This set would be hard to maintain in multiple managers. Also the set could lose track of tables if the process died. Replaced the in memory set with a set in the metadata table. This set is directly populated by the split and merge fate operations, so there is no chance of losing track of things when a process dies. Also this set is more narrow and allows looking for tmp files to cleanup in single tablets dirs rather than scanning an entire tables dir. Also made a change to the order in which tmp files are deleted for failed compactions. They used to be deleted after the metadata for the compaction was cleaned up, this could lead to losing track of the cleanup if the process died after deleting the metadata but before deleting the tmp file. Now the tmp files are deleted before the metadata entry, so should no longer lose track in process death. This change is needed by apache#6217
Some initial steps for distributing compaction coordinator. Currently only contains the following.
Still need to do the following
Plan to continue experimenting with this.