Adds monitor endpoint to show manager responsibilities #6309
keith-turner wants to merge 1 commit into apache:main
Conversation
The output of this looks like the following when running 5 managers.

```json
{
  "localhost:10000": [
    "FatePartition[start=FATE:META:00000000-0000-0000-0000-000000000000, end=FATE:META:ffffffff-ffff-ffff-ffff-ffffffffffff]",
    "FatePartition[start=FATE:USER:cccccccc-cccc-ccc0-0000-000000000000, end=FATE:USER:ffffffff-ffff-ffff-ffff-ffffffffffff]",
    "TABLET_MANAGEMENT",
    "BALANCING",
    "CLIENT_RPC",
    "TSERVER_MONITORING",
    "CLUSTER_MAINTENANCE",
    "COMPACTION_COORDINATION"
  ],
  "localhost:10001": [
    "FatePartition[start=FATE:USER:00000000-0000-0000-0000-000000000000, end=FATE:USER:33333333-3333-3330-0000-000000000000]"
  ],
  "localhost:9999": [
    "FatePartition[start=FATE:USER:33333333-3333-3330-0000-000000000000, end=FATE:USER:66666666-6666-6660-0000-000000000000]"
  ],
  "localhost:10002": [
    "FatePartition[start=FATE:USER:99999999-9999-9990-0000-000000000000, end=FATE:USER:cccccccc-cccc-ccc0-0000-000000000000]"
  ],
  "localhost:10003": [
    "FatePartition[start=FATE:USER:66666666-6666-6660-0000-000000000000, end=FATE:USER:99999999-9999-9990-0000-000000000000]"
  ]
}
```
What function does TSERVER_MONITORING equate to? In the Manager view in #6278, I think all of these functions can be deduced from the metrics: for example, the presence of balancer metrics implies balancing, the presence of compaction metrics implies coordination, etc. If we remove those metrics, then we will need something like this. How do you envision displaying the Fate partition information?
As an experiment, I pulled these changes into #6217 and modified them to display info about compaction coordination.

```json
{
  "localhost:10000": [
    "FatePartition[start=FATE:USER:cccccccc-cccc-ccc0-0000-000000000000, end=FATE:USER:ffffffff-ffff-ffff-ffff-ffffffffffff]",
    "FatePartition[start=FATE:META:00000000-0000-0000-0000-000000000000, end=FATE:META:ffffffff-ffff-ffff-ffff-ffffffffffff]",
    "TABLET_MANAGEMENT",
    "BALANCING",
    "CLIENT_RPC",
    "TSERVER_MONITORING",
    "CLUSTER_MAINTENANCE",
    "COMPACTOR_GROUPS:[accumulo]"
  ],
  "localhost:9999": [
    "FatePartition[start=FATE:USER:33333333-3333-3330-0000-000000000000, end=FATE:USER:66666666-6666-6660-0000-000000000000]",
    "COMPACTOR_GROUPS:[ci_lrg]"
  ],
  "localhost:10001": [
    "FatePartition[start=FATE:USER:00000000-0000-0000-0000-000000000000, end=FATE:USER:33333333-3333-3330-0000-000000000000]",
    "COMPACTOR_GROUPS:[ci_small]"
  ],
  "localhost:10002": [
    "FatePartition[start=FATE:USER:99999999-9999-9990-0000-000000000000, end=FATE:USER:cccccccc-cccc-ccc0-0000-000000000000]",
    "COMPACTOR_GROUPS:[default]"
  ],
  "localhost:10003": [
    "FatePartition[start=FATE:USER:66666666-6666-6660-0000-000000000000, end=FATE:USER:99999999-9999-9990-0000-000000000000]"
  ]
}
```
The manager periodically pings all the tservers to get stats, and it will eventually attempt to kill unresponsive tservers. We may remove the custom stats collection in favor of metrics. We may still want to keep this ping functionality, though, to check for tservers that have a lock but cannot be reached via RPC. This may be easy to eventually distribute across the managers: just have each assistant manager hash mod the tservers and only ping the ones that match its ordinal.
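The hash-mod idea above could be sketched roughly like this (a hypothetical illustration, not code from the PR; the class and method names are made up):

```java
import java.util.List;
import java.util.stream.Collectors;

public class TserverPingPartitioner {

  // Hypothetical sketch: each assistant manager pings only the tservers
  // whose address hashes to its ordinal (hash mod managerCount), so every
  // tserver is pinged by exactly one manager.
  static List<String> tserversToPing(List<String> allTservers, int managerOrdinal,
      int managerCount) {
    return allTservers.stream()
        .filter(addr -> Math.floorMod(addr.hashCode(), managerCount) == managerOrdinal)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> tservers = List.of("ts1:9997", "ts2:9997", "ts3:9997", "ts4:9997");
    int managers = 3;
    int total = 0;
    for (int ordinal = 0; ordinal < managers; ordinal++) {
      total += tserversToPing(tservers, ordinal, managers).size();
    }
    // Each tserver lands in exactly one manager's partition.
    System.out.println(total == tservers.size()); // prints "true"
  }
}
```

Since `floorMod` maps each address to exactly one ordinal in `[0, managerCount)`, the partitions are disjoint and cover every tserver, which is what makes the distribution safe.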
I am not completely sure, I kinda like the way the formatted json is displaying it with a list of responsibilities under each manager addr.
We probably could do that for a lot of this info. The current impl pulls everything from ZooCache and will give the most up-to-date and consistent info. For metrics there may be lag: if a compactor group is moved from manager A to manager B, I'm not sure when that would work its way through w/ metrics. Pulling the small amount of data directly from ZooCache will be quick and up to date.
```java
@Path("manager/responsibilities")
@Produces(MediaType.APPLICATION_JSON)
@Description("Returns each manager's responsibilities")
public Map<String,List<String>> getManagerResponsibilities() {
```
Putting this here, instead of in SystemInformation, will hit ZooKeeper every time this endpoint is hit, so every page refresh for every browser showing the Monitor. The model that we have been using so far is to store the information we want to display in the SystemInformation object, which will provide a consistent response for a point in time until the object is refreshed. This would make this endpoint real-time vs point-in-time.
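The point-in-time model described above could be sketched as follows (a hedged illustration only; `SnapshotHolder` and its methods are invented names, not the actual `SystemInformation` API):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch of the point-in-time model: the fetcher (e.g. a
// ZooKeeper/ZooCache read) runs only when refresh() is called on a schedule;
// every endpoint hit in between serves the same cached snapshot, so page
// refreshes in the Monitor never touch ZooKeeper directly.
public class SnapshotHolder<T> {
  private final Supplier<T> fetcher;
  private final AtomicReference<T> snapshot = new AtomicReference<>();

  public SnapshotHolder(Supplier<T> fetcher) {
    this.fetcher = fetcher;
  }

  // Called periodically by a background refresh task, not per request.
  public void refresh() {
    snapshot.set(fetcher.get());
  }

  // Called per endpoint request; returns the last refreshed snapshot.
  public T get() {
    return snapshot.get();
  }

  public static void main(String[] args) {
    SnapshotHolder<String> holder =
        new SnapshotHolder<>(() -> "snapshot@" + System.nanoTime());
    holder.refresh();
    String a = holder.get();
    String b = holder.get();
    // Two requests between refreshes see the identical snapshot.
    System.out.println(a.equals(b)); // prints "true"
  }
}
```

Under this model the endpoint would read a `Map<String,List<String>>` out of the holder, trading real-time accuracy for a consistent, cheap response between refreshes.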
Needed for #6190