Is your feature request related to a problem? Please describe
[2024-05-18T18:44:03,287][DEBUG][o.o.a.a.c.s.TransportClusterUpdateSettingsAction] [b869f183befc74cff9f3b5572821ec21] #[org.opensearch.cluster.metadata.ProcessClusterEventTimeoutException]#failed to perform [cluster_update_settings]
ProcessClusterEventTimeoutException[failed to process cluster event (cluster_update_settings) within 1m]
at org.opensearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:200)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
at org.opensearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:199)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:863)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Notice the time spent in the loop
99.1% (9.9s out of 10s) cpu usage by thread 'opensearch[b869f183befc74cff9f3b5572821ec21][clusterManagerService#updateTask][T#1]'
9/10 snapshots sharing following 20 elements
app//org.opensearch.cluster.routing.allocation.decider.AllocationDeciders.canAllocate(AllocationDeciders.java:94)
app//org.opensearch.cluster.routing.allocation.allocator.LocalShardsBalancer.decideAllocateUnassigned(LocalShardsBalancer.java:927)
app//org.opensearch.cluster.routing.allocation.allocator.LocalShardsBalancer.allocateUnassigned(LocalShardsBalancer.java:813)
app//org.opensearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:288)
app//org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:557)
app//org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:501)
app//org.opensearch.action.admin.cluster.reroute.TransportClusterRerouteAction$ClusterRerouteResponseAckedClusterStateUpdateTask.execute(TransportClusterRerouteAction.java:269)
app//org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67)
app//org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:882)
app//org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:434)
app//org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:301)
app//org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:212)
app//org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:209)
app//org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:247)
app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:863)
app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283)
app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246)
java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
java.base@17.0.9/java.lang.Thread.run(Thread.java:840)
unique snapshot
java.base@17.0.9/java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1051)
java.base@17.0.9/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1050)
app//org.opensearch.cluster.routing.allocation.decider.AllocationDeciders.canAllocate(AllocationDeciders.java:93)
app//org.opensearch.cluster.routing.allocation.allocator.LocalShardsBalancer.decideAllocateUnassigned(LocalShardsBalancer.java:927)
app//org.opensearch.cluster.routing.allocation.allocator.LocalShardsBalancer.allocateUnassigned(LocalShardsBalancer.java:813)
app//org.opensearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:288)
app//org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:557)
app//org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:501)
app//org.opensearch.action.admin.cluster.reroute.TransportClusterRerouteAction$ClusterRerouteResponseAckedClusterStateUpdateTask.execute(TransportClusterRerouteAction.java:269)
app//org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67)
app//org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:882)
app//org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:434)
app//org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:301)
app//org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:212)
app//org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:209)
app//org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:247)
app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:863)
app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283)
app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246)
java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
java.base@17.0.9/java.lang.Thread.run(Thread.java:840)
Individual thread dumps
"opensearch[b869f183befc74cff9f3b5572821ec21][clusterManagerService#updateTask][T#1]" #3427 daemon prio=5 os_prio=0 cpu=153387532.42ms elapsed=184422.36s tid=0x0000fffee40040e0 nid=0x1152 runnable [0x0000fffdf1ba1000]
java.lang.Thread.State: RUNNABLE
at org.opensearch.common.settings.Setting.get(Setting.java:506)
at org.opensearch.common.settings.Setting.get(Setting.java:477)
at org.opensearch.common.settings.Setting.get(Setting.java:623)
at org.opensearch.cluster.routing.allocation.decider.ShardsLimitAllocationDecider.doDecide(ShardsLimitAllocationDecider.java:129)
at org.opensearch.cluster.routing.allocation.decider.ShardsLimitAllocationDecider.canAllocate(ShardsLimitAllocationDecider.java:113)
at org.opensearch.cluster.routing.allocation.decider.AllocationDeciders.canAllocate(AllocationDeciders.java:94)
at org.opensearch.cluster.routing.allocation.allocator.LocalShardsBalancer.decideAllocateUnassigned(LocalShardsBalancer.java:927)
at org.opensearch.cluster.routing.allocation.allocator.LocalShardsBalancer.allocateUnassigned(LocalShardsBalancer.java:813)
at org.opensearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:288)
at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:557)
at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:501)
at org.opensearch.action.admin.cluster.reroute.TransportClusterRerouteAction$ClusterRerouteResponseAckedClusterStateUpdateTask.execute(TransportClusterRerouteAction.java:269)
at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67)
at org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:882)
at org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:434)
at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:301)
at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:212)
at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:209)
at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:247)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:863)
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283)
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.9/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.9/ThreadPoolExecutor.java:635)
at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
"clusterManagerService#updateTask"
"opensearch[b869f183befc74cff9f3b5572821ec21][clusterManagerService#updateTask][T#1]" #3427 daemon prio=5 os_prio=0 cpu=153454163.28ms elapsed=184496.35s tid=0x0000fffee40040e0 nid=0x1152 runnable [0x0000fffdf1ba1000]
java.lang.Thread.State: RUNNABLE
at java.util.Spliterator.getExactSizeIfKnown(java.base@17.0.9/Spliterator.java:414)
at java.util.stream.AbstractPipeline.copyIntoWithCancel(java.base@17.0.9/AbstractPipeline.java:526)
at java.util.stream.AbstractPipeline.copyInto(java.base@17.0.9/AbstractPipeline.java:513)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(java.base@17.0.9/AbstractPipeline.java:499)
at java.util.stream.FindOps$FindOp.evaluateSequential(java.base@17.0.9/FindOps.java:150)
at java.util.stream.AbstractPipeline.evaluate(java.base@17.0.9/AbstractPipeline.java:234)
at java.util.stream.IntPipeline.findFirst(java.base@17.0.9/IntPipeline.java:552)
at java.text.DecimalFormatSymbols.findNonFormatChar(java.base@17.0.9/DecimalFormatSymbols.java:844)
at java.text.DecimalFormatSymbols.initialize(java.base@17.0.9/DecimalFormatSymbols.java:815)
at java.text.DecimalFormatSymbols.<init>(java.base@17.0.9/DecimalFormatSymbols.java:115)
at sun.util.locale.provider.DecimalFormatSymbolsProviderImpl.getInstance(java.base@17.0.9/DecimalFormatSymbolsProviderImpl.java:85)
at java.text.DecimalFormatSymbols.getInstance(java.base@17.0.9/DecimalFormatSymbols.java:182)
at java.util.Formatter.zero(java.base@17.0.9/Formatter.java:2450)
at java.util.Formatter$FormatSpecifier.getZero(java.base@17.0.9/Formatter.java:4450)
at java.util.Formatter$FormatSpecifier.localizedMagnitude(java.base@17.0.9/Formatter.java:4466)
at java.util.Formatter$FormatSpecifier.print(java.base@17.0.9/Formatter.java:3276)
at java.util.Formatter$FormatSpecifier.print(java.base@17.0.9/Formatter.java:3261)
at java.util.Formatter$FormatSpecifier.printInteger(java.base@17.0.9/Formatter.java:2957)
at java.util.Formatter$FormatSpecifier.print(java.base@17.0.9/Formatter.java:2918)
at java.util.Formatter.format(java.base@17.0.9/Formatter.java:2689)
at java.util.Formatter.format(java.base@17.0.9/Formatter.java:2625)
at java.lang.String.format(java.base@17.0.9/String.java:4186)
at org.opensearch.cluster.routing.allocation.decider.ThrottlingAllocationDecider.allocateNonInitialShardCopies(ThrottlingAllocationDecider.java:243)
at org.opensearch.cluster.routing.allocation.decider.ThrottlingAllocationDecider.canAllocate(ThrottlingAllocationDecider.java:200)
at org.opensearch.cluster.routing.allocation.decider.AllocationDeciders.canAllocate(AllocationDeciders.java:94)
at org.opensearch.cluster.routing.allocation.allocator.LocalShardsBalancer.decideAllocateUnassigned(LocalShardsBalancer.java:927)
at org.opensearch.cluster.routing.allocation.allocator.LocalShardsBalancer.allocateUnassigned(LocalShardsBalancer.java:813)
at org.opensearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:288)
at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:557)
"clusterManagerService#updateTask"
"opensearch[b869f183befc74cff9f3b5572821ec21][clusterManagerService#updateTask][T#1]" #3427 daemon prio=5 os_prio=0 cpu=153486544.07ms elapsed=184535.29s tid=0x0000fffee40040e0 nid=0x1152 runnable [0x0000fffdf1ba1000]
java.lang.Thread.State: RUNNABLE
at java.util.HashMap.hash(java.base@17.0.9/HashMap.java:338)
at java.util.HashMap.getNode(java.base@17.0.9/HashMap.java:568)
at java.util.HashMap.get(java.base@17.0.9/HashMap.java:556)
at java.util.Collections$UnmodifiableMap.get(java.base@17.0.9/Collections.java:1502)
at org.opensearch.cluster.routing.allocation.decider.DiskThresholdDecider.getDiskUsage(DiskThresholdDecider.java:577)
at org.opensearch.cluster.routing.allocation.decider.DiskThresholdDecider.canAllocate(DiskThresholdDecider.java:224)
at org.opensearch.cluster.routing.allocation.decider.AllocationDeciders.canAllocate(AllocationDeciders.java:94)
at org.opensearch.cluster.routing.allocation.allocator.LocalShardsBalancer.decideAllocateUnassigned(LocalShardsBalancer.java:927)
at org.opensearch.cluster.routing.allocation.allocator.LocalShardsBalancer.allocateUnassigned(LocalShardsBalancer.java:813)
at org.opensearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:288)
at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:557)
at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:501)
at org.opensearch.action.admin.cluster.reroute.TransportClusterRerouteAction$ClusterRerouteResponseAckedClusterStateUpdateTask.execute(TransportClusterRerouteAction.java:269)
at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67)
at org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:882)
at org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:434)
at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:301)
at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:212)
at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:209)
at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:247)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:863)
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283)
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.9/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.9/ThreadPoolExecutor.java:635)
at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
Describe the solution you'd like
Break long-running executions into smaller batches with checkpoints, so that other critical operations can be unblocked between two batch executions.
Parallelise allocation decider execution to speed up single-threaded cluster manager task execution.
Make long-running executions such as reroute and routing table processing non-blocking, and evaluate optimistic concurrency controls.
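To make the first proposal concrete, here is a minimal sketch of checkpointed batching on a single-threaded task queue. Everything in it is hypothetical and for illustration only: `CheckpointedReroute`, `simulate`, the plain `Deque<Runnable>` standing in for the cluster manager update thread, and the fixed `batchSize` are assumptions, not OpenSearch classes or settings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch, not OpenSearch code: a long reroute is split into
// batches that re-enqueue themselves, so other queued tasks can interleave.
final class CheckpointedReroute {

    // Simulates a single-threaded task queue (a stand-in for the cluster
    // manager update thread) and returns the order in which tasks ran.
    static List<String> simulate(int shardCount, int batchSize) {
        List<String> log = new ArrayList<>();
        Deque<Runnable> queue = new ArrayDeque<>();
        Deque<Integer> unassigned = new ArrayDeque<>();
        for (int i = 0; i < shardCount; i++) {
            unassigned.add(i);
        }

        // One-element array so the task can refer to itself when re-enqueueing.
        Runnable[] rerouteBatch = new Runnable[1];
        rerouteBatch[0] = () -> {
            int done = 0;
            while (done < batchSize && !unassigned.isEmpty()) {
                unassigned.poll(); // stand-in for running the deciders on one shard
                done++;
            }
            log.add("reroute batch of " + done);
            if (!unassigned.isEmpty()) {
                // Checkpoint: go to the back of the queue instead of holding the thread.
                queue.add(rerouteBatch[0]);
            }
        };

        queue.add(rerouteBatch[0]);
        queue.add(() -> log.add("cluster_update_settings"));

        while (!queue.isEmpty()) {
            queue.poll().run();
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(simulate(5, 2));
    }
}
```

With 5 shards and a batch size of 2, the settings update runs right after the first batch instead of waiting for the whole reroute. A real implementation would also need to revalidate each batch against the latest cluster state, since other tasks may have changed it between checkpoints; that is where the optimistic concurrency control mentioned above would come in.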
Related component
ShardManagement:Performance
Describe alternatives you've considered
No response
Additional context
No response
Another option is to run these deciders on a batch of shards instead of one shard at a time. On large clusters, running the deciders per shard slows allocation down significantly.
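A rough sketch of what batching could buy, assuming some decider checks depend only on the node and not on the individual shard. The names here (`BatchedNodeCheck`, `perShard`, `batched`, and the `Predicate<String>` standing in for a node-level decider) are hypothetical and do not reflect the actual `AllocationDecider` API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch, not OpenSearch code: evaluate a node-level check once
// per node for a batch of shards instead of once per (shard, node) pair.
final class BatchedNodeCheck {

    // Per-shard evaluation: nodeCheck runs shards.size() * nodes.size() times.
    static int perShard(List<String> shards, List<String> nodes, Predicate<String> nodeCheck) {
        int eligible = 0;
        for (String shard : shards) {
            for (String node : nodes) {
                if (nodeCheck.test(node)) {
                    eligible++;
                }
            }
        }
        return eligible;
    }

    // Batched evaluation: nodeCheck runs nodes.size() times, and the verdicts
    // are reused for every shard in the batch.
    static int batched(List<String> shards, List<String> nodes, Predicate<String> nodeCheck) {
        Map<String, Boolean> verdicts = new HashMap<>();
        for (String node : nodes) {
            verdicts.put(node, nodeCheck.test(node));
        }
        int eligible = 0;
        for (String shard : shards) {
            for (String node : nodes) {
                if (verdicts.get(node)) {
                    eligible++;
                }
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        List<String> shards = List.of("s0", "s1", "s2");
        List<String> nodes = List.of("n1", "n2");
        Predicate<String> notN2 = node -> !node.equals("n2");
        System.out.println(perShard(shards, nodes, notN2) + " " + batched(shards, nodes, notN2));
    }
}
```

Both methods return the same count of eligible (shard, node) pairs, but the batched version invokes the check once per node. This only holds for checks whose outcome does not change while the batch is being allocated; anything that depends on in-flight recoveries or per-node shard counts would need its cached verdict invalidated after each assignment.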