
[BUG] Wazuh cluster dies with Opensearch 2.16 due to .opendistro_security index missing #17453

Open
SIEMsational25 opened this issue Feb 25, 2025 · 1 comment
Labels
bug (Something isn't working), untriaged

Comments


SIEMsational25 commented Feb 25, 2025

Describe the bug

Unsure whether this is an OpenSearch or a Wazuh-specific problem, but since I got no response whatsoever from Wazuh, I'll try again here.

After upgrading to Wazuh Dashboard 4.10 (a fork of OpenSearch Dashboards, as I understand it), the dashboard becomes unresponsive after some time. The issue presents itself as OpenSearch shard-failure errors that make the dashboard inaccessible.
This appears to be caused by the underlying OpenSearch upgrade from 2.13 to 2.16.

Related component

No response

To Reproduce

Install or upgrade to Wazuh 4.10 (or 4.11). Our test deployment essentially creates new VMs that reuse the old hard drives with the new version.

After a while (~4 hours, sometimes half a day to a day), the dashboard fails to connect and displays errors.

Expected behavior

The Wazuh Dashboard should remain accessible and functional without OpenSearch shard failures.

Additional Details

```
{"error":{"root_cause":[{"type":"exception","reason":"java.util.concurrent.TimeoutException: Timeout after 10SECONDS while retrieving configuration for [INTERNALUSERS](index=.opendistro_security)"}],"type":"exception","reason":"java.util.concurrent.TimeoutException: Timeout after 10SECONDS while retrieving configuration for [INTERNALUSERS](index=.opendistro_security)","caused_by":{"type":"timeout_exception","reason":"Timeout after 10SECONDS while retrieving configuration for [INTERNALUSERS](index=.opendistro_security)"}},"status":500}
2025-02-17T16:13:00Z dashboards.out {"type":"log","@timestamp":"2025-02-17T16:13:00Z","tags":["error","opensearch","data"],"pid":1078,"message":"[search_phase_execution_exception]: all shards failed"}
2025-02-17T16:32:08Z indexer.out org.opensearch.action.search.SearchPhaseExecutionException: all shards failed
2025-02-17T16:32:08Z indexer.out        at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:770) [opensearch-2.16.0.jar:2.16.0]
2025-02-17T16:32:08Z indexer.out        at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:395) [opensearch-2.16.0.jar:2.16.0]
2025-02-17T16:32:08Z indexer.out        at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:810) [opensearch-2.16.0.jar:2.16.0]
2025-02-17T16:32:08Z indexer.out        at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:548) [opensearch-2.16.0.jar:2.16.0]
2025-02-17T16:32:08Z indexer.out        at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$0(AbstractSearchAsyncAction.java:290) [opensearch-2.16.0.jar:2.16.0]
2025-02-17T16:32:08Z indexer.out        at org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:373) [opensearch-2.16.0.jar:2.16.0]
2025-02-17T16:32:08Z indexer.out        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:941) [opensearch-2.16.0.jar:2.16.0]
2025-02-17T16:32:08Z indexer.out        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.16.0.jar:2.16.0]
2025-02-17T16:32:08Z indexer.out        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
2025-02-17T16:32:08Z indexer.out        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
2025-02-17T16:32:08Z indexer.out        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
```
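For anyone hitting the same timeout: a first diagnostic step is to confirm whether the `.opendistro_security` index still exists and what state its shards are in, and if it is missing or red, to re-initialize it with `securityadmin.sh`. A rough sketch below; the certificate and config paths follow the default Wazuh indexer layout and are assumptions — adjust them to your deployment.

```shell
# Check existence and shard health of the security index
# (admin client cert paths are assumed, adjust as needed):
curl -sk --cert /etc/wazuh-indexer/certs/admin.pem \
     --key /etc/wazuh-indexer/certs/admin-key.pem \
     "https://localhost:9200/_cat/indices/.opendistro_security?v"

curl -sk --cert /etc/wazuh-indexer/certs/admin.pem \
     --key /etc/wazuh-indexer/certs/admin-key.pem \
     "https://localhost:9200/_cluster/health/.opendistro_security?pretty"

# If the index is missing or red, re-upload the security configuration.
# Paths below assume the default Wazuh indexer install layout:
export JAVA_HOME=/usr/share/wazuh-indexer/jdk
/usr/share/wazuh-indexer/plugins/opensearch-security/tools/securityadmin.sh \
  -cd /etc/wazuh-indexer/opensearch-security/ \
  -icl -nhnv \
  -cacert /etc/wazuh-indexer/certs/root-ca.pem \
  -cert /etc/wazuh-indexer/certs/admin.pem \
  -key /etc/wazuh-indexer/certs/admin-key.pem
```

Note that `securityadmin.sh` overwrites the security index from the YAML files in `-cd`, so any users or roles created only through the API and not reflected in those files would be lost.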

OpenSearch becomes unreachable for Logstash as well:

```
2025-02-17T16:30:20Z wazuh-logstash.out [2025-02-17T16:30:20,515][ERROR][logstash.outputs.opensearch][archives][] Attempted to send a bulk request but OpenSearch appears to be unreachable or down {:message=>"OpenSearch Unreachable: [https://swz:xxxxxx@indexer-ingest-***:9200/][Manticore::SocketTimeout] Read timed out", :exception=>LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError, :will_retry_in_seconds=>64}
```

This may be related to the OpenSearch upgrade from 2.13 to 2.16.
We consider this critical because it would render our database unusable: OpenSearch cannot be downgraded, and migrating to a new cluster is not feasible given the sheer disk size (terabyte range).
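Since a downgrade is not an option, a hedged precaution for anyone about to attempt this upgrade is to snapshot the security index beforehand so it can be restored if the upgrade corrupts it. A sketch using the standard OpenSearch snapshot API; the repository name, snapshot location, and credentials are illustrative assumptions, and the `location` must be listed under `path.repo` in `opensearch.yml` on every node.

```shell
# Register a filesystem snapshot repository
# (repo name "pre_upgrade" and location are assumptions):
curl -sk -u admin:changeme -X PUT \
  "https://localhost:9200/_snapshot/pre_upgrade" \
  -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/mnt/snapshots"}}'

# Snapshot only the security index, synchronously:
curl -sk -u admin:changeme -X PUT \
  "https://localhost:9200/_snapshot/pre_upgrade/security-pre-2-16?wait_for_completion=true" \
  -H 'Content-Type: application/json' \
  -d '{"indices": ".opendistro_security", "include_global_state": false}'
```

Restoring a system index like `.opendistro_security` may additionally require closing or deleting the broken index first, so test the restore path on a non-production node.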

Edit: Added more debug logs

@SIEMsational25 SIEMsational25 added bug Something isn't working untriaged labels Feb 25, 2025
@SIEMsational25 SIEMsational25 changed the title [BUG] Opensearch breaks with 'all shards failed' error after running for hours [BUG] Wazuh cluster dies with Opensearch 2.16 due to .opendistro_security index missing Mar 4, 2025

kkhatua commented Mar 11, 2025

@derek-ho / @DarshitChanpura can you take a look at this?
