
Thoughts on Elasticsearch cluster configuration #365

Open
thepsalmist opened this issue Feb 12, 2025 · 9 comments

@thepsalmist
Contributor

thepsalmist commented Feb 12, 2025

As we prepare to configure the new server cluster, here are some thoughts/questions:

  1. Our current ES setup has 3 nodes, with each node serving all eligible roles (master, data, ingest, coordinating). As we increase the number of nodes, we need to have dedicated master-eligible nodes (master-only role).
    ES recommends at least three master-eligible nodes for HA.
  2. What is the current resource configuration of the nodes? Do we need bigger data nodes and slightly smaller master-eligible nodes? Data nodes handle all data-related CRUD operations. From the ES audit, we established that our use case does not need specific data tiers, so we'll have generic data nodes.
  3. Explicit declaration of a coordinating-only node. Coordinating nodes act as our smart load balancers. Coordinating-only nodes can benefit large clusters by offloading the coordinating role from the data and master-eligible nodes.
    We could also use the master-eligible nodes as coordinating nodes, though ES advises against this configuration. (A quick way to verify the resulting role layout is sketched below.)
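
For reference, here's a minimal sketch of how we could confirm the role layout once the nodes are reconfigured. It assumes the elasticsearch-py client and a placeholder cluster URL, not our actual hosts:

```python
from elasticsearch import Elasticsearch

# Placeholder cluster URL; substitute the real endpoint.
es = Elasticsearch("http://es-cluster.example.internal:9200")

# _cat/nodes reports an abbreviated role string per node, e.g. "m" for a
# dedicated master, "di" for data+ingest, "-" for coordinating-only, and
# marks the currently elected master in the "master" column.
print(es.cat.nodes(v=True, h="name,node.role,master,heap.percent,disk.used_percent"))
```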
@pgulley
Member

pgulley commented Feb 14, 2025

The way I'm thinking now, there are three big questions to answer, and we kind of have to answer them in sequence.

  1. Index mapping: We're already mostly settled here, I think. The biggest conclusion is that we should disable fielddata everywhere, but there are some other open questions from If we ever reindex... #344, specifically around keyword fields and date representation. (A rough mapping sketch follows this list.)
  2. Shard size: There's an empirical process for determining this value, so we just have to follow that process. I imagine our current shard size would be a fine starting point, and the question is whether we want to shrink that value at all.
  3. Dedicated Master Nodes: The advice online is pretty clearly in favor of dedicated master nodes at our scale. I think while we're testing the previous two questions we can continue having all nodes serve all roles and keep an eye on how things scale. If we determine that we need dedicated master nodes, we need to think about how to host them: certainly all the other machines in the Angwin cluster are oversized for the task, and the eight new nodes were spec'd as data nodes. Would it be absurd to host the master nodes on machines with other tenants, or is it essential that the master machines serve no other purpose? We could just set the three data nodes we've already got as part-time masters.
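
To make the mapping point a bit more concrete, here's a rough sketch of the kind of thing we're discussing, with placeholder index and field names rather than the actual story-indexer schema, assuming the elasticsearch-py client:

```python
from elasticsearch import Elasticsearch

# Placeholder cluster URL and index name.
es = Elasticsearch("http://es-cluster.example.internal:9200")

es.indices.create(
    index="stories-mapping-test",
    mappings={
        "properties": {
            # Text field with fielddata left disabled (the default), plus a
            # keyword sub-field for aggregations and exact matching.
            "title": {
                "type": "text",
                "fielddata": False,
                "fields": {"raw": {"type": "keyword", "ignore_above": 512}},
            },
            # Explicit date format instead of relying on dynamic detection.
            "publish_date": {"type": "date", "format": "strict_date_optional_time"},
        },
    },
)
```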

I think for now we leave the question of coordinating nodes for future optimization; that linked document says "The benefit of coordinating only nodes should not be overstated — data nodes can happily serve the same purpose."

@philbudne
Contributor

philbudne commented Feb 15, 2025 via email

@pgulley
Member

pgulley commented Feb 18, 2025

It's always possible to change the node type distribution down the line if we find our configuration needs adjusting. Three master-eligible nodes and five pure data nodes seems like a fine way to approach this to start.

What considerations are blocking us from setting up a shard-size experiment tomorrow, if we want to start iterating ASAP?

Also wondering how we would benchmark the reindex time: can we use the reindex API when bringing up test indices, and keep a timer running so we have a sense of the time cost per story? Is that too simple an approach?
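
Something like the following is what I have in mind: kick off an asynchronous reindex into a test index and poll the Tasks API so we can record wall-clock time and a per-story cost. This is only a sketch, assuming the elasticsearch-py client and placeholder index names:

```python
import time

from elasticsearch import Elasticsearch

# Placeholder cluster URL and index names.
es = Elasticsearch("http://es-cluster.example.internal:9200")

# Start the reindex asynchronously so we can poll its progress.
resp = es.reindex(
    source={"index": "stories-current"},
    dest={"index": "stories-reindex-test"},
    wait_for_completion=False,
)
task_id = resp["task"]

start = time.monotonic()
while True:
    status = es.tasks.get(task_id=task_id)
    if status.get("completed"):
        break
    time.sleep(30)

elapsed = time.monotonic() - start
docs = status["task"]["status"]["total"]
print(f"reindexed {docs} docs in {elapsed:.0f}s "
      f"({elapsed / max(docs, 1):.4f}s per story)")
```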

@philbudne
Contributor

philbudne commented Feb 18, 2025 via email

@thepsalmist
Contributor Author

> If we're not using ansible, I'd like to have the ES install script (or a parent top-level script) also install the statsd-agent, which reports server stats to statsd/graphite/grafana.

I would also propose doing this with Ansible. Having the scripts in the story-indexer repo works, but it means we'd probably need to run the install on every host (8 in this case??), whereas with Ansible it could be executed from a single control machine.

@pgulley
Member

pgulley commented Feb 19, 2025

I had always assumed we would want to use Ansible!

Here's a testing framework for cluster optimization, surprised it hadn't come up before: https://github.com/elastic/rally

@rahulbot
Contributor

Legacy note: we used Ansible effectively in the old system to spin up machines with just a quick command. This turned out to be very useful from my perspective, even if it seemed like there was a learning curve to using it.

@pgulley
Member

pgulley commented Feb 20, 2025

Re: Elastic Rally- it looks like it's possible to make "test tracks" from existing cluster data for new clusters: https://esrally.readthedocs.io/en/stable/adding_tracks.html#creating-a-track-from-data-in-an-existing-cluster

Rally won't give us the ability to test bare-metal configuration details like the JVM heap size, but I think we already know the right values to use there. Shard size should be doable, though: if we want to identify 4-8 sample shard sizes, we can automate a test and just leave it running for a couple of days.
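
Outside of Rally, the automated sweep could be as simple as the sketch below: create one test index per candidate shard count, copy the same fixed sample into each, and compare on-disk size and query latency afterwards. Index names, shard counts, and the sample size are placeholders, and this assumes the elasticsearch-py 8.x client:

```python
from elasticsearch import Elasticsearch

# Placeholder cluster URL, index names, and candidate values.
es = Elasticsearch("http://es-cluster.example.internal:9200")

CANDIDATE_SHARD_COUNTS = [1, 2, 4, 8]
SAMPLE_DOCS = 5_000_000  # same fixed sample copied into every test index

for shards in CANDIDATE_SHARD_COUNTS:
    test_index = f"stories-shards-{shards}"
    es.indices.create(
        index=test_index,
        settings={"index": {"number_of_shards": shards, "number_of_replicas": 0}},
    )
    # Copy the sample asynchronously; progress can be watched via the Tasks API.
    es.reindex(
        source={"index": "stories-current"},
        dest={"index": test_index},
        max_docs=SAMPLE_DOCS,
        wait_for_completion=False,
    )

# Once the copies finish, compare primary store size and doc counts per index.
print(es.cat.indices(index="stories-shards-*", v=True,
                     h="index,pri,docs.count,pri.store.size"))
```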
