- We've confirmed that the changes made by the Elasticsearch team fix the performance issues with hot-warm deployments. We're continuing to work closely with the Elasticsearch team to make this fix available to Cloud customers as soon as possible. We will update you again when we have a timeframe for when the release will be available on Cloud and available to download via elastic.co/downloads.
May 22, 06:15 UTC
- After digging into the weeds of Linux file systems (IO schedulers, xfsslower tracing, and friends), we suspect that the performance degradation is due to excessive fsyncs. To support Cross Cluster Replication, Elasticsearch retains historical operations on certain indices. The number of historical operations retained in an index is controlled by a new mechanism called retention leases. The leases are maintained by the primary copy of each shard and are synchronized to the replicas. With every synchronization, we issue an fsync to the file system to persist the file where the leases are stored. For simplicity, we currently sync the leases every 30 seconds. Sadly, for clusters that have many shards and run on spinning disks (hello, warm nodes!), this creates a lot of fsyncs. These numerous fsyncs appear to put heavy IO load on the machines and delay the persistence of cluster state updates to disk. The delays can be large enough that the new cluster coordination subsystem deems the nodes unstable and removes them from the cluster. These fsyncs only arise on indices created since 6.5 with a special index setting; to support future features, that setting is the default for indices created since 7.0. The Elasticsearch team has already created a pull request to fix this issue, and we are currently working on confirming it in our staging Cloud environment. Once we have confirmed that the pull request fixes the issue, we will take the necessary next steps to get this fixed for all impacted Cloud users (and other users of Elasticsearch). We will update you again when we have confirmation of the fix (ETA 6 hours).
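To put the fsync pressure described above in perspective, here is a rough sketch. The shard count and the micro-benchmark below are illustrative assumptions, not measured Cloud values: the point is simply that a per-shard lease sync every 30 seconds scales linearly with shard density, and a densely packed warm node on spinning disks can end up spending much of its IO budget on these tiny synchronous writes.

```python
import os
import tempfile
import time

# Back-of-the-envelope numbers (assumptions for illustration only):
SHARD_COPIES = 1_000      # shard copies hosted on one hypothetical warm node
SYNC_INTERVAL_S = 30      # the lease sync period described above

# Each shard copy fsyncs its retention-lease file every 30 seconds, so the
# node-wide average fsync rate grows linearly with the shard count.
avg_fsync_rate = SHARD_COPIES / SYNC_INTERVAL_S
print(f"average lease fsync rate: {avg_fsync_rate:.1f}/s")

def timed_fsyncs(path, n):
    """Write and fsync a tiny file n times; return elapsed seconds."""
    t0 = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(n):
            f.seek(0)
            f.write(b"retention-lease-state")  # small, lease-file-sized payload
            f.flush()
            os.fsync(f.fileno())
    return time.monotonic() - t0

with tempfile.TemporaryDirectory() as d:
    elapsed = timed_fsyncs(os.path.join(d, "lease"), 100)

# Spinning disks typically sustain far fewer fsyncs/s than SSDs, which is
# why warm nodes are hit hardest.
print(f"100 fsyncs took {elapsed:.3f}s "
      f"(~{100 / max(elapsed, 1e-9):.0f} fsyncs/s on this disk)")
```

The micro-benchmark measures a single file on whatever disk backs the temp directory; on a spinning disk the sustainable rate can drop to the same order of magnitude as the lease-sync demand itself, leaving little IO headroom for persisting cluster state.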
May 21, 21:22 UTC
- We've determined this issue is only affecting up to 5% of hot-warm deployments in both AWS- and GCP-based regions. Customers with hot-warm deployments experiencing any of the following symptoms are encouraged to contact email@example.com:
* Slow response times
* API calls timing out
* Temporarily unavailable shards on the warm tier
* New hot-warm deployments timing out during initial deployment
We'll update this incident within 24 hours, or sooner if we determine an effective mitigation for the affected instances and a longer-term fix.
May 20, 23:52 UTC
- We're currently looking into short-term mitigations for the disk performance issues with warm-tier nodes in hot-warm deployments in AWS. We are also discussing potential longer-term solutions for the disk performance. Further updates on this issue will come within the next 6 hours. If you are impacted by this issue or have questions related to it, please contact firstname.lastname@example.org
May 20, 18:14 UTC
Update - We are continuing to investigate this issue.
May 20, 16:15 UTC
- We're currently experiencing performance issues with a small percentage of warm-tier nodes in hot-warm deployments in AWS. We have identified the issue, which appears to be related to disk performance on that server tier. Customers may experience the following symptoms: slow response times, timed-out API calls, and temporarily unavailable shards on the warm tier; new hot-warm deployments may also time out during initial deployment. We are currently determining a long-term fix for the problem. If you are impacted by this issue, please contact email@example.com
May 20, 15:48 UTC