Cluster Connectivity in us-east-1

Incident Report for ESS (Public)

Resolved

We have monitored for the past 2 hours and error rates have stayed at a normal level, so we are marking the incident as resolved. We will continue to monitor.

Posted Nov 05, 2017 - 23:02 UTC

Monitoring

We are continuing to monitor our environment now that cluster availability has returned to normal. We will keep posting updates as we monitor, until the incident is resolved.

Posted Nov 05, 2017 - 22:01 UTC

Identified

We have identified the root cause(s) of today's incident and have brought error rates back down to a normal level for Elasticsearch and Kibana requests. Users should be able to successfully manage and explore their clusters.

Posted Nov 05, 2017 - 21:29 UTC

Update

We are seeing an improvement in access for our Kibana users while we continue to investigate the root cause of the incident.

Posted Nov 05, 2017 - 20:52 UTC

Update

We are still investigating the issue and continue to see status errors, mostly for our Kibana users.

Posted Nov 05, 2017 - 20:02 UTC

Update

We are still investigating the issue. Users who rely heavily on Kibana are likely to run into status errors as they navigate to their Elasticsearch clusters. We will update as soon as we are able to provide more details.

Posted Nov 05, 2017 - 19:20 UTC

Update

We are still investigating the issue. The status remains the same, with fewer than 5% of Elasticsearch clusters impacted.

Posted Nov 05, 2017 - 18:59 UTC

Update

We are still investigating the issue, and will provide an update when we have more details.

Posted Nov 05, 2017 - 18:03 UTC

Update

We are still looking into the issue. A small percentage (less than 5%) of Elasticsearch clusters have been impacted and are experiencing connectivity issues, but the problem seems to be more widespread for Kibana instances. We will update as soon as we have more details.

Posted Nov 05, 2017 - 17:28 UTC

Investigating

We are still investigating the issue and continuing to work through mitigation steps.

Posted Nov 05, 2017 - 17:00 UTC

Update

We're having trouble propagating routing tables to all of our proxies in us-east-1. As a result, a subset of the proxies has stale routing information. We're still actively working on the issue.

Posted Nov 05, 2017 - 16:02 UTC
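
For readers unfamiliar with this failure mode, the stale-routing condition described in the update above can be illustrated with a minimal sketch. It assumes each proxy reports the version number of the routing table it currently holds. The `ProxyState` type, the `find_stale_proxies` helper, and the version numbers are hypothetical and are not taken from the ESS implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyState:
    """Hypothetical snapshot of one proxy (not an ESS data structure)."""
    name: str
    routing_version: int  # version of the routing table the proxy currently holds


def find_stale_proxies(proxies, latest_version):
    """Return the sorted names of proxies whose routing table lags the latest version."""
    return sorted(p.name for p in proxies if p.routing_version < latest_version)


if __name__ == "__main__":
    # Illustrative fleet: two proxies missed the most recent routing update.
    fleet = [
        ProxyState("proxy-a", 42),
        ProxyState("proxy-b", 41),
        ProxyState("proxy-c", 42),
        ProxyState("proxy-d", 39),
    ]
    print(find_stale_proxies(fleet, latest_version=42))
```

A request that lands on one of the lagging proxies may be routed to an address that no longer serves the cluster, which is one plausible way for errors to appear only intermittently.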

Update

We're still working through mitigation steps.

Posted Nov 05, 2017 - 15:04 UTC

Identified

We're continuing to mitigate the connection issues affecting some clusters in us-east-1. These appear as intermittent 502s. The majority of clusters are not affected, so we are moving as fast as we can without disrupting them.

Posted Nov 05, 2017 - 13:07 UTC

Update

We are currently seeing connectivity issues with clusters in us-east-1 and are actively investigating the extent of the impact.

Posted Nov 05, 2017 - 12:16 UTC

Investigating

We are currently investigating this issue.

Posted Nov 05, 2017 - 11:40 UTC