ECHO Degraded Service Event

Event Period: 06/08/2012 15:00 - 18:30 EDT

Summary

What Happened: ECHO became unavailable due to problems with its ElasticSearch search cluster

When: 06/08/2012 15:00 - 18:30 EDT

Duration: 3.5 hours

Planned Resolution: Upgrade of ElasticSearch and Java runtimes on database hosts

Details

On Friday, June 8, 2012, between approximately 15:00 and 18:30 EDT, ECHO experienced a problem with its search cluster in the Operational Environment. Our search platform, ElasticSearch, had become unresponsive on all three nodes. After some initial troubleshooting, we decided to restart ElasticSearch. Attempts to restart it failed because a Linux system problem caused the Java process to crash with a segmentation fault, which in turn resulted in internode communication problems. We then rebooted the Linux hosts serving ElasticSearch and our Oracle Database, which allowed ElasticSearch to start. Although the investigation has not identified a root cause, several observations and recommendations should help improve future system stability.

Timeline of Events

On Friday, around 12:12 EDT, before the outage, ElasticSearch logged a single error event that indicated a failure to execute a fetch.

Between 14:11 and 15:07, 206 events of the same type ("Failed to execute fetch phase") occurred on the same ElasticSearch node (node1). Most of these events were RemoteTransportExceptions citing a transport error with another node in the ElasticSearch cluster (node2); only 5 cited the third node (node3). During this period, Reverb began reporting connection errors. The landing page was automatically put up by our High Availability proxy service.
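The per-node breakdown above can be reproduced from the ElasticSearch logs with a small script. The following is a minimal sketch, not the tooling used during the incident. It assumes each failure line contains the phrase "Failed to execute fetch phase" and, for remote failures, the text "RemoteTransportException[[<node name>]"; the function name, the regular expression, and the sample lines (including their timestamps and addresses) are illustrative and should be adjusted to the real log format.

```python
import re
from collections import Counter

# Assumed log shape (not taken from the actual incident logs): a failure line
# contains FETCH_FAILURE and, when the error came from another node,
# "RemoteTransportException[[<node name>]".
FETCH_FAILURE = "Failed to execute fetch phase"
REMOTE_NODE = re.compile(r"RemoteTransportException\[\[([^\]]+)\]")


def count_fetch_failures(lines):
    """Count fetch-phase failures, keyed by the remote node named in the error."""
    counts = Counter()
    for line in lines:
        if FETCH_FAILURE not in line:
            continue
        match = REMOTE_NODE.search(line)
        counts[match.group(1) if match else "(no remote node)"] += 1
    return counts


# Synthetic sample lines, for illustration only.
sample = [
    "[14:11:02][WARN ][search] [node1] Failed to execute fetch phase "
    "RemoteTransportException[[node2][inet[/10.0.0.2:9300]][search/phase/fetch/id]]",
    "[14:11:05][WARN ][search] [node1] Failed to execute fetch phase "
    "RemoteTransportException[[node3][inet[/10.0.0.3:9300]][search/phase/fetch/id]]",
    "[14:11:09][WARN ][search] [node1] Failed to execute fetch phase "
    "RemoteTransportException[[node2][inet[/10.0.0.2:9300]][search/phase/fetch/id]]",
    "[14:11:10][INFO ][cluster] [node1] unrelated message",
]

for node, total in count_fetch_failures(sample).most_common():
    print(node, total)
```

Running such a count over a sliding time window would make a jump from one event to hundreds of events easier to notice before the cluster becomes unresponsive.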

Around 15:00, we began investigating the errors. We noted that ElasticSearch was garbage collecting continuously and using a large percentage of CPU resources. We discussed live debugging with jstack or other JVM tools, but determined that we had little practice doing this with ElasticSearch and that it could delay the resumption of ECHO services. We decided that the quickest way to resume service was to shut down ElasticSearch and restart it.
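One way to quantify "continuously garbage collecting" is to compare two samples of `jstat -gcutil <pid>`, whose GCT column reports cumulative garbage collection time in seconds. The sketch below is an illustration under stated assumptions, not a procedure that was used during the incident: the helper names are hypothetical, the two samples are synthetic, and the column layout shown is the one printed by the Java 6/7 era `jstat`.

```python
def parse_gcutil(output):
    """Parse the header row and the last data row of `jstat -gcutil` output into a dict."""
    rows = [line.split() for line in output.strip().splitlines() if line.strip()]
    header, last = rows[0], rows[-1]
    return {name: float(value) for name, value in zip(header, last)}


def gc_overhead(before, after, interval_seconds):
    """Fraction of wall-clock time spent in GC between two samples (GCT is cumulative)."""
    return (after["GCT"] - before["GCT"]) / interval_seconds


# Synthetic samples taken 10 seconds apart, for illustration only.
first = """
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
  0.00  99.74 100.00  99.98  95.82   4100   60.000   400   40.000  100.000
"""
second = """
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
  0.00  99.74 100.00  99.99  95.82   4100   60.000   436   49.000  109.000
"""

overhead = gc_overhead(parse_gcutil(first), parse_gcutil(second), interval_seconds=10)
print(f"GC overhead: {overhead:.0%}")
```

A value close to 100%, together with an old generation (the O column) that stays nearly full after full collections, would be consistent with a heap that is too small for the workload. It would not by itself identify the cause.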

At 15:12, we initiated shutdown of the cluster. At 15:12, shutdown of the cluster appeared to be complete.

After shutting down the cluster, between 15:13:48 and 15:41:12, we made several attempts to restart ElasticSearch on node2. All attempts failed with a segmentation fault of the Java process. Based on chat logs from the time, ant (another Java process) was able to execute successfully and kick off the ElasticSearch Java process, but the ElasticSearch process itself was unable to start. The ElasticSearch Java process also did not run long enough before crashing for diagnostics to be run against it.

Around 16:15, we decided to reboot all 3 Linux nodes in our ElasticSearch cluster in an attempt to restart the ElasticSearch service without a Java segmentation fault at launch.

A reboot of all 3 machines was initiated at 16:23. All processes came back up normally and began replicating across nodes. This replication and data balancing took some time; by 17:46, searches were possible with slightly degraded performance. Ingest was resumed at 18:25, and an email notifying stakeholders of the resumption of service was sent at 18:28.

Observations

·  At the time of failure, system resources on node2 did not appear to be in short supply. Memory, swap, run queue, buffers, I/O, etc. all appeared normal.

The segmentation fault of the Java Virtual Machine and its exit code 6 could be attributed to a number of causes, listed in the following order of probability:

1. A defect in the Java Virtual Machine supporting ElasticSearch.

2. A JVM-related resource deficiency (heap, stack, etc.).

3. A system-related resource deficiency (memory, disk, file handles, etc.).

4. A defect in ElasticSearch.
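When recording a crash like this, it helps to note exactly how the process ended. This report does not say whether "exit code 6" was a plain exit status or a signal number. On Linux, signal 6 is SIGABRT (raised by `abort()`) and signal 11 is SIGSEGV. The sketch below, with a hypothetical helper name, decodes the two common conventions: Python's `subprocess` reports death by signal N as `-N`, and shells report it as `128 + N`. It is an aid for future incidents and makes no claim about what happened here.

```python
import signal


def describe_exit(returncode):
    """Describe an exit status as reported by subprocess (-N) or by a shell (128 + N)."""
    if returncode < 0:
        signum = -returncode
    elif returncode > 128:
        signum = returncode - 128
    else:
        return f"exited normally with status {returncode}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown signal"
    return f"terminated by signal {signum} ({name})"


# Example statuses: plain exit status 6, subprocess-style -6, shell-style 134 and 139.
for code in (6, -6, 134, 139):
    print(code, "->", describe_exit(code))
```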

Recommendations

Based on this outage and the subsequent investigation and observations, we recommend the following actions:

·  Upgrade ElasticSearch to the latest stable version and test it (scheduled for release with ECHO 10.50).

·  Standardize the version of Java on each node supporting ElasticSearch (Mac OS X and Linux).

·  Spend time familiarizing the team with live debugging techniques, including strace and jstack.
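As a starting point for the last recommendation, the sketch below captures a few jstack thread dumps a few seconds apart, so that threads stuck in the same place across dumps stand out. It is a minimal sketch under stated assumptions: `jstack` is on the PATH and is run as the same user as the target JVM, and the function name, file naming, and default count and interval are illustrative choices, not an established team procedure.

```python
import subprocess
import time
from pathlib import Path


def capture_thread_dumps(pid, out_dir, count=3, interval_seconds=5.0, run=subprocess.run):
    """Write `count` jstack thread dumps for `pid` into `out_dir`, spaced by `interval_seconds`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        # `run` is injectable so the function can be exercised without a live JVM.
        result = run(["jstack", str(pid)], capture_output=True, text=True, check=False)
        path = out_dir / f"jstack-{pid}-{i}.txt"
        path.write_text(result.stdout)
        paths.append(path)
        if i < count - 1:
            time.sleep(interval_seconds)
    return paths


# Hypothetical usage against a running ElasticSearch JVM with process ID 12345:
# capture_thread_dumps(12345, "/tmp/es-dumps")
```

Practicing this on a healthy node first would reduce the risk of delaying service restoration during a real outage. A system-call trace with `strace -f -p <pid>` could be collected alongside the dumps.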