You're reading from Monitoring Elasticsearch

Product type: Book
Published in: Jul 2016
Publisher: Packt
ISBN-13: 9781784397807
Edition: 1st Edition
Authors (3):
Dan Noble

About the Author

Dan is a software engineer with a passion for writing secure, clean, and articulate code. He enjoys working with a variety of programming languages and software frameworks, particularly Python, Elasticsearch, and frontend technologies. Dan currently works on geospatial web applications and data processing systems. Dan has been a user and advocate of Elasticsearch since 2011. He has given talks about Elasticsearch at various meetup groups, and is the author of the Python Elasticsearch client “rawes.” Dan was also a technical editor for the Elasticsearch Cookbook, Second Edition, by Alberto Paro (ISBN: 1783554835).

Acknowledgements

I would like to thank my beautiful wife, Julie, for putting up with me while I wrote this book. Thanks for supporting me every step of the way. I would also like to thank my friends and colleagues James Cubeta, Joe McMahon, and Mahmoud Lababidi, who shared their insight, time, and support. I would like to give a special thanks to Abe Usher – you have been an incredible mentor over the years. Finally, thanks to everyone at Packt Publishing for helping to make this book happen. A special thanks to Merint Mathew, Sonali Vernekar, Husain Kanchwala, and Amey Varangaonkar for your valuable and careful feedback.

Chapter 7. Node Failure and Post-Mortem Analysis

In the previous chapter, we worked through case studies with real-world examples to learn how to troubleshoot common Elasticsearch performance and reliability issues. This chapter explores some common causes of node and cluster failures. Specific topics covered are as follows:

  • How to determine the root cause of a failure

  • How to take corrective action for node failures

  • Case studies with real-world examples of diagnosing system failures

Diagnosing problems


Elasticsearch node failures can manifest in many different ways. Some of the symptoms of node failures are as follows:

  • A node crashes during heavy data indexing

  • The Elasticsearch process stops running for an unknown reason

  • A cluster won't recover from a yellow or red state

  • Query requests time out

  • Index requests time out

When a node in your cluster experiences problems such as these, it can be tempting to just restart Elasticsearch or the node itself and move on as if nothing had happened. However, unless you address the underlying issue, the problem is likely to resurface. If you encounter scenarios such as the ones just listed, check the health of your cluster in the following manner:

  • Check the cluster health with Elasticsearch-head or Kopf

  • Check the historical health with Marvel

  • Check for Nagios alerts

  • Check Elasticsearch log files

  • Check system log files

  • Check the system health using command-line tools

These steps will help diagnose the root cause of problems in your cluster...
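As a complement to the graphical tools listed above, the first check can also be scripted against Elasticsearch's cluster health endpoint (`GET /_cluster/health`). The following is a minimal sketch, not the book's own method: the function names `fetch_cluster_health` and `summarize_health` are hypothetical, and the address `http://localhost:9200` is an assumption about where a node is listening.

```python
import json
from urllib.request import urlopen


def fetch_cluster_health(base_url="http://localhost:9200", timeout=5):
    """Return the parsed JSON body of GET /_cluster/health.

    base_url is an assumed address; point it at any node in your cluster.
    """
    with urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def summarize_health(health):
    """Turn a cluster health document into a list of findings.

    An empty list means none of the checks below flagged a problem.
    """
    findings = []
    status = health.get("status", "unknown")
    if status != "green":
        findings.append(f"cluster status is {status}")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned:
        findings.append(f"{unassigned} unassigned shard(s)")
    if health.get("timed_out"):
        findings.append("health request timed out")
    return findings
```

Calling `summarize_health(fetch_cluster_health())` against a live node returns the findings for that cluster. Keeping the parsing separate from the HTTP call means the same checks can be run on a health document saved during an earlier incident.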

Reviewing some case studies


This section discusses some real-world scenarios of Elasticsearch node failure and how to address them.

The ES process quits unexpectedly

A few weeks ago, we noticed in Marvel that the Elasticsearch process was down on one of our nodes. We restarted Elasticsearch on this node, and everything seemed to return to normal. However, when we checked Marvel later that week, the node was down again. We looked at the Elasticsearch log files but did not find any exceptions. Because nothing appeared in the Elasticsearch log, we suspected that the operating system had killed Elasticsearch. Checking syslog at /var/log/syslog, we found the following error:

Out of memory: Kill process 5969 (java) score 446 or sacrifice child

This confirmed that the operating system killed Elasticsearch because the system was running out of memory. We checked the Elasticsearch configuration and did not see any issues. This node was configured in the same way as the other nodes in the cluster...
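The syslog check in this case study can be automated with a small script. The sketch below scans log lines for the out-of-memory message in the form quoted above and reports the process IDs that were killed. The helper name `find_oom_kills` is hypothetical, and the pattern is an assumption based only on the message shown here; the exact wording can differ between kernel versions, so adjust the expression to match your own logs.

```python
import re

# Matches the kernel message format quoted in the case study, for example:
#   Out of memory: Kill process 5969 (java) score 446 or sacrifice child
OOM_KILL = re.compile(r"Out of memory: Kill process (\d+) \(([^)]+)\)")


def find_oom_kills(lines, process_name="java"):
    """Return (pid, line) pairs for OOM kills of the given process name.

    Elasticsearch runs on the JVM, so its process appears as "java".
    """
    kills = []
    for line in lines:
        match = OOM_KILL.search(line)
        if match and match.group(2) == process_name:
            kills.append((int(match.group(1)), line.rstrip("\n")))
    return kills
```

To scan the file mentioned in the text, open `/var/log/syslog` and pass the file object to `find_oom_kills`; reading that file typically requires elevated privileges.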

Summary


This chapter looked into how to diagnose node failures, determine the root cause of the problem, and apply corrective action. Some key things we learned are:

  • Many errors, from shard failures to slow query performance, are caused by OutOfMemoryError exceptions

  • Running out of disk space on one node can cause other nodes to run out of disk space as well when shards are reallocated

  • Running Elasticsearch alongside other services that require a lot of memory can result in the operating system killing Elasticsearch to free up memory

The next chapter will talk about Elasticsearch 5.0, the next major release of the platform, and it will give you an overview of the various new monitoring tools that will accompany the Elasticsearch 5.0 release.
