Managing Alerts

by Andrea Dalle Vacche, Stefano Kewan Lee | December 2013 | Open Source

Checking conditions and raising alarms are the most characteristic functions of any monitoring system, and Zabbix is no exception. What really sets Zabbix apart is that every alarm condition, or trigger, as it is known in this system, can be tied not only to a single measurement but also to an arbitrarily complex calculation based on all of the data available to the Zabbix server. Furthermore, just as triggers are independent from items, the actions that the server can take based on trigger status are independent from any single trigger, as you will see in the subsequent sections.

In this article written by Andrea Dalle Vacche and Stefano Kewan Lee, authors of the book Mastering Zabbix, you will learn the following things about triggers and actions:

  • How to create complex, intelligent triggers
  • How to minimize the possibility of false positives
  • How to set up Zabbix to take automatic actions based on the trigger status
  • How to rely on escalating actions

An efficient, correct, and comprehensive alerting configuration is a key to the success of a monitoring system. It's based on an extensive data collection and eventually leads to managing messages, recipients, and delivery media, as we'll see later in this article. But all this revolves around the conditions defined for the checks, and this is the main business of triggers.

Understanding trigger expressions

Triggers are quite simple to create and configure—choose a name and a severity, define a simple expression using the expression form, and you are done. The expression form, accessible through the Add button, lets you choose an item, a function to perform on the item's data, and some additional parameters, and gives an output as shown in the following screenshot:

You can see how there's a complete item key specification, not just the name, to which a function is applied. The result is then compared to a constant using the greater than operator. The syntax for referencing the item keys is very similar to that of a calculated item. In addition to this basic way of referring to item values, triggers also add a comparison operator that wraps all the calculations up to a Boolean expression. This is the one great unifier of all triggers; no matter how complex the expression, it must always return either a True value or a False value. This value is of course directly related to the state of a trigger, which can only be OK if the expression evaluates to False, or PROBLEM, if the expression evaluates to True. There are no intermediate or soft states for triggers.

A trigger can also be in an UNKNOWN state if it's impossible to evaluate the trigger expression (because one of the items has no data, for example).

A trigger expression has two main components:

  • Functions applied to the item data
  • Arithmetical and logical operations performed on the functions' results.

From a syntactical point of view, the item and function component has to be enclosed in curly brackets, as illustrated in the preceding screenshot, while the arithmetical and logical operators stay outside the brackets.
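
To make the split between the two components concrete, here is a minimal Python sketch (not anything that ships with Zabbix) that pulls the curly-bracket components out of an expression and leaves the operators behind. The regular expression and the function name are our own assumptions for illustration; the pattern ignores node prefixes and macros, and the sample expression is hypothetical.

```python
import re

# Hypothetical pattern for the {host:key.function(parameter)} form described above.
# It does not handle node prefixes or macros such as {TRIGGER.VALUE}.
COMPONENT = re.compile(
    r"\{(?P<host>[^:{}]+):(?P<key>[^{}]+)\.(?P<func>\w+)\((?P<param>[^(){}]*)\)\}"
)

def split_expression(expression):
    """Return the item/function components and the operator skeleton around them."""
    components = [match.groupdict() for match in COMPONENT.finditer(expression)]
    skeleton = COMPONENT.sub("{}", expression)
    return components, skeleton

parts, skeleton = split_expression("{Alpha:system.cpu.util[,idle].min(600)}<10")
print(parts)
print(skeleton)
```

Everything captured in `parts` lives inside the brackets; whatever remains in `skeleton` is the arithmetical and logical part of the trigger.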

Selecting items and functions

You can reference as many items as you want in a trigger expression, as long as you apply a single function to every single item. This means that if you want to use the same item twice, you'll need to specify it twice completely, as shown in the following code:

{Alpha:log[/tmp/operations.log,,,10,skip].nodata(600)}=1 | {Alpha:log[/tmp/operations.log,,,10,skip].str(error)}=1

This trigger will evaluate to PROBLEM if there are no new lines in the operations.log file for more than ten minutes, or if an error string is found in the lines appended to that same file.

Zabbix doesn't apply short-circuit evaluation of the and and or ( & and | ) operators—every comparison will be evaluated regardless of the outcome of the preceding ones.

Of course, you don't have to reference items from the same host; you can reference different items from different hosts, and on different nodes too, if you can access them, as shown in the following code:

{Node1:Alpha:agent.ping.last(0)}=0 & {Node2:Beta:agent.ping.last(0)}=0

Here, the trigger will evaluate to PROBLEM if both the hosts Alpha and Beta are unreachable. It doesn't matter that the two hosts are monitored by two different nodes. Everything will work as expected as long as the node where the trigger is defined has access to the two monitored hosts' historical data. In other words, the trigger has to be configured on a master node that receives data from both Node1 and Node2.

You can apply all the same functions available for calculated items to your items' data. The complete list and specification is available on the official Zabbix wiki (https://www.zabbix.com/documentation/2.0/manual/appendix/triggers/functions), so it would be redundant to repeat them here, but a few common aspects among them deserve a closer look.

Choosing between seconds or number of measurements

Many trigger functions take a sec or #num argument. This means that you can either specify a time period in seconds, or a number of measurements, and the trigger will take all of the item's data in the said period, and apply the function to it. So, the following code will take the minimum value of Alpha's CPU idle time in the last ten minutes:

{Alpha:system.cpu.util[,idle].min(600)}

The following code, unlike the previous one, will perform the same operation on the last ten measurements:

{Alpha:system.cpu.util[,idle].min(#10)}

Instead of a value in seconds, you can also specify things such as 10m for ten minutes, 2d for two days, and 6h for six hours.
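
As a quick illustration of what these suffixes stand for, the following Python sketch converts a period to seconds. The function name and the table are ours, and only the three suffixes mentioned above are covered; it is not a description of how Zabbix parses them internally.

```python
# Multipliers for the suffixes mentioned in the text.
UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}

def to_seconds(period):
    """Convert a period such as '10m' or a plain '600' to a number of seconds."""
    period = period.strip()
    if period and period[-1] in UNIT_SECONDS:
        return int(period[:-1]) * UNIT_SECONDS[period[-1]]
    return int(period)

print(to_seconds("10m"))  # 600
print(to_seconds("6h"))   # 21600
print(to_seconds("2d"))   # 172800
```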

Which one should you use in your triggers? While it obviously depends on your specific needs and objectives, each one has strengths that make it useful in the right context. For all kinds of passive checks initiated by the server, you'll often want to stick to a time period expressed as an absolute value. A #5 parameter will vary quite dramatically as a time period if you vary the check interval of the related item, and it's not usually obvious that such a change will also affect related triggers. Moreover, a time period expressed in seconds may be closer to what you really mean to check, and thus may be easier to understand when you revisit the trigger definition at a later date. On the other hand, you'll often want to opt for the #num version of the parameter for many active checks, where there's no guarantee that you will have a constant, reliable interval between measurements. This is especially true for trapper items of any kind, and for log files. With these kinds of items, referencing the number of measurements is often the best option.
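
The effect of the check interval on a #num parameter can be sketched with a few lines of Python. The helper names and the synthetic history are assumptions made up for this example; the point is only that the last five measurements cover very different amounts of time depending on how often the item is polled.

```python
def make_history(interval, count):
    """Build (timestamp, value) pairs taken every `interval` seconds, oldest first."""
    return [(i * interval, 100 - i) for i in range(count)]

def min_over_seconds(history, seconds):
    """Minimum of the values collected in the last `seconds` seconds, in the spirit of min(600)."""
    newest = history[-1][0]
    return min(value for stamp, value in history if stamp > newest - seconds)

def min_over_count(history, count):
    """Minimum of the last `count` values, in the spirit of min(#10)."""
    return min(value for _, value in history[-count:])

def seconds_spanned(history, count):
    """Time between the oldest and newest of the last `count` measurements."""
    stamps = [stamp for stamp, _ in history[-count:]]
    return stamps[-1] - stamps[0]

every_minute = make_history(interval=60, count=30)
every_five_minutes = make_history(interval=300, count=30)
print(seconds_spanned(every_minute, 5))        # 240
print(seconds_spanned(every_five_minutes, 5))  # 1200
```

Changing the interval from one minute to five silently turns a #5 window of four minutes into one of twenty, while a window given in seconds stays the same.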

Date and time functions

All the functions that return a time value, whether it's the current date, the current time, the day of the month, or the day of the week, still need a valid item as part of the expression. These can be useful to create triggers that may change status only during certain times of day, or during certain specific days, or better yet, to define well-known exceptions to common triggers, when we know that some otherwise unusual behavior is to be expected.

For example, consider a case where a bug in one of your company's applications causes a rogue process to quickly fill up a filesystem with huge log files. While the development team is working on it, they ask you to keep an eye on that filesystem, and kill the process if it's filling the disk up too quickly. As with many things in Zabbix, there's more than one way to approach this problem, but you decide to keep it simple and find that, after watching the trending data on the host's disk usage, a good indicator that the process is going rogue is that the filesystem usage has grown by more than 3 percent in 10 minutes:

{Alpha:vfs.fs.size[/var,pused].delta(600)}>3

The only problem with this expression is that there's a completely unrelated process that makes a couple of big file transfers to this same filesystem every night at 2 a.m. While this is a perfectly normal operation, it could still make the trigger switch to a PROBLEM state and send an alert. Adding a couple of time functions will take care of that, as shown in the following code:

{Alpha:vfs.fs.size[/var,pused].delta(600)}>3 & ({Alpha:vfs.fs.size[/var,pused].time(0)}<020000 | {Alpha:vfs.fs.size[/var,pused].time(0)}>030000 )

Just keep in mind that all the trigger functions return a numerical value, including the date and time ones, so it's not really practical to express fancy dates like "the first Tuesday of the month" or "last month" (instead of the last 30 days).
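
To see why plain numeric comparisons are enough for a time window, here is a small Python sketch of the HHMMSS encoding used in the comparisons above. The function names are hypothetical, and the code only mirrors the two time(0) conditions of the example; it is not how the server evaluates them.

```python
from datetime import time

def hhmmss(moment):
    """Encode a time of day as the integer HHMMSS, the form the trigger compares against."""
    return moment.hour * 10000 + moment.minute * 100 + moment.second

def outside_transfer_window(moment):
    """True outside 02:00:00 to 03:00:00, mirroring the two time(0) comparisons."""
    value = hhmmss(moment)
    return value < 20000 or value > 30000

print(hhmmss(time(10, 30, 0)))                # 103000
print(outside_transfer_window(time(2, 30)))   # False
print(outside_transfer_window(time(14, 0)))   # True
```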

Trigger severity

Severity is little more than a simple label that you attach to a trigger. The web frontend will display different severity values with different colors, and you will be able to create different actions based on them, but they have no further meaning or function in the system. This means that the severity of a trigger will not change over time based on how long that trigger has been in a PROBLEM state, nor can you assign a different severity to different thresholds in the same trigger. If you really need a warning alert when a disk is over 90 percent full, and a critical alert when it's 100 percent full, you will need to create two different triggers with two different thresholds and severities. This may not be the best course of action though, as it could lead to warnings that are ignored and not acted upon, to critical alerts that fire when it's already too late and you have already lost service availability, or simply to a redundant configuration with redundant messages, more possibilities for mistakes, and a lower signal-to-noise ratio.

A better approach would be to clearly assess the actual severity of the possibility for the disk to fill up, and create just one trigger with a sensible threshold, and possibly an escalating action if you fear that the warning could get lost among the others.

Choosing between absolute values and percentages

If you look at many native agent items, you'll see that a lot of them can express measurements either as absolute values or as percentages. It often makes sense to do the same while creating one's own custom items, as both representations can be quite useful in and of themselves. When it comes to creating triggers on them though, the two can differ quite a lot, especially if your task is to keep track of available disk space.

Filesystems' sizes and disk usage patterns vary quite a lot between different servers, installations, application implementations, and user engagement. While a free space of 5 percent of a hypothetical disk A could be small enough that it would make sense to trigger a warning and act upon it, the same 5 percent could mean a lot more space for a large disk array, enough that you don't really need to act immediately, but can plan a possible expansion without any urgency. This may lead you to think that percentages are not really useful in these cases, and even that you can't really put disk-space-related triggers in templates, as it would be better to evaluate every single case and build triggers that are tailor-made for every particular disk with its particular usage pattern. While this can certainly be a sensible course of action for particularly sensitive and critical filesystems, it can quickly become too much work in a large environment where you may need to monitor hundreds of different filesystems.

This is where the delta function can help you create triggers that are general enough that you can apply them to a wide variety of filesystems, so that you can still get a sensible warning about each one of them. You will still need to create more specialized triggers for those special, critical disks, but you'd have to anyway.

While it's true that the same percentages may mean quite a different thing for disks with a great difference in size, a similar percentage variation of available space on a different disk could mean quite the same thing; the disk is filling up at a rate that can soon become a problem:

{Template_fs:vfs.fs.size[/,pfree].last(0)}<5 | ({Template_fs:vfs.fs.size[/,pfree].delta(1d)} / {Template_fs:vfs.fs.size[/,pfree].last(0,1d)} > 0.5)

This trigger would report a PROBLEM state not just if the available space is less than 5 percent on a particular disk, but also if the available space has been reduced by more than half in the last 24 hours (don't miss the time-shift parameter in the last function). This means that no matter how big the disk is, its usage pattern suggests that it could fill up very soon. Note also how the trigger would need progressively smaller and smaller percentage drops for it to assume a PROBLEM state, so you'd automatically get more frequent and urgent notifications as the disk is filling up.
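
The arithmetic behind that description can be worked through in a few lines of Python. This is only a sketch under stated assumptions: the function name and sample figures are invented, delta is taken to be the difference between the highest and lowest value in the period, and, following the description above, either condition alone is enough to report a problem.

```python
def disk_trigger(pfree_last_day, pfree_one_day_ago):
    """Evaluate the two conditions on free-space percentages.

    pfree_last_day: samples from the last 24 hours, oldest first.
    pfree_one_day_ago: the value the item had a day ago (assumed non-zero).
    """
    last = pfree_last_day[-1]
    delta = max(pfree_last_day) - min(pfree_last_day)
    shrinking_fast = delta / pfree_one_day_ago > 0.5
    return last < 5 or shrinking_fast

print(disk_trigger([60, 45, 28], 60))    # True: 32 points lost out of 60
print(disk_trigger([60, 58, 55], 60))    # False: only 5 points lost out of 60
print(disk_trigger([12, 9, 5.5], 12))    # True: 6.5 points lost out of 12
```

The last two calls show the progressive effect: a loss of five or six points is harmless on a mostly empty disk, but enough to raise the alarm once little space is left.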

For these kinds of checks, percentage values should prove more flexible and easy to understand than absolute ones, so that's what you probably want to use as a baseline for templates. On the other hand, absolute values may be your best option if you want to create a very specific trigger for a very specific filesystem.

Understanding operations as correlations

As you may have already realized, practically every interesting trigger expression is built as a logical operation between two or more simpler expressions. Of course, this is not the only way to create useful triggers. Many simple checks on the status of an agent.ping item can literally save the day when quickly acted upon, but Zabbix also makes it possible, and relatively easy, to define powerful checks that would require a lot of custom coding to implement in other systems. Let's see a few more examples of relatively complex triggers.

Going back to the date and time functions, let's say that you have a trigger that monitors the number of active sessions in an application, and fires up an alert if that number drops too low during certain hours, because you know that there should always be a few automated processes creating and using sessions in that window of time (from 10:30 to 12:30 in this example). During the rest of the day, the number of sessions is neither predictable nor that significant, so you keep sampling it but don't want to receive any alert. A first, simple version of your trigger could look like the following code:

{Appserver:sessions.active[myapp].min(300)}<5 & {Appserver:sessions.active[myapp].time(0)} > 103000 & {Appserver:sessions.active[myapp].time(0) } < 123000

The sessions.active item key could reference a custom script, a calculated item, or anything else. It's used here as a label to make the example easier to read, and not as an instance of an actual ready-to-use native item.

The only problem with this trigger is that if the number of sessions drops below five in that window of time, but it doesn't come up again until after 12:30, the trigger will stay in the PROBLEM state until the next day. This may be a great nuisance if you have set up multiple actions and escalations on that trigger, as they would go on for a whole day no matter what you do to address the actual sessions problems. But even if you don't have escalating actions, you may have to give accurate reports on these event durations, and an event that looks like it's going on for almost 24 hours would be both incorrect in itself and misleading for any SLA reporting. Even if you don't have reporting concerns, displaying a PROBLEM state when it's not there anymore is a kind of false positive that will not let your monitoring team focus on the real problems, and over time, may reduce their attention on that particular trigger.

A possible solution is to make the trigger return to an OK state outside the target hours, if it was in a PROBLEM state, as shown in the following code:

({Appserver:sessions.active[myapp].min(300)}<5 &
{Appserver:sessions.active[myapp].time(0)} > 103000 &
{Appserver:sessions.active[myapp].time(0)} < 123000) |
({TRIGGER.VALUE}=1 &
{Appserver:sessions.active[myapp].min(300)}<0 &
({Appserver:sessions.active[myapp].time(0)} < 103000 |
{Appserver:sessions.active[myapp].time(0)} > 123000))

The first three lines are identical to the trigger defined before. This time there is one more complex condition, as follows:

  • The trigger is in a PROBLEM state (see the note about the TRIGGER.VALUE macro)
  • The number of sessions is less than zero (this can never be true)
  • We are outside the target hours (the last two lines are the opposite of those defining the time frame preceding it)

    The TRIGGER.VALUE macro represents the current value of the trigger expressed as a number. A value of 0 means OK, 1 means PROBLEM, and 2 means UNKNOWN. The macro can be used anywhere you could use an item.function pair, so you'll typically enclose it in curly brackets. As you've seen in the preceding example, it can be quite useful when you need to define different thresholds and conditions depending on the trigger's status itself.

The condition about the number of sessions being less than zero makes sure that outside the target hours, if the trigger was in a PROBLEM state, the whole expression will evaluate to false anyway. False means the trigger is switching to an OK state.

Here, you have not only made a correlation between an item value and a window of time to generate an event, but you have also made sure that the event will always spin down gracefully instead of potentially going out of control.

Another interesting way to build a trigger is to combine different items from the same hosts, or even different items from different hosts. This is often used to spot incongruities in your systems' state that would otherwise be very difficult to identify.

An obvious case could be that of a server that serves content over the network. Its overall performance parameters may vary a lot depending on a great number of factors, and so it would be very difficult to identify sensible trigger thresholds that wouldn't generate a lot of false positives, or even worse, missed events. What may be certain though is that if you see a high CPU load while network traffic is low, then you may have a problem, as shown in the following code:

{Alpha:system.cpu.load[all,avg5].last(0)} > 5 & {Alpha:net.if.total[eth0].avg(300)} < 1000000

An even better example would be the necessity to check for hanging or frozen sessions in an application. The actual way to do it would depend a lot on the specific implementation of the said application, but for illustrative purposes, let's say that a frontend component keeps a number of temporary session files in a specific directory, while the database component populates a table with the session data. Even if you have created items on two different hosts to keep track of these two sources of data, each number taken alone will certainly be useful for trending analysis and capacity planning, but they need to be compared to check if something's wrong in the application's workflow. Assuming that we have previously defined a local command on the frontend's Zabbix agent that will return the number of files in a specific directory, and an odbc item on the database host that will query the DB for the number of active sessions, we could then build a trigger that compares the two values and reports a PROBLEM state if they don't match:

{Frontend:dir.count[/var/sessions].last(0)} # {Database:sessions.count.last(0)}

The # term in the expression is the not equal operator.

Aggregated and calculated items can also be very useful for building effective triggers. The following one will make sure that the ratio between active workers and the available servers doesn't drop too low in a grid or cluster:

{ZbxMain:grpsum["grid", "proc.num[listener]", last, 0].last(0)} / {ZbxMain:grpsum["grid", "agent.ping", last, 0].last(0)} < 0.5

All these examples should help drive home the fact that once you move beyond checking for simple thresholds with single item values, and start correlating different data sources together in order to have more sophisticated and meaningful triggers, there is virtually no end to all the possible variations of trigger expressions that you can come up with.

By identifying the right metrics and combining them in various ways, you can pinpoint very specific aspects of your systems' behavior; you can check log files together with the login events and network activity to track down possible security breaches, compare a single server performance with the average server performance in the same group to identify possible problems in service delivery, and much more.

This is, in fact, one of Zabbix's best kept secrets that really deserves more publicity; its triggering system is actually a sophisticated correlation engine that draws its power from a clear and concise method to construct expressions as well as from the availability of a vast collection of both current and historical data. Spending a bit of your time studying it in detail, and coming up with interesting and useful triggers tailor-made for your needs will certainly pay you back tenfold, as you will end up not only with a perfectly efficient and intelligent monitoring system, but also with a much deeper understanding of your environment.

Managing the trigger dependencies

It's quite common that the availability of a service or a host doesn't depend only on the host itself, but also on the availability of any other machine that may provide connectivity to it. For example, if a router goes down and isolates an entire subnet, you would still get alerts about all the hosts in that network, which will suddenly be seen as unavailable from Zabbix's point of view, even if it's really the router's fault. A dependency relationship between the router and the hosts behind it would help alleviate the problem, because it would make the server skip any trigger check for the hosts in the subnet, should the router become unavailable.

While Zabbix doesn't support the kind of host-to-host dependencies that other systems do, it does have a trigger-to-trigger dependency feature that can largely perform the same function. For every trigger definition, you can specify a different trigger upon which your new trigger is dependent. If the parent trigger is in a PROBLEM state, the trigger you are defining won't be checked until the parent returns to an OK state.

This approach is certainly quite flexible and powerful, but it also has a couple of downsides. The first one is that one single host can have a significant number of triggers, so if you want to define a host-to-host dependency, you'll need to update every single trigger, which may prove to be quite a cumbersome task. You can, of course, rely on the mass update feature of the web frontend as a partial workaround. A second problem is that you won't be able to look at a host definition and see that there is a dependency relationship with another host. Short of looking at a host's trigger configuration, there's simply no easy way to display or visualize this kind of relationship in Zabbix.

A distinct advantage of having a trigger-level dependency feature is that you can define dependencies between single services on different hosts. As an example, you could have a database that serves a bunch of web applications on different web servers. If the database is unavailable, none of the related websites will work, so you may want to set up a dependency between the web monitoring triggers and the availability of the database. On the same servers, you may also have some other service that relies on a separate license server, or an identity and authentication server. You could then set up the appropriate dependencies, so that you could end up having some triggers depend on the availability of one server, and other triggers depend on the availability of another one, all in the same host. While this kind of configuration can easily become quite complex and difficult to maintain efficiently, a select few, well-placed dependencies can help cut down the amount of redundant alerts in a large environment. This in turn would help you focus immediately on the real problems where they arise, instead of having to hunt them down in a long list of trigger alerts.
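
The effect of a trigger-level dependency on the alerts you receive can be sketched in Python as follows. The data structures, trigger names, and function are invented for illustration and only model a single level of dependency; they say nothing about how the Zabbix server implements the feature.

```python
def triggers_to_alert(states, depends_on):
    """Return the triggers in PROBLEM whose parent trigger, if any, is not itself in PROBLEM.

    states: trigger name -> "PROBLEM" or "OK".
    depends_on: trigger name -> name of the trigger it depends on.
    """
    alerts = []
    for name, state in states.items():
        if state != "PROBLEM":
            continue
        parent = depends_on.get(name)
        if parent is not None and states.get(parent) == "PROBLEM":
            continue  # suppressed: the parent already explains the outage
        alerts.append(name)
    return alerts

states = {
    "router unreachable": "PROBLEM",
    "web1 unreachable": "PROBLEM",
    "web2 unreachable": "PROBLEM",
    "db1 disk filling up": "PROBLEM",
}
depends_on = {
    "web1 unreachable": "router unreachable",
    "web2 unreachable": "router unreachable",
}
print(triggers_to_alert(states, depends_on))
# ['router unreachable', 'db1 disk filling up']
```

Four triggers are in a PROBLEM state, but only two of them, the router and the unrelated disk problem, would compete for your attention.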

Taking action

Just as items only provide raw data, and triggers are independent from them as they can access virtually any item's historical data, triggers in turn only provide a status change. This change is recorded as an event, just like measurements are recorded as item data. This means that triggers don't provide any reporting functionality; they just check their conditions and change the status accordingly. Once again, what may seem like a limitation and a lack of power turns out to be the exact opposite, as the Zabbix component in charge of actually sending out alerts, or trying to automatically resolve some problems, is completely independent from triggers. This means that just like triggers can access any item's data, actions can access any trigger's name, severity, or status, so that once again you can create the perfect mix of very general and very specific actions, without being stuck in a one-action-per-trigger scheme.

Unlike triggers, actions are also completely independent from hosts and templates. Every action is always globally defined, and its conditions are checked against every single Zabbix event. As you'll see in the following paragraphs, this may force you to create some explicit conditions, instead of implicit conditions, but that's balanced out by the fact that you won't have to create similar but different actions for similar events just because they are related to different hosts.

An action is composed of the following three different parts that work together to provide all the functionality needed:

  • Action definition
  • Action conditions
  • Action operations

The fact that every action has a global scope is reflected in every one of its components, but it assumes a critical importance when it comes to action conditions, as it's the place where you decide which action should be executed based on which events. But let's not get ahead of ourselves and let's see a couple of interesting things about each component in turn.

Defining an action

This is where you decide a name for the action, and can define a default message that can be sent as a part of the action itself. In the message, you can reference specific data about the event, such as the host, item and trigger name, item and trigger values, and URLs. Here, you can leverage the fact that actions are global by using macros, so that a single action definition could be used for every single event in Zabbix, and yet provide useful information in its message.

You can see a few interesting macros already present in the default message when you create a new action, as shown in the following screenshot:

Most of them are pretty self-explanatory, but it's interesting to see how you can of course reference a single trigger, the one that generated the event. On the other hand, as a trigger can check on multiple items from multiple hosts, you can reference all the hosts and items involved (up to nine different hosts and/or items) so that you can get a picture of what's happening by just reading the message.
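
The way a single message template can serve many different events is easy to picture with a small Python sketch of macro substitution. The function and the sample values are made up for this example, and the macro names are assumed to be among those listed in the Zabbix 2.0 documentation; the real expansion is of course performed by the server.

```python
def expand_macros(template, values):
    """Replace each {MACRO} found in `values`; unknown macros are left untouched."""
    for macro, value in values.items():
        template = template.replace("{" + macro + "}", value)
    return template

template = "{TRIGGER.STATUS}: {TRIGGER.NAME} on {HOST.NAME1} at {EVENT.DATE} {EVENT.TIME}"
event_values = {  # invented sample data
    "TRIGGER.STATUS": "PROBLEM",
    "TRIGGER.NAME": "Disk filling up",
    "HOST.NAME1": "Alpha",
    "EVENT.DATE": "2013.12.01",
    "EVENT.TIME": "02:15:00",
}
print(expand_macros(template, event_values))
# PROBLEM: Disk filling up on Alpha at 2013.12.01 02:15:00
```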

Other interesting macros can make the message even more useful and expressive. Just remember that the default message may be sent not only via e-mail, but also via a chat program or an SMS; you'll probably want to create different default actions with different messages for different media types, so that you can calibrate the amount of information provided based on the media available.

You can see the complete list of supported macros in the official documentation wiki at https://www.zabbix.com/documentation/2.0/manual/appendix/macros/supported_by_location, so we'll look at just a couple of the most interesting ones.

The {EVENT.DATE} and {EVENT.TIME} macros

These two macros can help differentiate between the time a message is sent and the time of the event itself. They are particularly useful not only for repeated or escalated actions, but also for all media where a timestamp is not immediately apparent.

The {INVENTORY.SERIALNO.A} and friends macros

When it comes to hardware failure, information about a machine's location, admin contact, serial number, and so on, can prove quite useful to track it down quickly, or to pass it on to external support groups.

The {NODE.ID} and {NODE.NAME} macros

These are quite important in a distributed architecture. If you have thousands of monitored hosts managed by different nodes, it may not always be immediately apparent which host is monitored by which node. These macros can really help avoid wasting time just looking for the right node in order to investigate the event.

Defining the action conditions

This part lets you define conditions based on the event's hosts, trigger, and trigger values. Just like trigger expressions, you can combine different simple conditions with a series of AND/OR logical operators, as shown in the following screenshot. Unlike trigger expressions, there is not much flexibility in how you combine them. You can either have all AND, all OR, or a combination of the two, where conditions of different types are combined with AND, while conditions of the same type are combined with OR:

Observe how one of the conditions is Trigger value = "PROBLEM". Since actions are evaluated for every event, and a trigger switching from PROBLEM to OK is an event in itself, if you don't specify this condition, the action will be executed both when the trigger switches to PROBLEM and when the trigger switches back to OK. Depending on how you have constructed your default message and what operations you intend to do with your actions, this may very well be what you intended, and Zabbix will behave exactly as expected.

Anyway, if you created a different recovery message in the Action definition form, and you forget the condition, you'll get two messages when a trigger switches back to OK—one will be the standard message, and one will be the recovery message. This can certainly be a nuisance as any recovery message would be effectively duplicated, but things can get ugly if you are relying on external commands as part of the action's operations. If you forget to specify the condition Trigger value = "PROBLEM", the external, remote command would also be executed twice; once when the trigger switches to PROBLEM (this is what you intended), and once when it switches back to OK (this is quite probably not what you intended). Just to be on the safe side, and if you don't have very specific needs for the action you are configuring, it's probably better if you get into the habit of putting Trigger value = "PROBLEM" for every new action you create, or at least check that it's present in the actions you modify.
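
The mixed AND/OR rule, and the effect of the Trigger value condition, can be illustrated with a short Python sketch. The condition types, field names, and sample events are hypothetical; the code only models the rule stated above, where conditions of the same type are combined with OR and different types with AND.

```python
from collections import defaultdict

def action_matches(conditions, event):
    """Evaluate (type, expected) conditions: same type -> OR, different types -> AND."""
    expected_by_type = defaultdict(list)
    for cond_type, expected in conditions:
        expected_by_type[cond_type].append(expected)
    return all(
        event.get(cond_type) in expected_values
        for cond_type, expected_values in expected_by_type.items()
    )

conditions = [
    ("host_group", "DB Instances"),
    ("host_group", "Web Servers"),
    ("trigger_value", "PROBLEM"),
]
problem = {"host_group": "DB Instances", "trigger_value": "PROBLEM"}
recovery = {"host_group": "DB Instances", "trigger_value": "OK"}
print(action_matches(conditions, problem))    # True
print(action_matches(conditions, recovery))   # False
print(action_matches(conditions[:2], recovery))  # True: without the condition, the recovery event matches too
```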

The most typical application for creating different actions with different conditions is to send alert and recovery messages to different recipients. This is the part where you should remember that actions are global.

Let's say that you want all the database problems sent over to the DB Administrators group and not the default Zabbix Administrators group. If you just create a new action with the condition that the host group must be DB Instances, and as message recipients, choose your DB Admins, what will happen is that they will certainly receive a message for any DB related event, but so will your Zabbix Admins, if the default action has no conditions configured. The reason is that since actions are global, they are always executed whenever their conditions evaluate to True. In this case, both the specific action and the default one would evaluate to True, so both groups would receive a message. What you could do is add an opposite condition in the default action so that it would be valid for every event, except for those related to the DB Instances host group. The problem is that this approach can quickly get out of control, and you may find yourself with a default action full of the not in group conditions. Truth is, once you start creating actions specific for message recipients, you either disable the default action or take advantage of it to populate a message archive for administration and reporting purposes.
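
A tiny Python model makes the consequence of global actions easy to see. The dictionaries and the function below are assumptions made for this example, not Zabbix structures; each action simply carries a recipient and a condition that is checked against every event.

```python
def recipients_for(event, actions):
    """Check every action against the event and collect the recipients of all that match."""
    return [action["send_to"] for action in actions if action["condition"](event)]

default_action = {"send_to": "Zabbix Administrators", "condition": lambda event: True}
db_action = {
    "send_to": "DB Administrators",
    "condition": lambda event: event["host_group"] == "DB Instances",
}
# The same default action with the opposite condition added.
restricted_default = {
    "send_to": "Zabbix Administrators",
    "condition": lambda event: event["host_group"] != "DB Instances",
}

db_event = {"host_group": "DB Instances", "trigger_value": "PROBLEM"}
print(recipients_for(db_event, [default_action, db_action]))
# ['Zabbix Administrators', 'DB Administrators']
print(recipients_for(db_event, [restricted_default, db_action]))
# ['DB Administrators']
```

Every new specific action would require one more exclusion in the default one, which is exactly how the list of "not in group" conditions grows out of control.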

Choosing the action operations

If the first two parts were just preparation, this is where you tell the action what it should actually do. There are two main aspects to this:

  • Operation steps
  • The actual operations available for each step

As with almost everything else in Zabbix, the simplest cases are self-explanatory: you have a single step, and this step consists of sending the default message to a group of defined recipients. Also as with almost everything else in Zabbix, this simple scenario can become increasingly complex and sophisticated, yet still manageable, depending on your specific needs. Let's see a few interesting details about each part.

Steps and escalations

Even if an action is tied to a single event, this does not mean that it can perform only a single operation. In fact, it can perform an arbitrary number of operations, called steps, which can even go on for an indefinite amount of time, or until the conditions for performing the action are no longer valid.

You can use multiple steps both to send messages and to perform automated operations. You can also use the steps to send alert messages to different groups, or even multiple times to the same group, at the time intervals that you want, for as long as the event is unacknowledged or not yet resolved. The following screenshot shows a combination of different steps:

As you can see, step 1 starts immediately, is set to send a message to a user group, and then delays the subsequent step by just one minute. After one minute, step 2 starts and is configured to perform a remote command on the host. As step 2 has the default duration (which is defined in the main Action definition tab), step 3 will start after about an hour. Steps 3, 4, and 5 are all identical and have been configured together: they will send a message to a different user group every 10 minutes. You can't see it in the preceding screenshot, but step 6 will only be executed if the event is not yet acknowledged, just like step 7, which is still being configured. The other interesting bit about step 7 is that it's actually set to configure steps 7 to 0. It may seem counterintuitive, but in this case, step 0 simply means "forever". You can't really have further steps if you create a step N to 0, because the latter will repeat itself with the time interval set in the step's Duration(sec) field. Be very careful when using a step 0, because it will really go on until the trigger's status changes. Even then, if you didn't add a Trigger value = "PROBLEM" condition to your action, a step 0 can be executed even if the trigger has switched back to OK. In fact, it's probably best never to use step 0 at all, unless you really know what you are doing.
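To make the timing concrete, the start of each step can be worked out by adding up the durations of the steps before it. The following is a minimal sketch, assuming an action-wide default duration of 3600 seconds (consistent with the "about an hour" above, but an assumption here); step_start_times is a hypothetical helper, not part of Zabbix.

```python
from typing import Dict, List, Tuple

DEFAULT_STEP_DURATION = 3600  # assumed action-wide default, in seconds


def step_start_times(
    durations: Dict[int, int],
    last_step: int,
    default: int = DEFAULT_STEP_DURATION,
) -> List[Tuple[int, int]]:
    """Return (step number, seconds after the event) for steps 1..last_step.

    `durations` maps a step number to its custom duration in seconds.
    A missing or zero entry means the step uses the default duration.
    """
    schedule = []
    elapsed = 0
    for step in range(1, last_step + 1):
        schedule.append((step, elapsed))
        # Each step delays the next one by its own duration.
        elapsed += durations.get(step, 0) or default
    return schedule


# Step 1 lasts one minute, step 2 keeps the default, steps 3-5 last 10 minutes.
custom = {1: 60, 3: 600, 4: 600, 5: 600}

for step, start in step_start_times(custom, 6):
    print(f"step {step} starts {start // 60} min after the event")
```

With these numbers, step 2 starts after 1 minute, step 3 after 61 minutes, and steps 4, 5, and 6 follow at 10-minute intervals.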

Messages and media

For every message step, you can choose to send the default message that you configured in the first tab of the Action creation form, or send a custom message that you can craft in exactly the same way as the default one. You might want to add more details about the event if you are sending the message via e-mail to a technical group, or reduce the amount of detail or change the wording if you are sending it to a manager or supervisor, or if you are limiting the message to an SMS.

Remember that in the Action operation form, you can only choose Zabbix users and groups as recipients, while you still have to specify, for every user, the media addresses at which they can be reached. This is done in the Administration tab of the Zabbix frontend by adding media instances for every single user. You also need to keep in mind that every media channel can be enabled or disabled for a user, or it may be active only during certain hours of the day, or only for one or more specific trigger severities, as shown in the following screenshot:

This means that even if you configure an action to send a message, some recipients may still not receive it based on their own media configuration.
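The three per-user filters just described (enabled flag, active hours, and severities) can be modelled as a simple predicate. This is an illustrative sketch only: UserMedia and will_receive are hypothetical names, the address is a placeholder, and the single daily time window is a simplification of what the frontend lets you configure.

```python
from dataclasses import dataclass
from datetime import time
from typing import FrozenSet


@dataclass(frozen=True)
class UserMedia:
    """Toy model of one media entry attached to a user."""
    address: str
    enabled: bool
    active_from: time
    active_until: time
    severities: FrozenSet[str]


def will_receive(media: UserMedia, severity: str, at: time) -> bool:
    """True only if the media passes all three per-user filters."""
    if not media.enabled:
        return False
    if not (media.active_from <= at < media.active_until):
        return False
    return severity in media.severities


office_mail = UserMedia(
    address="dba@example.com",  # placeholder address
    enabled=True,
    active_from=time(8, 0),
    active_until=time(18, 0),
    severities=frozenset({"High", "Disaster"}),
)

print(will_receive(office_mail, "High", time(10, 30)))     # True
print(will_receive(office_mail, "Warning", time(10, 30)))  # False: severity filtered
print(will_receive(office_mail, "Disaster", time(23, 0)))  # False: outside active hours
```

An action that targets this user would therefore deliver nothing for a night-time Disaster unless the user also has a second media entry that covers those hours.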

While Email, Jabber, and SMS are the default options for sending messages, you still need to specify how Zabbix is supposed to send them. Again, this is done in the Media types section of the Administration tab of the frontend. You can also create new media types there, which will be made available both in the media section of user configuration and as targets for message sending in the Action operations form.

A new media type can be a different e-mail, Jabber, or SMS server, in case you have more than one and need to use them for different purposes or with different sender identifications. It can also be a script, and this is where things can become interesting, if potentially misleading.

A custom media script has to reside on the Zabbix server in the directory indicated by the AlertScriptsPath variable of zabbix_server.conf. When called upon, it will be executed with the following three parameters passed by the server:

  • $1: The recipient of the message
  • $2: The subject of the message
  • $3: The body of the main message

The recipient will be taken from the appropriate user-media property that you defined for your users when assigning them the new media type. The subject and the message body will be the default ones configured for the action, or some step-specific ones, as explained before. Then, from Zabbix's point of view, the script should send the message to the recipient by whatever custom method you intend to use, whether it's an old UUCP link, a modern mail server that requires strong authentication, or a post to an internal microblogging server. The fact is, you can do whatever you want with the message: you can simply log it to a directory, send it to a remote file server, morph it into a syslog entry and send it over to a log server, run a speech synthesis program on it and read it aloud on some speakers, or record a message on an answering machine. The sky's the limit with custom media types. This is why you should not confuse a custom media type with the execution of a remote command. While you could potentially obtain roughly the same results with one or the other, custom media scripts and remote commands are really two different things.
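As an illustration of the "simply log it" case, here is a minimal sketch of a custom media script that appends each message to a log file. The three positional arguments follow the list above; everything else is an assumption made for the example, including the log path, the entry format, and the function names.

```python
#!/usr/bin/env python3
import sys
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical log location; any file writable by the Zabbix server user would do.
DEFAULT_LOG = Path("/tmp/zabbix_alerts.log")


def format_entry(recipient, subject, body, now=None):
    """Build a single-line log entry from the three script parameters."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Collapse newlines and repeated spaces so one alert stays on one line.
    flat_body = " ".join(body.split())
    return f"{now.isoformat()} to={recipient} subject={subject!r} body={flat_body!r}"


def deliver(recipient, subject, body, log_path=DEFAULT_LOG):
    """'Send' the message by appending it to the log file."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(format_entry(recipient, subject, body) + "\n")


def main(argv):
    # argv[1], argv[2], argv[3] correspond to $1, $2, and $3.
    if len(argv) != 4:
        print(f"usage: {argv[0]} RECIPIENT SUBJECT BODY", file=sys.stderr)
        return 2
    deliver(argv[1], argv[2], argv[3])
    return 0


# Only act when arguments were supplied, so the file can also be imported.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Replacing the body of deliver is all it would take to post the same message somewhere else, which is the flexibility the custom media type is meant to offer.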

Remote commands

Remote commands are normally used to try to perform some corrective action in order to resolve a problem without human intervention. After you've chosen the target host that should execute the command, the Zabbix server will connect to it and ask it to run the command. If you are using the Zabbix agent as a communication channel, you'll need to set EnableRemoteCommands to 1, or the agent will refuse to execute any command. Other possibilities are SSH, telnet, or IPMI (if you compiled in the relevant options during server installation).
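Because a disabled agent silently defeats any remote command step, it can be worth checking the agent configuration before relying on one. The following is a rough sketch of such a check, assuming the usual Key=Value layout of the agent configuration file with # comments; remote_commands_enabled and the sample text are hypothetical and not part of any Zabbix tool.

```python
def remote_commands_enabled(conf_text: str) -> bool:
    """Report whether the given agent configuration text enables remote commands."""
    value = "0"  # when the parameter is absent, remote commands stay disabled
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, val = line.partition("=")
        if sep and key.strip() == "EnableRemoteCommands":
            value = val.strip()  # a later occurrence overrides an earlier one
    return value == "1"


# Sample configuration text; the server address is a documentation placeholder.
sample = """
# Remote commands are commented out here, so they stay disabled.
# EnableRemoteCommands=1
Server=192.0.2.10
"""

print(remote_commands_enabled(sample))                              # False
print(remote_commands_enabled(sample + "EnableRemoteCommands=1\n"))  # True
```

A check like this only covers the agent channel; commands sent over SSH, telnet, or IPMI are not affected by this parameter.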

Remote commands can be used to do almost anything: kill or restart a process, make space on a filesystem by zipping or deleting old files, reboot a machine, and so on. They tend to seem powerful and exciting to new implementers, but in the authors' experience, they tend to be fragile solutions that can break things almost as often as they fix them. It's harder than it looks to make them run safely, without accidentally deleting files or rebooting servers when there's no need to. The real problem with remote commands is that they tend to hide problems instead of revealing them, and revealing problems is the real job of a monitoring system. Yes, they can prove useful as a quick patch to ensure the smooth operation of your services, but use them too liberally and you'll quickly forget that there actually are recurring problems that need to be addressed, because some fragile command somewhere is trying to fix things in the background for you. It's usually better to really try to solve a problem than to just hide it behind some automated temporary fix, and not just from a philosophical point of view, but because when these patches fail, they tend to fail spectacularly and with disastrous consequences.

So our advice is to use remote commands very sparingly, and only if you know what you are doing.

Summary

This article focused on what is usually considered the "core business" of a monitoring system: its triggering and alerting features. Having looked separately at the two parts that contribute to this function, triggers and actions, you should now see how, once again, Zabbix's philosophy of separating all the different functions can give great rewards to the astute user. You should have learned how to create complex and sophisticated trigger conditions that will help you gain a better understanding of your environment and more control over which alerts you receive. The various triggering functions and options, some of the finer aspects of item selection, and the many aspects of action creation should no longer be a secret to you.

About the Authors:


Andrea Dalle Vacche

Andrea Dalle Vacche is a highly skilled IT Professional with over 12 years of industry experience. He graduated from Università degli Studi di Ferrara with an Information Technology certification. This laid the technology foundation, which Andrea has built on ever since. He has acquired various other industry-respected accreditations, which include Cisco, Oracle, RHCE, ITIL, and of course Zabbix. Throughout his career he has worked on many large-scale environments, often as a consultant in very complex roles. This has further enhanced his growing skill set, adding to his practical knowledge base and cementing his appetite for theoretical technical study. His love for Zabbix came from his time spent in the Oracle world as a Database Administrator/Developer. His time was spent mainly reducing "ownership costs" with specialization in monitoring and automation. This is where he came across Zabbix and the flexibility, both technical and administrative, that it offered. This inspired Andrea to develop Orabbix, the first open source software to monitor Oracle completely integrated with Zabbix.

Andrea has published a number of articles on Zabbix-related software such as DBforBIX. His projects are publicly available on his website http://www.smartmarmot.com. Currently, Andrea is working for a leading global investment bank in a very diverse and challenging environment. His involvement is vast and deals with many aspects of the Unix/Linux platforms as well as paying due diligence to many different kinds of third-party software, which are strategically aligned to the bank's technical roadmap.

Stefano Kewan Lee

Stefano Kewan Lee is an IT Consultant with 10 years of experience in system integration, security, and administration. He is a certified Zabbix specialist in Large Environments, holds a Linux administration certification from the LPI, and a GIAC GCFW certification from SANS Institute. When he's not busy breaking websites, he lives in the countryside with two cats and two dogs and practices martial arts.
