Saturday, May 21, 2011

Introduction to Zabbix

In spite of Zabbix celebrating its 10th anniversary in 2011 and being one of the best monitoring frameworks out there, it loses out very badly in the popularity contest. While one can conjure up many reasons for that, I would like to focus my energies on "spreading the word on Zabbix and mikoomi". So with that, I will dedicate the next few blog articles to getting you familiar with Zabbix and mikoomi.

Ok, so the first question is, what is Zabbix? The straightforward answer is that Zabbix is a monitoring solution that is highly extensible. With Zabbix and a few other pieces of software, you can monitor your servers, network and other IT infrastructure and be alerted when something goes wrong.

However, it is more than just a monitoring solution and has many ways to extend its capabilities - and that's why I refer to it as a framework. You can write "plugins" or "agents" to monitor databases, applications, environmental controls, security, industrial processing, and anything else that you can imagine and write a monitoring program for. The ease of extensibility, the simple, efficient, scalable and distributed architecture, the web-based UI for monitoring and configuration, and the easy reporting, graphing, alerting and notification capabilities are some of the things that get me really excited and passionate about Zabbix. You can get more details from the Zabbix documentation pages.


Architecture
From an architecture perspective, a Zabbix deployment consists of one or more Zabbix servers or nodes (server and node are used synonymously). A node can be configured as a standalone server, a proxy server or a node server in a hierarchical / distributed setup (note that distributed and hierarchical are used synonymously). Due to its elegant architecture, a Zabbix server in a hierarchical/distributed setup functions in the same way as a standalone "Zabbix Server", with the only variation being that in a hierarchical setup, a child node server sends monitoring data to its parent node. In a distributed setup, the top-level parent node is usually referred to as the master node. There is also a proxy server, which is a scaled-down Zabbix server without the web interface. The diagrams below show a standalone and a hierarchical setup of Zabbix.


Nodes in a hierarchical setup function independently and are resilient to network disruptions: each node continues to function until network connectivity between the nodes is restored, at which point the data collected by an intermediate node during the interruption is sent to its parent node. An intermediate node has one and only one parent node, but a parent node can have one or more child nodes. A parent node stores configuration, monitoring and event data for itself as well as its child nodes. Consequently, configuration changes for a node can be made on the node itself, on its parent or on one of its ancestors.

Configuration changes between a master and a child node are exchanged every 120 seconds. A child node does not receive its parent's or ancestors' configuration, but a child sends its configuration data to its parent, and this configuration data percolates upwards up to the top-level master node. Event and monitoring data flows from a child to its parent node every 10 seconds. The data from a child node flows up the hierarchy until it reaches the top-level root node of the hierarchy.

Besides the "server", the Zabbix architecture also includes a Zabbix agent, which is a platform-specific executable / binary file responsible for collecting monitoring data about the machine/server that it is installed on. Monitoring data can be collected by the Zabbix server itself, a Zabbix agent, a proxy Zabbix server or some external script running on a Zabbix server or proxy server, as shown in the diagram below.
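
The store-and-forward behavior of nodes in a hierarchy can be illustrated with a small Python sketch. This is a toy model and not Zabbix code: the Node class, its method names and the link_up flag are all hypothetical, and the timing (such as the 10-second interval) is left out.

```python
class Node:
    """Toy model of a node in a hierarchy (hypothetical, not Zabbix code)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # at most one parent node
        self.link_up = True       # connectivity to the parent
        self.stored = []          # data kept for this node and its children
        self.pending = []         # data not yet delivered to the parent

    def receive(self, value):
        """Store a value locally and queue it for the parent, if there is one."""
        self.stored.append(value)
        if self.parent is not None:
            self.pending.append(value)

    def sync(self):
        """Forward queued data to the parent; keep it queued while the link is down."""
        if self.parent is None or not self.link_up:
            return
        for value in self.pending:
            self.parent.receive(value)
        self.pending.clear()


master = Node("master")
child = Node("child", parent=master)
grandchild = Node("grandchild", parent=child)

grandchild.link_up = False            # simulate a network disruption
grandchild.receive(("cpu.load", 0.7))
grandchild.sync()                     # nothing reaches the parent yet
delivered_during_outage = len(child.stored)

grandchild.link_up = True             # connectivity restored
for node in (grandchild, child):      # sync from the bottom of the hierarchy up
    node.sync()
```

After the link comes back, the value collected during the outage ends up stored on the child and on the master, which mirrors how data flows up to the top-level node.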


Processes
From a process perspective, a Zabbix server is implemented by the binary zabbix_server (or zabbix-server) which spawns the following daemons or background processes:

WatchDog
A watchdog daemon is responsible for the spawning and terminating of Zabbix processes. It spawns processes as needed and then ensures their graceful shutdown.

Housekeeper
A housekeeper daemon is responsible for removing/purging expired or aged monitoring, event, alert and alarm data. It periodically wakes up and performs housecleaning actions with the interval (in hours) being defined in the Zabbix server configuration file.
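
As a rough illustration of what one housekeeping pass does, here is a minimal Python sketch. The function name, the (timestamp, value) row layout and the retention parameter are my own assumptions for the example and do not mirror the Zabbix schema or configuration.

```python
def purge_expired(rows, now, keep_seconds):
    """Return only the rows that are still within the retention period.

    rows is a list of (timestamp, value) tuples; all names here are
    illustrative and do not mirror the Zabbix schema.
    """
    cutoff = now - keep_seconds
    return [(ts, value) for ts, value in rows if ts >= cutoff]


history = [(100, 1.0), (500, 2.0), (900, 3.0)]
kept = purge_expired(history, now=1000, keep_seconds=600)
```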

Poller
A poller daemon is responsible for collecting monitoring data (data can be pulled by Zabbix server or pushed into a Zabbix server as explained later). The poller is responsible for collecting data at regular intervals as defined in the host or template configuration. There are usually several poller daemons running in a Zabbix server - and their number can be configured in the Zabbix configuration file.

HTTPPoller
Zabbix has a sophisticated mechanism to not only monitor a "web resource" but also to simulate simple "web transactions". This is done by the HTTPPoller daemon.

Discoverer
The discoverer daemon is responsible for periodically scanning specified networks for new hosts and Zabbix agents (actually hosts/machines running the Zabbix agent).

Pinger
The pinger daemon is responsible for performing ICMP checks as needed.

DB Config Syncer
This daemon is responsible for periodically (every 60 seconds) refreshing the configuration cache (size = 8 MB) that is shared by the Zabbix daemons. Using a shared cache significantly reduces the load on the back-end database and improves performance and throughput of the Zabbix node.

DB Data Syncer
Monitoring data that is either pulled or pushed by the various daemons is stored in memory (size = 8 MB) instead of being written to the database immediately. The DB data syncer daemon is responsible for periodically (every 5 seconds) "flushing" this monitoring data to the back-end database, with the objective of reducing load on the back-end database and improving overall performance.
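
The idea of buffering values in memory and writing them out in batches can be sketched in a few lines of Python. This is only a model of the technique (a write-behind buffer) and not the Zabbix implementation: the HistoryBuffer class and the write_batch callable are hypothetical, and the item keys are just sample values.

```python
class HistoryBuffer:
    """Toy write-behind buffer (hypothetical; not the Zabbix implementation)."""

    def __init__(self, write_batch):
        self._write_batch = write_batch   # callable that persists a list of values
        self._pending = []

    def add(self, item_key, value):
        """Record a value in memory only."""
        self._pending.append((item_key, value))

    def flush(self):
        """Write everything collected so far in one batch and empty the buffer."""
        if not self._pending:
            return 0
        batch, self._pending = self._pending, []
        self._write_batch(batch)
        return len(batch)


database_writes = []                      # stands in for the back-end database
cache = HistoryBuffer(database_writes.append)
cache.add("system.cpu.load", 0.42)
cache.add("vm.memory.size[free]", 1024)
flushed = cache.flush()                   # one database write for two values
```

In a real daemon, flush() would be called on a timer (every 5 seconds in the description above) rather than by hand.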

Node Watcher
In a distributed configuration, this daemon is responsible for maintaining heart-beats with parent and children nodes.

Timer
Since most of the activities in a Zabbix server are time-based, the various daemons need to perform activities at periodic intervals. Since getting the time from the system clock is an expensive operation requiring a context switch among other things, Zabbix runs its own clock-keeping process called the timer.

Escalator
An alert daemon takes action when an event of interest happens and an action has been defined for the event. Trigger, event and action, all of which are related terms, are described later.

Trapper
A trapper daemon listens on port 10051 and receives and processes monitoring data from Zabbix agents, the zabbix_sender process, Zabbix proxy nodes and Zabbix child nodes. The trapper process is a good way to feed bulk data to a Zabbix server.


Monitoring
An entity that is being monitored is referred to as a host in Zabbix. This underscores the network and infrastructure monitoring heritage of Zabbix. However, Zabbix can be used to monitor anything from infrastructure components to databases, applications, services, web services, etc. A host in turn can have one or more monitoring attributes or items to be monitored. Each item can be independently monitored and there are various ways of doing that. Items for a host can be defined explicitly or can be inherited from a "template" that a host is associated with. A host can be associated with multiple templates and can have items that it inherits from the linked templates as well as explicitly defined items.

In Zabbix terminology, monitoring data about items is referred to as "history" and is obtained via either a "pull" or a "push" mechanism. In the "pull" approach, the Zabbix server goes out and collects the monitoring data itself using one or more of its daemons. It can collect data using the following mechanisms:
  • Zabbix agent
  • SNMP
  • telnet
  • ssh
  • calculated
  • external script
The pull approach of monitoring is supported by the poller daemon processes, and there are some built-in metrics available to monitor the effectiveness of the poller processes.
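
To show what "collecting data at regular intervals" means for a poller, here is a minimal scheduling sketch in Python. It is a simplified model and an assumption on my part, not how Zabbix schedules checks internally: the function name and the item keys are made up, and every item is first due one interval after time 0.

```python
import heapq


def poll_schedule(items, until):
    """Return (time, item_key) pairs in the order a poller would check them.

    items maps an item key to its check interval in seconds. This is a
    simplified model: every item is first due one interval after time 0.
    """
    queue = [(interval, key) for key, interval in items.items()]
    heapq.heapify(queue)
    checks = []
    while queue and queue[0][0] <= until:
        due, key = heapq.heappop(queue)
        checks.append((due, key))
        heapq.heappush(queue, (due + items[key], key))   # schedule the next check
    return checks


schedule = poll_schedule({"cpu": 30, "disk": 60}, until=90)
```

With a 30-second and a 60-second item, the first 90 seconds contain three checks of the first item and one of the second.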

In the push approach, monitoring data for an item is sent to the Zabbix server by an "active Zabbix agent", Zabbix proxy server or an external script on the Zabbix server that executes the "zabbix_sender" executable. Note that in a distributed architecture, child nodes also push monitoring data to their parent node. The parent node in turn pushes its own as well as its child nodes' monitoring data to its parent node. Monitoring data is thus pushed up the hierarchy in a distributed setup. The push approach is supported by the trapper daemon process which listens on port 10051.
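
For the curious, here is a sketch of what a pushed value looks like on the wire. It follows my understanding of the Zabbix sender protocol (a "ZBXD" signature, a flag byte, an 8-byte little-endian length and a JSON payload), so treat the framing details as an assumption to verify against the protocol documentation for your Zabbix version. The function name, the host name web01 and the item key app.queue.length are made up, and the code only builds the bytes without sending anything.

```python
import json
import struct


def build_sender_packet(host, key, value):
    """Build one 'sender data' request as bytes (nothing is sent anywhere)."""
    payload = json.dumps({
        "request": "sender data",
        "data": [{"host": host, "key": key, "value": str(value)}],
    }).encode("utf-8")
    # Assumed framing: "ZBXD" signature, protocol flag 0x01, then the payload
    # length as an 8-byte little-endian integer, followed by the JSON payload.
    return b"ZBXD\x01" + struct.pack("<Q", len(payload)) + payload


packet = build_sender_packet("web01", "app.queue.length", 17)
```

In practice the zabbix_sender executable takes care of all this; the sketch is only meant to show that the trapper receives a small framed JSON message.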

All valid monitoring data is first cached in memory and then flushed to the Zabbix database at regular intervals. As monitoring data is collected and/or received, it is compared with any "trigger" that may have been defined on the item or on a combination of items. If the monitored data satisfies a trigger condition (i.e. the trigger condition becomes true), then an event is generated. Furthermore, if an "action" has been defined for a trigger (often more generic actions are defined), then that action is executed by the Zabbix server. Items can be grouped together and the group is referred to as an application.
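
The chain from value to trigger to event to action can be sketched as follows. This is a minimal model under my own assumptions: the function and dictionary layout are hypothetical, and only the transition of a trigger condition from false to true is modeled.

```python
def process_value(trigger, value, state, actions):
    """Evaluate a trigger against a new value and return any event generated.

    trigger is a dict with a 'name' and a 'condition' callable; state records
    whether each trigger is currently true. All names are illustrative.
    """
    is_true = trigger["condition"](value)
    was_true = state.get(trigger["name"], False)
    state[trigger["name"]] = is_true
    if is_true and not was_true:          # the condition has just become true
        event = {"trigger": trigger["name"], "value": value}
        for action in actions:
            action(event)                 # e.g. send a notification
        return event
    return None


notifications = []
high_load = {"name": "High CPU load", "condition": lambda v: v > 5.0}
state = {}
events = [process_value(high_load, v, state, [notifications.append])
          for v in (1.2, 6.5, 7.0, 2.0)]
```

Only the second value produces an event here: the third value still satisfies the condition, but the trigger was already true at that point.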

Besides hosts that are explicitly defined/configured by a user on a Zabbix server, hosts can also be set up implicitly by a "discovery process". The Zabbix discoverer process periodically scans the network for IP addresses and certain TCP ports on discovered IPs. Rules can be defined to configure a newly "discovered" host on a specified network as a new host in Zabbix.
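
A discovery scan boils down to walking an address range and probing ports. The Python sketch below shows that loop under stated assumptions: the discover function and the fake probe are hypothetical, the addresses come from a documentation-only range, and the probe is passed in so the logic can be tried without touching a real network. Port 10050 is the default Zabbix agent port.

```python
import ipaddress


def discover(network, ports, probe):
    """Return {ip: [open ports]} for every address where a probed port answers.

    probe(ip, port) -> bool is supplied by the caller, so the scan logic can
    be exercised without touching a real network.
    """
    found = {}
    for address in ipaddress.ip_network(network).hosts():
        ip = str(address)
        open_ports = [port for port in ports if probe(ip, port)]
        if open_ports:
            found[ip] = open_ports
    return found


def fake_probe(ip, port):
    """Pretend that only one machine answers, on port 10050."""
    return (ip, port) == ("192.0.2.2", 10050)


hosts = discover("192.0.2.0/30", [22, 10050], fake_probe)
```

A real probe could, for example, wrap socket.create_connection((ip, port), timeout=1) in a try/except OSError block and return whether the connection succeeded.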


Templates
Templates are one of the most important concepts in Zabbix and allow easy and flexible extensibility and inheritance. Essentially, a template is a special kind of host that is defined with items, graphs and triggers. This host is not actually monitored but serves as a template and is available to be "inherited" by an actual host. When defining an actual host that is to be monitored, it can be linked to a template. The actual host then "inherits" the items, triggers and graphs defined in the template, and monitoring is done as defined in the template. A host can be linked to one or more templates, in which case the host inherits items, triggers and graphs from all the linked templates. Furthermore, a template can be linked to other templates, thereby allowing reusability and deeper inheritance. The screenshot below is an example of a Zabbix server within the Zabbix appliance being "self-monitored".
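
Template linking is easiest to picture as a walk over a graph of links. Here is a minimal Python sketch of how the final set of items for a host could be resolved. The function, the template names and most item keys are invented for the example; this is not how Zabbix stores or resolves templates internally.

```python
def resolve_items(name, links, own_items, _seen=None):
    """Collect the items a host or template ends up with, including inherited ones.

    links maps a host/template name to the templates it is linked to;
    own_items maps a name to the items defined directly on it.
    """
    seen = set() if _seen is None else _seen
    if name in seen:                      # guard against circular links
        return set()
    seen.add(name)
    items = set(own_items.get(name, ()))
    for template in links.get(name, ()):
        items |= resolve_items(template, links, own_items, seen)
    return items


links = {"web01": ["Template_Linux", "Template_Apache"],
         "Template_Apache": ["Template_HTTP"]}
own_items = {"web01": ["custom.item"],
             "Template_Linux": ["system.cpu.load"],
             "Template_Apache": ["apache.workers"],
             "Template_HTTP": ["http.response.time"]}
items = sorted(resolve_items("web01", links, own_items))
```

The host web01 ends up with its own item plus everything from the two templates it links to, including the item that Template_Apache itself inherits from Template_HTTP.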



In Zabbix, one can "graph" any numeric (integer or float) monitoring data (or item) by just clicking on the monitored data. Furthermore, Zabbix also allows graphing several items together in a single graph - and such a "definition" at the template level is referred to as a graph (simple, eh?). Similarly, a trigger is essentially a condition defined on the values of monitored items. A trigger can be defined on the current value of an item, on an expression involving several items, on a comparison of an item's current value with past values of the same or other items, or on a comparison of an expression over current values with an expression over historical values.
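
As an example of a trigger that compares the current value of an item with its own past, here is a small Python sketch. It is written in plain Python rather than Zabbix trigger expression syntax, and the function name, the factor and the window size are assumptions chosen for illustration.

```python
def spike_trigger(history, factor=2.0, window=3):
    """True if the newest value exceeds factor times the mean of the preceding values.

    history is a list of numeric values, oldest first; window is how many
    of the values before the newest one are averaged.
    """
    if len(history) <= window:
        return False                      # not enough history to compare against
    *earlier, current = history
    recent = earlier[-window:]
    return current > factor * (sum(recent) / len(recent))
```

For instance, spike_trigger([10, 12, 11, 30]) is true because 30 is more than twice the average of the three earlier values, while spike_trigger([10, 12, 11, 15]) is false.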


Plugins
Templates are essentially special kinds of hosts as described above. In many cases, monitoring an application, database or some other entity requires capabilities beyond what comes built-in with Zabbix. For example, monitoring a Java program or application may require a JMX client or monitoring an RDBMS may require a special script or executable. In this case, the collection of template and associated scripts is referred to as a "plugin". This is a term that I have coined to differentiate it from a template.

3 comments:

  1. Hi Jayesh

    I've spent a couple of days setting up the mongo plugin but I am having trouble. I know this is not the right forum but I can't find your email - is it possible that you email petros [at] diveris.org so I can discuss the problem with you? Many thanks, Petros

  2. There is a problem with your PHP script not getting the oplog info correct. I'm not sure how to contact you, but I will post the corrected PHP here:

    http://pastie.org/3403054

    For this change to work the oplog item in the MongoDB template needs to be changed to this key oplog_rs_count.

    Hope this helps others.

  3. Thanks Brian for the feedback. I'll update the php script.
