Nagios Service And Host Monitoring
==================================
Shanker Balan
http://shankerbalan.com
shanu at shankerbalan dot com
Changelog:
* Fri Jan 9 13:41:59 IST 2004
- First cut
[godzilla] ~> pkg_info -x nagios
Information for nagios-1.1_4:
Comment:
Extremely powerful network monitoring system
Description:
Nagios is a host and service monitor designed to inform you of network
problems before your clients, end-users or managers do. It has been
designed to run under the Linux operating system, but works fine under
most *NIX variants as well. The monitoring daemon runs intermittent
checks on hosts and services you specify using external "plugins" which
return status information to Nagios. When problems are encountered, the
daemon can send notifications out to administrative contacts in a
variety of different ways (email, instant message, SMS, etc.). Current
status information, historical logs, and reports can all be accessed via
a web browser.
WWW: http://www.nagios.org/
Overview
========
The turnaround time for fixing service interruptions and host failures
has been fairly high. Despite stepping up monitoring on the servers
that require special attention, there are still times when we rely on
clients to call us and tell us that a service or host is down.
This obviously does not look good to clients, who expect us to be
proactive with support!
Tools like MRTG (http://www.mrtg.org) and logcheck help to some extent.
They are excellent for reporting but not really suitable for alerting,
and they were never meant to be used that way.
** Feature Requirements **
Below is a list of features that we were looking for from a monitoring
package:
- Host monitoring for server reachability
Basic ping tests to check whether the routers, switches and gateways
are up and running.
- Service monitoring
Check whether the services are actually listening and working as
intended. It should be possible to carry out protocol actions such as
logging in to the POP server or retrieving an HTML page.
- Notification (not flooding) by email, SMS etc for downtimes and
recovery
In case of problems, alert the admin responsible for the host.
The alert repeat count should be configurable so that inboxes are not
flooded with warnings.
- Central Monitoring Server
The monitoring station should be central for obvious reasons.
- Support Passive Checks
Not all servers are publicly accessible. The monitoring tool should
let such servers push their check results to the central server. This
makes it possible to monitor servers inside a client's LAN, behind a
firewall, that cannot be reached directly.
- Extensible via custom plugins
The first tool (Nagios of course) which I came across on
http://freshmeat.net did all this for me and much more. Some of the
exciting features that Nagios offers are:
- Ability to group hosts and assign a distinct contact list for alerts
- Host and service dependency checks, which make it possible to
establish relationships between hosts and services.
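To make the "custom plugins" requirement concrete: a Nagios plugin is
just an executable that prints one line of status text and exits with
0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Below is a minimal
sketch in shell. The function name check_pct and the thresholds are made
up for illustration and are not part of the stock plugin set.

```shell
#!/bin/sh
# Hypothetical plugin skeleton (check_pct is an illustrative name, not a
# stock Nagios plugin). Exit status follows the plugin convention:
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN

# check_pct VALUE WARN CRIT
# Compares a percentage against warning and critical thresholds.
check_pct() {
    # Anything that is not a plain non-negative integer is reported as UNKNOWN.
    for arg in "$1" "$2" "$3"; do
        case "$arg" in
            ''|*[!0-9]*)
                echo "UNKNOWN - usage: check_pct VALUE WARN CRIT"
                return 3
                ;;
        esac
    done

    if [ "$1" -ge "$3" ]; then
        echo "CRITICAL - usage at $1%"
        return 2
    elif [ "$1" -ge "$2" ]; then
        echo "WARNING - usage at $1%"
        return 1
    fi
    echo "OK - usage at $1%"
    return 0
}

# Demo call; a real plugin would end with: check_pct "$@"; exit $?
check_pct 42 80 90    # prints "OK - usage at 42%"
```

Nagios only looks at the exit status and the first line of output, so
the measurement itself can come from any command the script likes.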
Installation
============
My first Nagios installation and configuration was on FreeBSD
5.2-CURRENT.
I have chosen to install Nagios without MySQL and "nagmin" support.
[godzilla] ~# portinstall net/nagios
RPMs for Red Hat are available at http://dag.wieers.com/packages/nagios/.
Use the nagios-1.1-5 packages as suggested on the site.
See http://dag.wieers.com/home-made/apt/ for Apt packages.
Configuration
=============
Advice for Beginners (And they mean it)
http://nagios.sourceforge.net/docs/1_0/beginners.html
Nagios does not work out of the box! Running
"/usr/local/etc/rc.d/nagios start" straight after installation will only
spew out errors. Instead of banging your head against it, try the
following approach to configuring Nagios:
- Get the CGI interface working
###
### httpd.conf
###
ScriptAlias /nagios/cgi-bin/ /usr/local/share/nagios/cgi-bin/
<Directory "/usr/local/share/nagios/cgi-bin/">
    AllowOverride AuthConfig
    Options ExecCGI
    Order allow,deny
    Allow from all
</Directory>
Alias /nagios/ /usr/local/share/nagios/
<Directory "/usr/local/share/nagios/">
    Options None
    AllowOverride AuthConfig
    Order allow,deny
    Allow from all
</Directory>
###
### etc/nagios/cgi.cfg
###
# Disable auth for the moment. Makes testing easier.
use_authentication=0
[godzilla] ~# apachectl restart && lynx http://localhost/nagios/
- cd /usr/local/etc/nagios && less *.cfg
[godzilla] /usr/local/etc/nagios# ls *.cfg
cgi.cfg            escalations.cfg    nagios.cfg
checkcommands.cfg  hostextinfo.cfg    resource.cfg
contactgroups.cfg  hostgroups.cfg     serviceextinfo.cfg
contacts.cfg       hosts.cfg          services.cfg
dependencies.cfg   misccommands.cfg   timeperiods.cfg
Go through all of them. SLOWLY! Start with "hosts.cfg".
- I had better success starting with empty .cfg files than tweaking
the existing examples.
- Keep a "tail -f /var/nagios/nagios.log" running on a separate terminal.
The setup below is specific to my network. Change the IP addresses as
appropriate.
In the first phase, I am setting up Nagios to monitor the workstation
it is running on (godzilla.mydomain.com, IP 192.168.1.24).
[godzilla] ~# cd /usr/local/etc/nagios/
-- hosts.cfg
###
### hosts.cfg
###
#
define host{
name generic-host
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
register 0
}
# My first workstation
define host{
use generic-host
host_name godzilla
alias My Workstation
address 192.168.1.24
check_command check-host-alive
max_check_attempts 10
notification_interval 480
notification_period 24x7
notification_options d,u,r
}
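Rather than starting the daemon after every edit and reading the errors
from the log, the configuration can be validated first: running the
nagios binary with -v and the main config file performs a pre-flight
check without starting anything. Below is a minimal wrapper, assuming
the binary is on the PATH as "nagios" and the main config lives at
/usr/local/etc/nagios/nagios.cfg. The NAGIOS_BIN variable and the
function name preflight are inventions of this sketch.

```shell
#!/bin/sh
# preflight [MAIN_CONFIG]
# Runs the Nagios pre-flight check ("nagios -v <main config>") and reports
# the result. NAGIOS_BIN is a hypothetical override for the binary location.
preflight() {
    cfg=${1:-/usr/local/etc/nagios/nagios.cfg}
    bin=${NAGIOS_BIN:-nagios}

    if "$bin" -v "$cfg"; then
        echo "pre-flight check passed: $cfg"
    else
        echo "pre-flight check FAILED: fix the errors above before starting Nagios"
        return 1
    fi
}
```

A typical use would be "preflight && /usr/local/etc/rc.d/nagios start".
Expect the check to keep complaining until all of the object files have
been filled in.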
-- hostgroups.cfg
###
### hostgroups.cfg
###
define hostgroup{
hostgroup_name workstations
alias My Workstations
contact_groups linux-admins
members godzilla
}
-- contacts.cfg
###
### contacts.cfg
###
define contactgroup{
contactgroup_name linux-admins
alias Linux Administrators
members shanu
}
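Objects refer to one another purely by name: every name on a
contact_groups line must match a contactgroup_name defined somewhere in
the config files, or Nagios will reject the configuration. The sketch
below lists contact group names that are used but never defined. It is a
rough illustration only: undefined_contact_groups is a made-up helper,
and it assumes one directive per line with no spaces inside
comma-separated lists.

```shell
#!/bin/sh
# undefined_contact_groups FILE...
# Prints every name used in a "contact_groups" directive that has no
# matching "contactgroup_name" definition in the given files.
undefined_contact_groups() {
    awk '
        $1 == "contactgroup_name" { defined[$2] = 1 }
        $1 == "contact_groups" {
            # The value may be a comma-separated list of group names.
            n = split($2, names, ",")
            for (i = 1; i <= n; i++)
                if (names[i] != "") used[names[i]] = 1
        }
        END {
            for (name in used)
                if (!(name in defined)) print name
        }
    ' "$@" | sort
}
```

Run it as "undefined_contact_groups /usr/local/etc/nagios/*.cfg". No
output means every referenced group is defined.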
define service{
name generic-service
active_checks_enabled 1
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 1
check_freshness 0
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
register 0
}