Nagios Service And Host Monitoring
==================================

Shanker Balan
http://shankerbalan.com
shanu at shankerbalan dot com

Changelog:

* Fri Jan 9 13:41:59 IST 2004
  - First cut

[godzilla] ~> pkg_info -x nagios
Information for nagios-1.1_4:

Comment:
Extremely powerful network monitoring system

Description:
Nagios is a host and service monitor designed to inform you of network
problems before your clients, end-users or managers do. It has been
designed to run under the Linux operating system, but works fine under
most *NIX variants as well. The monitoring daemon runs intermittent
checks on hosts and services you specify using external "plugins" which
return status information to Nagios. When problems are encountered, the
daemon can send notifications out to administrative contacts in a
variety of different ways (email, instant message, SMS, etc.). Current
status information, historical logs, and reports can all be accessed
via a web browser.

WWW: http://www.nagios.org/

Overview
========

The turnaround time for fixing service interruptions and host failures
has been fairly high. Despite stepping up monitoring on the servers that
require special attention, there are still times when we rely on the
client to call us and tell us that a service or host is down. This
obviously does not look good to the client, who expects us to be
proactive with support!

Tools like MRTG (http://www.mrtg.org), logcheck etc. do help to some
extent. They are excellent for reporting, but not really suitable for
alerting; they were never meant to be used that way to begin with.

** Feature Requirements **

Below is a list of features that we were looking for in a monitoring
package:

- Host monitoring for server reachability

  Basic ping tests to check whether the routers, switches and gateways
  are up and running.

- Service monitoring

  Check whether the services are indeed listening and working as
  intended.
  It should be possible to carry out protocol actions such as logging in
  to the POP server, retrieving an HTML page, etc.

- Notification (not flooding) by email, SMS etc. for downtimes and
  recovery

  In case of problems, alert the respective admin of the host. The alert
  repeat count should be configurable, so that inboxes are not flooded
  with warnings.

- Central Monitoring Server

  The monitoring station should be central, for obvious reasons.

- Support Passive Checks

  Not all servers are publicly accessible. The monitoring tool should
  support pushing updates to the central server. This enables monitoring
  of servers inside the client's LAN, behind a firewall, which cannot
  otherwise be reached directly.

- Extensible via custom plugins

The first tool I came across on http://freshmeat.net (Nagios, of course)
did all this for me and much more. Some of the exciting features that
Nagios offers are:

- Ability to group hosts and assign a distinct contact list for alerts

- Host and service dependency checks, which make it possible to
  establish relationships between hosts and services.

Installation
============

My first Nagios installation and configuration experience is on FreeBSD
5.2-CURRENT. I have chosen to install Nagios without MySQL and "nagmin"
support.

[godzilla] ~# portinstall net/nagios

RPMs for RedHat are available at http://dag.wieers.com/packages/nagios/.
Use the nagios-1.1-5 packages as suggested on the site. See
http://dag.wieers.com/home-made/apt/ for Apt packages.

Configuration
=============

Advice for Beginners (And they mean it)
http://nagios.sourceforge.net/docs/1_0/beginners.html

Nagios does not work out of the box!!!!
"/usr/local/etc/rc.d/nagios start" will only spew out errors.

Instead of banging your head against it, try the following approach to
configuring Nagios:

- Get the CGI interface working

  ###
  ### httpd.conf
  ###
  ScriptAlias /nagios/cgi-bin/ /usr/local/share/nagios/cgi-bin/
  <Directory "/usr/local/share/nagios/cgi-bin">
      AllowOverride AuthConfig
      Options ExecCGI
      Order allow,deny
      Allow from all
  </Directory>

  Alias /nagios/ /usr/local/share/nagios/
  <Directory "/usr/local/share/nagios">
      Options None
      AllowOverride AuthConfig
      Order allow,deny
      Allow from all
  </Directory>

  ###
  ### etc/nagios/cgi.cfg
  ###
  # Disable auth for the moment. Makes testing easier.
  use_authentication=0

  [godzilla] ~# apachectl restart && lynx http://localhost/nagios/

- cd /usr/local/etc/nagios && less *.cfg

  [godzilla] /usr/local/etc/nagios# ls *.cfg
  cgi.cfg             escalations.cfg     nagios.cfg
  checkcommands.cfg   hostextinfo.cfg     resource.cfg
  contactgroups.cfg   hostgroups.cfg      serviceextinfo.cfg
  contacts.cfg        hosts.cfg           services.cfg
  dependencies.cfg    misccommands.cfg    timeperiods.cfg

  Go through all of them. SLOWLY! Start with "hosts.cfg".

- I had better success starting with empty .cfg files than tweaking the
  existing examples.

- Keep a "tail -f /var/nagios/nagios.log" running in a separate
  terminal.

The setup below is specific to my network. Change the IP addresses as
appropriate. In the first phase, I am setting up Nagios to monitor the
workstation it is running on (godzilla.mydomain.com, IP 192.168.1.24).

[godzilla] ~# cd /usr/local/etc/nagios/

-- hosts.cfg

###
### hosts.cfg
###

#
define host{
        name                            generic-host
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        register                        0
}

# My first workstation
define host{
        use                             generic-host
        host_name                       godzilla
        alias                           My Workstation
        address                         192.168.1.24
        check_command                   check-host-alive
        max_check_attempts              10
        notification_interval           480
        notification_period             24x7
        notification_options            d,u,r
}

-- hostgroups.cfg

###
### hostgroups.cfg
###

define hostgroup{
        hostgroup_name                  workstations
        alias                           My Workstations
        contact_groups                  admins
        members                         godzilla
}

-- contacts.cfg

###
### contacts.cfg
###

define contactgroup{
        contactgroup_name               linux-admins
        alias                           Linux Administrators
        members                         shanu
}

define service{
        name                            generic-service
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        register                        0
}
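
The definitions above lean on templates: "generic-host" and
"generic-service" carry "register 0", so they are never monitored
themselves, and the godzilla host pulls in their settings with "use".
The Python sketch below illustrates that merging on a cut-down copy of
the host definitions. It is only an illustration, not how Nagios itself
parses its files: the helper names (parse_objects, resolve) and the
SAMPLE text are made up for this example, and it assumes that "name",
"use" and "register" are not passed down from a template to the objects
that use it.

```python
import re

# Hypothetical helpers for illustration only; Nagios' real parser is more
# involved (this regex would, for instance, also match a commented-out define).
DEFINE_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.DOTALL)

# Assumption: these directives belong to the template itself and are not inherited.
NOT_INHERITED = ("name", "use", "register")


def parse_objects(text):
    """Return a list of (object_type, directives) pairs from config text."""
    objects = []
    for obj_type, body in DEFINE_RE.findall(text):
        directives = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop trailing ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)  # directive name, then the rest as value
            directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        objects.append((obj_type, directives))
    return objects


def resolve(objects):
    """Flatten 'use' chains and return only the registered objects."""
    templates = {(t, d["name"]): d for t, d in objects if "name" in d}

    def flatten(obj_type, directives, seen=()):
        merged = {}
        parent = directives.get("use")
        if parent is not None:
            key = (obj_type, parent)
            if key in seen or key not in templates:
                raise ValueError("bad template reference: %s" % parent)
            inherited = flatten(obj_type, templates[key], seen + (key,))
            merged.update(
                (k, v) for k, v in inherited.items() if k not in NOT_INHERITED
            )
        merged.update(directives)  # local values override inherited ones
        return merged

    return [
        (t, flatten(t, d))
        for t, d in objects
        if d.get("register", "1") != "0"  # skip pure templates
    ]


SAMPLE = """
define host{
        name                    generic-host
        notifications_enabled   1
        register                0
}
define host{
        use                     generic-host
        host_name               godzilla
        address                 192.168.1.24
}
"""

if __name__ == "__main__":
    for obj_type, directives in resolve(parse_objects(SAMPLE)):
        print(obj_type, directives)
```

Running it prints a single host object for godzilla that carries
notifications_enabled from the template alongside its own host_name and
address; the template itself is dropped because of "register 0".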