Torque
TORQUE is an open source resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original *PBS project and, with more than 1,200 patches, has incorporated significant advances in the areas of scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the U.S. Dept. of Energy, Sandia, PNNL, U of Buffalo, TeraGrid, and many other leading-edge HPC organizations.
Overview
Torque is the logical successor to PBS/OpenPBS. My earlier experience was with Condor, back when open source PBS was just coming around. Now that PBS has matured, it makes sense to use Torque and one of the myriad schedulers available for it.
Installation
OS Environment
- Fedora Core on x86_64 (fully updated)
- Maui 3.2.6p13
- Key-based SSH already set up from the master to the nodes and between the nodes themselves.
Compile / Install
Install Torque into /usr/local.
$ pwd
/home/cluster/packages/torque/torque-1.2.0p4
$ ./configure --prefix=/usr/local --with-scp
The spool directory goes to /usr/spool/PBS by default. I am not sure whether this is the ideal location but it works well for me.
After the install, be sure to add a start/stop script for pbs_server and pbs_sched. pbs_sched works well for testing basic functionality of the grid, but it needs to be disabled before switching to Maui.
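A start/stop script could look something like the sketch below. It is not a full init script. It assumes the daemons ended up in /usr/local/sbin and qterm in /usr/local/bin (which is what --prefix=/usr/local should give), and that pkill is available to stop the scheduler. The torque_ctl function name and the RUN and PREFIX variables are made up for this example.

```shell
#!/bin/sh
# Minimal start/stop helper for pbs_server and pbs_sched (a sketch only).
# Assumptions: daemons are in $PREFIX/sbin and qterm is in $PREFIX/bin,
# as expected from ./configure --prefix=/usr/local.
# Set RUN=echo to print the commands instead of executing them.
PREFIX=${PREFIX:-/usr/local}
RUN=${RUN:-}

torque_ctl() {
    case "$1" in
        start)
            $RUN "$PREFIX/sbin/pbs_server"
            $RUN "$PREFIX/sbin/pbs_sched"
            ;;
        stop)
            # Stop the scheduler first, then ask the server to shut down.
            $RUN pkill -x pbs_sched
            $RUN "$PREFIX/bin/qterm" -t quick
            ;;
        *)
            echo "Usage: torque_ctl {start|stop}" >&2
            return 1
            ;;
    esac
}

# Only act when the script is given an argument.
if [ "$#" -gt 0 ]; then
    torque_ctl "$@"
fi
```

Running it with RUN=echo first shows what would be executed without touching the daemons. Once Maui takes over scheduling, the pbs_sched lines would be dropped.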
Configuration
torque.cfg
SERVERHOST opterome.ncbs.res.in
ALLOWCOMPUTEHOSTSUBMIT true
nodes
node1 np=4
node2 np=4
node3 np=4
node4 np=4
node5 np=4
node6 np=4
node7 np=4
node8 np=4
node9 np=4
node10 np=4
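Since the node entries follow a regular pattern, the file can also be generated rather than typed by hand. The sketch below assumes the nodes really are named node1 through node10 with 4 processors each, as listed above. The gen_nodes function is a hypothetical helper, not part of Torque.

```shell
#!/bin/sh
# Print one "nodeN np=P" line per compute node.
# Assumptions: nodes are named node1..nodeN and all have the same
# processor count, as in the listing above.
gen_nodes() {
    count=$1   # number of nodes
    np=$2      # processors per node
    i=1
    while [ "$i" -le "$count" ]; do
        echo "node$i np=$np"
        i=$((i + 1))
    done
}

gen_nodes 10 4
```

The output would be redirected into the nodes file under server_priv in the spool directory (/usr/spool/PBS/server_priv/nodes if the default location mentioned earlier is in use).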
server_name
master
Monitoring
Nagios
checkcommands.cfg
define command{
command_name check_mom
command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 15002
}
services.cfg
define service{
use generic-service
host_name nodeN
service_description MOM
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups cluster-admins
notification_interval 120
notification_period 24x7
notification_options c,r
check_command check_mom
}
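The nodeN above is a placeholder, so one service definition is needed per compute node. A small loop can write them out. This sketch assumes the hosts are already defined in Nagios as node1 through node10 and copies the remaining fields from the definition above. The gen_mom_services function is a hypothetical helper.

```shell
#!/bin/sh
# Emit one MOM service definition per compute node.
# Assumption: hosts node1..nodeN already exist in the Nagios host
# configuration; all other fields mirror the definition above.
gen_mom_services() {
    count=$1   # number of nodes
    i=1
    while [ "$i" -le "$count" ]; do
        cat <<EOF
define service{
use generic-service
host_name node$i
service_description MOM
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups cluster-admins
notification_interval 120
notification_period 24x7
notification_options c,r
check_command check_mom
}
EOF
        i=$((i + 1))
    done
}

gen_mom_services 10
```

The output can be appended to services.cfg, after which the Nagios configuration should be verified before reloading.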