Imagine you would like to perform a large series of calculations. Obviously, you would not run the complete series at the same time. Ideally, you would start as many jobs as your computer has processors and start the next job in the series whenever a previous job has finished. You could write your own program for this, but many such programs are already available. In this blog post, I will show you how to install and use Torque. Although Torque can be set up to run on a computer cluster, I will show you how you can install and use it on just a single machine. This tutorial is written for Debian Linux, but should in principle also work for Ubuntu and, with (hopefully) small modifications, for other distributions.
Download the tarball from the website. Extract, compile and install it on your machine.
wget "http://www.adaptivecomputing.com/index.php?wpfb_dl=2868" -O torque-5.1.0.tar.gz
tar -xzf torque-5.1.0.tar.gz
cd torque-5.1.0
./configure --prefix=/opt/gcc-4.7.2/torque-5.1.0
make -j5
sudo make install
Torque uses four daemons that have to be loaded at boot time. Copy their init.d scripts to the /etc/init.d folder like so:
sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
and add these to the boot procedure
sudo update-rc.d pbs_mom defaults
sudo update-rc.d pbs_server defaults
sudo update-rc.d pbs_sched defaults
sudo update-rc.d trqauthd defaults
Next, we would like to set up our machine as both the server and the (only) node. Start the trqauthd daemon:
sudo /etc/init.d/trqauthd start
Log in as root, add the binary folders of Torque to the path and run the torque.setup script from the source directory.
su
export PATH=/opt/gcc-4.7.2/torque-5.1.0/sbin:/opt/gcc-4.7.2/torque-5.1.0/bin:$PATH
./torque.setup root
If you get an error like the following
qmgr obj= svr=default: Bad ACL entry in host list MSG=First bad host:
check your /etc/hosts file and adjust the entries in there. Torque only reads the first two columns to match the hostname with an IP address.
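Since only the first two columns are used, a quick way to see which address would be associated with a name is to look that name up in the second column yourself. The snippet below is a minimal sketch of such a check. The function name lookup_second_column is my own, and it only mimics the matching rule described above; it does not call anything from Torque.

```shell
#!/bin/sh
# Sketch: mimic a lookup that only considers the first two columns of a
# hosts-style file. lookup_second_column is a hypothetical helper name.
lookup_second_column() {
    hosts_file="$1"
    name="$2"
    # Skip comment lines, compare the name against column 2 only and
    # print the address (column 1) of the first match.
    awk -v name="$name" '!/^[[:space:]]*#/ && $2 == name { print $1; exit }' "$hosts_file"
}

# Check the name of this machine (uname -n prints the hostname).
if [ -r /etc/hosts ]; then
    ip="$(lookup_second_column /etc/hosts "$(uname -n)")"
    echo "second-column match for $(uname -n): ${ip:-none}"
fi
```

If this prints none, the name is probably missing or sits in a later column. For example, with a line such as 127.0.1.1 myhost.example.org myhost (hypothetical names), the short name myhost is in the third column and would not be matched.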
Add your machine to the list of nodes by editing /var/spool/torque/server_priv/nodes. You have to specify the number of cores in your machine after the np= directive.
localhost np=6
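If you are unsure about the number of cores, you can let the machine tell you. The following is a small sketch that builds the line for the nodes file from the detected count. Note that nproc reports the processing units available to the current process, which may include hyperthreads, so you might prefer a lower number. The variable names are my own.

```shell
#!/bin/sh
# Sketch: build the nodes-file line from the detected core count.
# NODE_NAME is an assumption; this post uses "localhost".
NODE_NAME="localhost"

# nproc is part of GNU coreutils; getconf is used as a fallback.
NUM_CORES="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"

NODE_LINE="${NODE_NAME} np=${NUM_CORES}"
echo "${NODE_LINE}"

# To actually write the file (as root), you could use something like:
#   echo "${NODE_LINE}" > /var/spool/torque/server_priv/nodes
```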
Edit /var/spool/torque/mom_priv/config and set your machine as the $pbsserver. Also configure the bitmap for the logging events.
$pbsserver ST-A1771
$logevent 225
Please note that in the above file, ST-A1771 should be replaced by the name of your local machine. Moreover, this name should resolve to an IP address, which can be configured in /etc/hosts. (Thanks to danielmejia55_at_gmail_dot_com for mentioning this, see comments below.)
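The value 225 given to $logevent is a bitmap, so it is the sum of a number of powers of two, each of which switches on one class of log events. The sketch below only does the arithmetic of splitting the value into its bits. Which event class belongs to which bit is defined in the Torque documentation and is not reproduced here.

```shell
#!/bin/sh
# Sketch: split the $logevent value into the individual bits that are set.
LOGEVENT=225
BITS=""
bit=1
while [ "$bit" -le "$LOGEVENT" ]; do
    # Keep this power of two if the corresponding bit is set.
    if [ $(( LOGEVENT & bit )) -ne 0 ]; then
        BITS="${BITS:+$BITS + }$bit"
    fi
    bit=$(( bit * 2 ))
done
echo "$LOGEVENT = $BITS"   # prints: 225 = 1 + 32 + 64 + 128
```

Changing LOGEVENT to another value shows which bits a different setting would switch on.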
Start pbs_mom.
/etc/init.d/pbs_mom start
For every user to be able to submit jobs to the queuing system and check the current status of the queue, every user needs the bin folder of Torque in their $PATH variable. To that end, add the Torque binaries folder to the PATH in /etc/profile:
echo 'export PATH=/opt/gcc-4.7.2/torque-5.1.0/bin:$PATH' >> /etc/profile
Finally, log out as root (CTRL+D)
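To verify that the PATH change has taken effect for a regular user, you can check in a fresh login shell whether the Torque commands are found. This is just a sketch; the helper name have_cmd is my own, and the list of commands is limited to the ones used in this post.

```shell
#!/bin/sh
# Sketch: report which of the Torque client commands are on the PATH.
# have_cmd is a hypothetical helper name.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

for cmd in qsub qstat pbsnodes; do
    if have_cmd "$cmd"; then
        echo "found:   $cmd -> $(command -v "$cmd")"
    else
        echo "missing: $cmd"
    fi
done
```

If a command is reported as missing, keep in mind that /etc/profile is only read by login shells, so you may have to log out and in again.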
To check that everything is correctly configured, run
pbsnodes -a
You should get something like the output below. If you do not, you can try to reset pbs_server (see below).
state = free
power_state = Running
np = 6
ntype = cluster
status = rectime=1424867568,cpuclock=OnDemand:1998MHz,varattr=,jobs=,state=free,netload=6890398942,gres=,loadave=0.00,ncpus=4,physmem=8066840kb,availmem=11324404kb,totmem=11970324kb,idletime=240,nusers=1,nsessions=2,sessions=3336 27395,uname=Linux ST-A1771 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u1 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
To reset pbs_server, run
sudo /etc/init.d/pbs_server restart
Also start the scheduler
sudo /etc/init.d/pbs_sched start
And test a job by running
echo "sleep 30" | qsub
When you run
qstat
you should see something like
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
0.ST-A1771 STDIN ivo 0 R batch
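When the queue gets longer, you may want to summarize this output instead of reading it line by line. The sketch below counts the jobs in a given state. It assumes the default qstat layout shown above (two header lines, with the state in the fifth column), and the function name count_state is my own.

```shell
#!/bin/sh
# Sketch: count jobs in a given state from qstat output read on stdin.
# Assumes two header lines and the state letter in the fifth column.
count_state() {
    awk -v state="$1" 'NR > 2 && $5 == state { n++ } END { print n + 0 }'
}

# Only query the queue when qstat is actually available.
if command -v qstat >/dev/null 2>&1; then
    echo "running jobs: $(qstat | count_state R)"
fi
```

Similarly, qstat | count_state Q would give the number of jobs that are still queued.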
Note: danielmejia55_at_gmail_dot_com has noted (see comments below) that if you encounter an error about missing queue directives, you need to set these first. He has kindly provided the settings he has used.
Below, an example submission file for a multiprocessor job is given. In the submission file, you specify the name of the job after the #PBS -N directive, then the number of nodes and the number of processors per node, and finally the maximum time the job is allowed to run. Typically, you would like to run the job in the same folder as where the job file resides. To do so, you can use the $PBS_O_WORKDIR variable. Furthermore, you can use the $PBS_NP variable to pass the number of processes to the mpirun program.
#!/bin/bash
#
#This is an example script example.sh
#
#These commands set up the Torque Environment for your job:
#PBS -N TestJob
#PBS -l nodes=1:ppn=4,walltime=00:12:00
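#show the folder in which the job starts
pwd
#go to the folder from which the job was submitted
cd "$PBS_O_WORKDIR"
pwd
#print the time and date
date
#run the program on the number of processors requested above
mpirun -np "$PBS_NP" ./testjob
#print the time and date again
date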
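Coming back to the large series of calculations from the introduction: once you have a job file like the one above, you can generate one such file per calculation and hand them all to Torque, which then starts them as processors become available. The following is a minimal sketch of that idea. The program ./calc, the input file names and the job names are hypothetical, and the script only submits when qsub is found.

```shell
#!/bin/bash
# Sketch: write one job file per calculation and submit each with qsub.
# ./calc, input_N.dat and the calc_N job names are hypothetical.
JOB_DIR="${JOB_DIR:-$(mktemp -d)}"

for i in 1 2 3; do
    job_file="${JOB_DIR}/calc_${i}.sh"

    # \$ keeps PBS_O_WORKDIR unexpanded so it is evaluated when the job runs.
    cat > "${job_file}" <<EOF
#!/bin/bash
#PBS -N calc_${i}
#PBS -l nodes=1:ppn=1,walltime=00:12:00
cd "\$PBS_O_WORKDIR"
./calc input_${i}.dat
EOF

    if command -v qsub >/dev/null 2>&1; then
        qsub "${job_file}"
    else
        echo "qsub not found, only wrote ${job_file}"
    fi
done
```

Each of these jobs requests a single processor, so with np=6 in the nodes file you would expect up to six of them to run at the same time while the rest wait in the queue.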