Spreading MATLAB workers across different nodes

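By default, LSF tries to place all of your workers on the same node when enough slots are available there. That can be a problem if the workers together need more memory than a single node provides. One solution is to force each worker onto a different node by adding a span section to the resource requirement string that bsub accepts through its -R option.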

Here is an example script for bsub (dct_bsub.job):

#!/bin/bash
#BSUB -J DCTexample
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -W 00:30
#BSUB -q debug
#BSUB -n 4
#
### Run job
# No cd is needed if the job is submitted from the directory that contains dct_example.m;
# otherwise, cd to that directory first.
# Note: run "module load matlab" before submitting this job.
matlab < dct_example.m >& dct_example.log

When we submit this script, LSF places all 4 workers on one node (n204 in this case):

login4 372% bsub < dct_bsub.job 
Job <1084155> is submitted to queue <debug>.

login4 375% bjobs
JOBID     USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1084155   agleaso RUN   debug      login4.pega 4*n204.pega DCTexample Jul 16 15:28
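
In the EXEC_HOST column, 4*n204 means four slots on node n204. For jobs with many workers it can be easier to count the distinct hosts than to read the column by eye. The following is a minimal sketch, not part of the original procedure: count_exec_hosts is a hypothetical helper name, and it assumes every host appears with a slots* prefix as in the listings on this page.

```shell
# Hypothetical helper: count the distinct execution hosts in default
# bjobs output read from stdin. It only looks at "slots*host" tokens,
# such as 4*n204.pega, and ignores everything else on the line.
count_exec_hosts() {
    grep -oE '[0-9]+\*[A-Za-z0-9_.-]+' |  # extract tokens like 4*n204.pega
        cut -d'*' -f2 |                   # keep only the host name
        sort -u |                         # one line per distinct host
        wc -l |                           # count those lines
        tr -d ' '                         # strip padding that some wc versions add
}

# Usage on the cluster (job ID taken from the example above):
#   bjobs 1084155 | count_exec_hosts
```

For the job above this would report 1, since all four slots are on n204.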

If we modify dct_bsub.job as follows, adding the -R parameter, the workers all run on different nodes:

#!/bin/bash
#BSUB -J DCTexample
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -W 00:30
#BSUB -q debug
#BSUB -n 4
#BSUB -R "span[ptile=1]"
#
### Run job
# No cd is needed if the job is submitted from the directory that contains dct_example.m;
# otherwise, cd to that directory first.
# Note: run "module load matlab" before submitting this job.
matlab < dct_example.m >& dct_example.log

And here is the result (one worker is started on each of nodes n200, n201, n202, and n204):

login4 377% bsub < dct_bsub.job
Job <1084156> is submitted to queue <debug>.

login4 378% bjobs
JOBID     USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1084156   agleaso RUN   debug      login4.pega 1*n200.pega DCTexample Jul 16 15:31
                                               1*n201.pegasus.edu
                                               1*n202.pegasus.edu
                                               1*n204.pegasus.edu

Spreading workers this way is a good technique if they need more memory in total than a single node can provide.
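
The ptile value does not have to be 1. As far as the LSF documentation describes it, span[ptile=P] allows up to P slots per node, so a job with -n N is spread over roughly N/P nodes, rounded up. The sketch below only does that arithmetic; min_nodes is a hypothetical helper name, and the actual placement still depends on what the scheduler has available.

```shell
# Hypothetical helper: minimum number of nodes needed for N slots (-n)
# when span[ptile=P] allows at most P slots per node.
# Uses integer ceiling division: (N + P - 1) / P.
min_nodes() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

min_nodes 4 1   # -n 4 with span[ptile=1]: one worker per node, as above
min_nodes 4 2   # -n 4 with span[ptile=2]: two workers per node
```

The two calls print 4 and 2. A larger ptile reduces the number of nodes the job has to wait for, at the cost of less memory per worker.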