Spreading the MATLAB workers over different nodes.
By default, LSF will try to put all your workers on the same node (if available). That can be a problem if the workers together need more memory than a single node provides. One solution is to force each worker to run on a different node using the span resource requirement, passed to bsub with the -R option.
Here is an example script for bsub (dct_bsub.job):
#!/bin/bash
#BSUB -J DCTexample
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -W 00:30
#BSUB -q debug
#BSUB -n 4
#
### Run job
# cd not needed if CWD is the right one when this is submitted
# In other words, cd to the dir
# Note: it IS needed to module load matlab before submitting this
matlab < dct_example.m >& dct_example.log
When we run this script, it puts all 4 workers on one node (n204 in this case):
login4 372% bsub < dct_bsub.job
Job <1084155> is submitted to queue.
login4 375% bjobs
JOBID   USER    STAT  QUEUE  FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1084155 agleaso RUN   debug  login4.pega 4*n204.pega DCTexample Jul 16 15:28
If we modify dct_bsub.job as follows, adding the -R "span[ptile=1]" line, the workers all run on different nodes:
#!/bin/bash
#BSUB -J DCTexample
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -W 00:30
#BSUB -q debug
#BSUB -n 4
#BSUB -R "span[ptile=1]"
#
### Run job
# cd not needed if CWD is the right one when this is submitted
# In other words, cd to the dir
# Note: it IS needed to module load matlab before submitting this
matlab < dct_example.m >& dct_example.log
And here is the result (one worker is started on each of nodes n200, n201, n202, and n204):
login4 377% bsub < dct_bsub.job
Job <1084156> is submitted to queue.
login4 378% bjobs
JOBID   USER    STAT  QUEUE  FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1084156 agleaso RUN   debug  login4.pega 1*n200.pega DCTexample Jul 16 15:31
                                         1*n201.pegasus.edu
                                         1*n202.pegasus.edu
                                         1*n204.pegasus.edu
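The placement can also be checked from inside the job script rather than with bjobs. The following is a minimal sketch, assuming LSF sets the LSB_HOSTS environment variable in the job environment to a space-separated list with one host name per allocated slot (on some sites or for very large jobs it may not be set). The function name count_slots_per_host is hypothetical and not part of LSF.

```shell
#!/bin/bash
# Hypothetical helper: print "host: slots" for each host in a host list.
# With no argument it reads LSB_HOSTS, which LSF is assumed to set to a
# space-separated list containing one host name per allocated slot.
count_slots_per_host() {
    local hosts="${1:-$LSB_HOSTS}"
    if [ -z "$hosts" ]; then
        echo "no host list available" >&2
        return 1
    fi
    # One host per line, then count how often each name repeats.
    printf '%s\n' $hosts | sort | uniq -c | awk '{print $2 ": " $1}'
}

# Illustration with explicit lists matching the two runs shown above:
count_slots_per_host "n204 n204 n204 n204"   # all four slots on one node
count_slots_per_host "n200 n201 n202 n204"   # one slot on each of four nodes
```

In a real job script, calling count_slots_per_host with no argument just before the matlab line would write the allocation into the job's output file.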
This is a good technique if your workers need more memory than one machine can provide.
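The ptile value does not have to be 1. As a rough sketch, assuming the usual LSF meaning of span[ptile=N] (at most N slots taken on each node), the number of nodes a job spreads over is the slot count divided by N, rounded up. The helper name nodes_needed is hypothetical and used only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: nodes needed for a job with <slots> slots (-n)
# when at most <ptile> slots are placed on each node (span[ptile=...]).
nodes_needed() {
    local slots=$1 ptile=$2
    # Integer ceiling of slots / ptile.
    echo $(( (slots + ptile - 1) / ptile ))
}

nodes_needed 4 1   # -n 4 with span[ptile=1]: one worker per node
nodes_needed 4 2   # -n 4 with span[ptile=2]: two workers per node
```

A value such as ptile=2 is a middle ground: each node then has to hold the memory of only two workers instead of all four.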