An example of a parallel job on pegasus2 using the distributed computing toolbox.
We saw how to run a batch MATLAB job on pegasus2 using the bsub command. With small changes, this technique can be modified to run a loop on multiple CPUs using the Distributed Computing Toolbox.
The MATLAB script is modified to use parfor, rather than for, and to open a "matlabpool" of workers. By default, MATLAB will use 12 workers (as of R2013a/b) to run the parfor loop in parallel. This may be fine if your job is not too big. We'll cover memory problems later. Let's call this script dct_example.m
%=====================================================================
% DCT Example: Do nothing in parallel. Print datestamp to show
% how it's going.
%
% Art Gleason July 14, 2014
%=====================================================================
N = 20; % bump this up if you want a longer job with many workers
matlabpool open
parfor(ix=1:N)
ixstamp = sprintf('Iteration %d at %s\n', ix, datestr(now));
disp(ixstamp);
pause(5);
end
matlabpool close
Finally, here is the script we'll submit to bsub to start the job (dct_bsub.job). Make sure #BSUB -n uses the correct number of workers, 12 in this case:
#!/bin/bash
#BSUB -J DCTexample
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -W 00:30
#BSUB -q general
#BSUB -n 12
#
### Run job
# cd not needed if CWD is the right one when this is submitted
# In other words, cd to the dir
# Note: it IS needed to module load matlab before submitting this
matlab < dct_example.m >& dct_example.log
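Since the worker count and the #BSUB -n value have to agree, a quick check before submitting can catch a mismatch. The following is only a sketch: bsub_slots is a hypothetical helper (not an LSF command), and the demonstration runs on a throwaway copy of the directives so that it works without access to the cluster.

```shell
#!/bin/bash
# Hypothetical helper (not part of LSF): print the slot count requested
# by the first "#BSUB -n" line in a job script.
bsub_slots() {
    awk '$1 == "#BSUB" && $2 == "-n" { print $3; exit }' "$1"
}

# Demonstration on a temporary file holding a few of the directives above.
job=$(mktemp)
cat > "$job" <<'EOF'
#!/bin/bash
#BSUB -J DCTexample
#BSUB -q general
#BSUB -n 12
EOF

expected=12   # default matlabpool size quoted in the text for R2013a/b
requested=$(bsub_slots "$job")

if [ "$requested" -eq "$expected" ]; then
    echo "OK: -n $requested matches $expected workers"
else
    echo "Mismatch: -n $requested but $expected workers expected"
fi
rm -f "$job"
```

On the cluster, the same function could be pointed at the real file, for example bsub_slots dct_bsub.job.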
Put these files in the same directory, then cd into that directory. Make sure to "module load matlab". Then, submit the job with:
login4 335% bsub < dct_bsub.job
Job <1084064> is submitted to queue.
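If you want to reuse the job ID in later commands, it can be pulled out of the confirmation line. This is a minimal sketch: extract_jobid is a hypothetical helper, and it is demonstrated on the confirmation text shown above rather than on a live bsub call.

```shell
#!/bin/bash
# Hypothetical helper (not part of LSF): read bsub's confirmation line on
# stdin and print only the digits between the first "<" and ">".
extract_jobid() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Demonstrated on the confirmation line from this example, so it runs
# without LSF installed.
sample='Job <1084064> is submitted to queue.'
jobid=$(printf '%s\n' "$sample" | extract_jobid)
echo "Job ID: $jobid"
```

On the cluster, one could presumably write jobid=$(bsub < dct_bsub.job | extract_jobid) and then pass "$jobid" to bjobs to look at just that job.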
You can check on the status of the job with the bjobs command. Note that in this case we are using 12 CPUs on host n217 (see the EXEC_HOST column). Compare this with the bjobs output when running on one CPU.
login4 336% bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1084064 agleaso RUN   general    login4.pega 12*n217.peg DCTexample Jul 16 12:37
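The EXEC_HOST value packs the slot count and the host name into one field, "12*n217.peg" here. As a sketch of how to split it, assuming that a single-CPU job shows just the host name with no "N*" prefix, the hypothetical helper below prints the slot count followed by the host.

```shell
#!/bin/bash
# Hypothetical helper: split an EXEC_HOST field such as "12*n217.peg"
# into "<slots> <host>". A field without "*" is treated as one slot
# (an assumption about how single-CPU jobs are displayed).
parse_exec_host() {
    local field=$1
    local slots=1
    local host=$field
    case $field in
        *'*'*)
            slots=${field%%\**}   # text before the first "*"
            host=${field#*\*}     # text after the first "*"
            ;;
    esac
    echo "$slots $host"
}

parse_exec_host '12*n217.peg'
parse_exec_host 'n217.peg'
```

Note that bjobs truncates long names in its default layout, so the host part is the shortened "n217.peg" exactly as displayed.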
In this case, lmstat shows we are using the Distributed Computing Toolbox, in addition to MATLAB:
login6 272% which lmstat
lmstat: aliased to /share/opt/MATLAB/R2013a/etc/lmstat -S MLM -c /share/opt/MATLAB/R2013a/licenses/network.lic
login6 285% lmstat
------------------------------------------------------------------
lmstat - Copyright (c) 1989-2012 Flexera Software LLC. All Rights Reserved.
Flexible License Manager status on Wed 7/16/2014 12:37
License server status: 28518@m2
License file(s) on m2: /share/opt/flexlm/server.lic:
m2: license server UP (MASTER) v11.11
Vendor daemon status (on m2):
MLM: UP v11.11
Feature usage info:
Users of MATLAB: (Total of 6901 licenses issued; Total of 1 licenses in use)
"MATLAB" v31, vendor: MLM
floating license
agleason n217 /dev/tty (v29) (m2/28518 3904), start Wed 7/16 12:37
Users of Aerospace_Toolbox: (Total of 6901 licenses issued; Total of 0 licenses in use)
....Many more toolboxes omitted here for space...
Users of Distrib_Computing_Toolbox: (Total of 6901 licenses issued; Total of 1 licenses in use)
"Distrib_Computing_Toolbox" v31, vendor: MLM
floating license
agleason n217 /dev/tty (v29) (m2/28518 5020), start Wed 7/16 12:38
....Many more toolboxes omitted here for space...
What you can see from the above is that the job started on node n217 and that it checked out one copy of MATLAB (see the agleason n217 line under lmstat) AND one copy of the Distrib_Computing_Toolbox on the same node. That's good. If your job uses other toolboxes, you should see a license checked out for each toolbox used.
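With many toolboxes in the listing, it can be easier to filter the lmstat output for the features you care about. The sketch below uses a hypothetical helper, licenses_in_use, that reads lmstat-style text on stdin; it is demonstrated on lines copied from the output above, so it does not need the license server.

```shell
#!/bin/bash
# Hypothetical helper: given a feature name, read lmstat-style output on
# stdin and print the "licenses in use" count from that feature's
# "Users of <feature>:" summary line.
licenses_in_use() {
    local feature=$1
    sed -n "s/^Users of $feature: .*Total of \([0-9][0-9]*\) licenses in use.*/\1/p"
}

# Sample lines copied from the lmstat output in this example.
sample=$(cat <<'EOF'
Users of MATLAB: (Total of 6901 licenses issued; Total of 1 licenses in use)
Users of Aerospace_Toolbox: (Total of 6901 licenses issued; Total of 0 licenses in use)
Users of Distrib_Computing_Toolbox: (Total of 6901 licenses issued; Total of 1 licenses in use)
EOF
)

for feature in MATLAB Aerospace_Toolbox Distrib_Computing_Toolbox; do
    count=$(printf '%s\n' "$sample" | licenses_in_use "$feature")
    echo "$feature: $count in use"
done
```

On the cluster, the sample text would be replaced by the real lmstat output piped into the function.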
When the job finishes, it will disappear from the bjobs listing. You should find a file "dct_example.log" that looks like this:
login4 350% cat dct_example.log
Warning: No display specified. You will not be able to display graphics on the screen.
Warning: No window system found. Java option 'MWT' ignored.
< M A T L A B (R) >
Copyright 1984-2013 The MathWorks, Inc.
R2013a (8.1.0.604) 64-bit (glnxa64)
February 15, 2013
No window system found. Java option 'MWT' ignored.
To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.
>> >> >> >> >> >> >> >> >> Starting matlabpool using the 'local' profile ... Warning: Found 6 pre-existing communicating job(s) created by matlabpool that
are running. You can use 'matlabpool close force local' to remove all jobs
created by matlabpool.
> In InteractiveClient>InteractiveClient.pRemoveOldJobs at 426
In InteractiveClient>InteractiveClient.start at 260
In MatlabpoolHelper>MatlabpoolHelper.doOpen at 363
In MatlabpoolHelper>MatlabpoolHelper.doMatlabpool at 137
In matlabpool at 139
connected to 12 workers.
>> Iteration 9 at 16-Jul-2014 12:42:06
Iteration 16 at 16-Jul-2014 12:42:06
Iteration 15 at 16-Jul-2014 12:42:06
Iteration 13 at 16-Jul-2014 12:42:06
Iteration 12 at 16-Jul-2014 12:42:06
Iteration 11 at 16-Jul-2014 12:42:06
Iteration 8 at 16-Jul-2014 12:42:06
Iteration 10 at 16-Jul-2014 12:42:06
Iteration 6 at 16-Jul-2014 12:42:06
Iteration 14 at 16-Jul-2014 12:42:06
Iteration 4 at 16-Jul-2014 12:42:06
Iteration 2 at 16-Jul-2014 12:42:06
Iteration 7 at 16-Jul-2014 12:42:11
Iteration 3 at 16-Jul-2014 12:42:11
Iteration 1 at 16-Jul-2014 12:42:11
Iteration 17 at 16-Jul-2014 12:42:11
Iteration 5 at 16-Jul-2014 12:42:11
Iteration 18 at 16-Jul-2014 12:42:11
Iteration 19 at 16-Jul-2014 12:42:11
Iteration 20 at 16-Jul-2014 12:42:11
>> Sending a stop signal to all the workers ... stopped.
>> >> login4 351%
Note how the iterations (a) do NOT run in sequential order and (b) finish in groups of 12, each group being 5 seconds apart. Actually, since we are only doing 20 iterations, there is one group of 12 and one group of 8. Compare this with the output using either of the other two methods.
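The grouping can be checked with a little arithmetic and a tally of the log. This is an idealized sketch: it assumes every worker takes one iteration at a time and all iterations take equally long, which holds for this pause(5) example but not necessarily for real workloads. The helper name tally is hypothetical, and the log lines are copied from the output above.

```shell
#!/bin/bash
N=20         # iterations in dct_example.m
workers=12   # size of the matlabpool

# Idealized prediction: full groups of $workers, then whatever is left.
batches=$(( (N + workers - 1) / workers ))   # ceiling of N / workers
last=$(( N - (batches - 1) * workers ))      # size of the final group
echo "predicted: $batches groups, last group has $last iterations"

# Hypothetical helper: count "Iteration" lines per time of day, where the
# time is the last field of each line.
tally() {
    grep 'Iteration [0-9]* at ' | awk '{ print $NF }' | sort | uniq -c |
        awk '{ print $2, $1 }'
}

# Iteration lines copied from dct_example.log above.
log=$(cat <<'EOF'
>> Iteration 9 at 16-Jul-2014 12:42:06
Iteration 16 at 16-Jul-2014 12:42:06
Iteration 15 at 16-Jul-2014 12:42:06
Iteration 13 at 16-Jul-2014 12:42:06
Iteration 12 at 16-Jul-2014 12:42:06
Iteration 11 at 16-Jul-2014 12:42:06
Iteration 8 at 16-Jul-2014 12:42:06
Iteration 10 at 16-Jul-2014 12:42:06
Iteration 6 at 16-Jul-2014 12:42:06
Iteration 14 at 16-Jul-2014 12:42:06
Iteration 4 at 16-Jul-2014 12:42:06
Iteration 2 at 16-Jul-2014 12:42:06
Iteration 7 at 16-Jul-2014 12:42:11
Iteration 3 at 16-Jul-2014 12:42:11
Iteration 1 at 16-Jul-2014 12:42:11
Iteration 17 at 16-Jul-2014 12:42:11
Iteration 5 at 16-Jul-2014 12:42:11
Iteration 18 at 16-Jul-2014 12:42:11
Iteration 19 at 16-Jul-2014 12:42:11
Iteration 20 at 16-Jul-2014 12:42:11
EOF
)

printf '%s\n' "$log" | tally
```

The tally lists each timestamp with its iteration count, which should agree with the predicted group sizes. On the cluster, tally < dct_example.log would do the same for the real log file.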