An example of a parallel job on pegasus2 using the Distributed Computing Engine.

We saw how to run a batch matlab job on pegasus2 using the bsub command, and how to run a parallel matlab job on pegasus2 using the Distributed Computing Toolbox (DCT). With small changes, these techniques can be modified to run a loop on multiple CPUs using the Distributed Computing Engine (DCE).

We use almost the same script as in the DCT example. The differences are that we do NOT have to start a matlabpool and we do NOT use bsub directly; instead, we submit the job from within matlab. Here is the script (dce_example.m):

%=====================================================================
% DCE Example: Do nothing in parallel. Print datestamp to show
% how it's going.
% Same as DCT example except no need for matlabpool
%
% Art Gleason July 14, 2014
%=====================================================================
N = 20;       % bump this up if you want a longer job with many workers

parfor(ix=1:N)
  ixstamp = sprintf('Iteration %d at %s\n', ix, datestr(now));
  disp(ixstamp);
  pause(5);
end

And here is how to start it from within matlab:

%=====================================================================
% DCE Example: submit dce_example via "batch" function
%
% Art Gleason July 14, 2014
%=====================================================================

%---Find the scheduler (findResource is deprecated and gives a warning, but works for now)---
sched = findResource('scheduler', 'type', 'LSF');

%---Set the number of CPUs to use with the -n flag (20 here)---
set(sched, 'SubmitArguments', '-q general -n 20');

%---Submit the job; the pool size is Nworkers-1 because one worker runs the script itself---
job = batch(sched, 'dce_example', 'matlabpool', 19)
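
Incidentally, since findResource is deprecated, newer releases offer parcluster as the supported replacement. Here is a minimal sketch of the same submission, assuming an LSF cluster profile (hypothetically named 'pegasusLSF' here) has been set up for your account:

%---Equivalent submission with the newer parcluster interface (sketch)---
% NOTE: 'pegasusLSF' is an assumed profile name -- use whatever LSF
% profile is configured on your account.
c = parcluster('pegasusLSF');
c.SubmitArguments = '-q general -n 20';    % same LSF flags as above
job = batch(c, 'dce_example', 'matlabpool', 19)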

You can check on the status of the job with the bjobs command:

login6 393% bjobs
JOBID     USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1084113   agleaso RUN   general    login4.pega 16*n295.peg Job2       Jul 16 14:00
                                               4*n078.pegasus.edu
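
For more detail on a specific job (the hosts it landed on, resource usage, and so on), you can pass the job ID to bjobs with the -l flag:

bjobs -l 1084113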

In this case, the scheduler assigned 16 CPUs from n295 and 4 from n078. Furthermore, lmstat shows we are using the Distributed Computing Engine in addition to MATLAB:

login6 272% which lmstat
lmstat: 	 aliased to /share/opt/MATLAB/R2013a/etc/lmstat -S MLM -c /share/opt/MATLAB/R2013a/licenses/network.lic

login6 392% lmstat
------------------------------------------------------------------
lmstat - Copyright (c) 1989-2012 Flexera Software LLC. All Rights Reserved.
Flexible License Manager status on Wed 7/16/2014 14:00

License server status: 28518@m2
    License file(s) on m2: /share/opt/flexlm/server.lic:

        m2: license server UP (MASTER) v11.11

Vendor daemon status (on m2):

       MLM: UP v11.11
Feature usage info:

Users of MATLAB:  (Total of 6901 licenses issued;  Total of 1 licenses in use)

  "MATLAB" v31, vendor: MLM
  floating license

    agleason login4 /dev/pts/12 (v29) (m2/28518 111), start Wed 7/16 13:13

Users of SIMULINK:  (Total of 6901 licenses issued;  Total of 0 licenses in use)

Users of Aerospace_Blockset:  (Total of 6901 licenses issued;  Total of 0 licenses in use)

...Many toolboxes omitted for space here....

Users of MATLAB_Distrib_Comp_Engine:  (Total of 32 licenses issued;  Total of 20 licenses in use)

  "MATLAB_Distrib_Comp_Engine" v29, vendor: MLM
  floating license

    agleason n078 /dev/tty (v29) (m2/28518 607), start Wed 7/16 14:00
    agleason n078 /dev/tty (v29) (m2/28518 1307), start Wed 7/16 14:00
    agleason n078 /dev/tty (v29) (m2/28518 4105), start Wed 7/16 14:00
    agleason n078 /dev/tty (v29) (m2/28518 3205), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 2907), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 2605), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 1605), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 705), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 405), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 1405), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 1905), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 2205), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 3805), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 2307), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 5307), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 5407), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 5206), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 2107), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 3605), start Wed 7/16 14:00
    agleason n295 /dev/tty (v29) (m2/28518 905), start Wed 7/16 14:00

The lmstat output looks different when you use the DCE vs. the other two methods. Specifically, the MATLAB feature shows my interactive session on a login node (login4 in the listing above), whereas the DCE workers are running on n078 and n295. Also, one DCE license is checked out for each worker I asked for (in this case 20). Once the job has started, I CAN (and should, if it is a long job) quit matlab on the head node, and can even log out of the head node entirely; the job will continue to run.
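
If you do quit matlab and come back later, you can reattach to the job from a fresh session. Here is a minimal sketch using the same (deprecated) findResource interface; the ID value is just this example's job ID:

%---Reattach to a previously submitted job from a new matlab session---
sched = findResource('scheduler', 'type', 'LSF');
job = findJob(sched, 'ID', 2);     % ID 2 is this example's job
get(job, 'State')                  % e.g. 'running' or 'finished'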

When the job finishes, it will disappear from the bjobs listing. You can check the status of the job from within matlab with "get(job)", for example:

>> get(job)
             Configuration: ''
                      Name: 'dce_example'
                        ID: 2
                  UserName: 'agleason'
                       Tag: 'Created_by_batch'
                     State: 'finished'
                CreateTime: 'Wed Jul 16 13:00:13 GMT-05:00 2014'
                SubmitTime: 'Wed Jul 16 13:00:14 GMT-05:00 2014'
                 StartTime: 'Wed Jul 16 18:01:05 GMT 2014'
                FinishTime: 'Wed Jul 16 18:01:21 GMT 2014'
                     Tasks: [20x1 distcomp.simpletask]
          FileDependencies: {'/nethome/agleason/queue_test/dce_example.m'}
          PathDependencies: {0x1 cell}
                   JobData: []
                    Parent: [1x1 distcomp.lsfscheduler]
                  UserData: []
    MaximumNumberOfWorkers: 20
    MinimumNumberOfWorkers: 20
                      Task: [1x1 distcomp.simpletask]
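
Once State is 'finished', you can fetch the output from within matlab rather than digging through files on disk. A quick sketch using the batch-job methods from this same (old) API:

%---Retrieve output from a finished batch job---
diary(job)       % print everything the script displayed
load(job)        % load the script's variables into this workspace
destroy(job)     % optional: delete the job's files when done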

Note in this case the job ID is '2' (I had run another job before this one). If you look in the directory where you started matlab, you'll find a bunch of files starting with 'Job2':

login6 394% ls
bsub_bsub.job*       dct_example.log      Job2.diary.txt 
bsub_example.log     dct_example.m        Job2.in.mat 
bsub_example.m       dct_example_nwkr.m   Job2.jobout.mat 
dce_bsub.m           donothing.m          Job2.out.mat 
dce_example.m        Job2/                Job2.state.mat 
dct_bsub.job*        Job2.common.mat      matlab_metadata.mat 

Most of these files are internal matlab data, but some are useful for debugging. Job2/Job2.log has information about the workers and the scheduler, and typically contains errors if there were any problems starting up (such as license errors). The Job2/Task*.diary.txt files contain the printed output. For example, in this case, Job2/Task1.diary.txt contains:

login6 401% cat Job2/Task1.diary.txt 
Iteration 5 at 16-Jul-2014 14:01:09

Iteration 17 at 16-Jul-2014 14:01:09

Iteration 4 at 16-Jul-2014 14:01:09

Iteration 10 at 16-Jul-2014 14:01:09

Iteration 19 at 16-Jul-2014 14:01:09

Iteration 6 at 16-Jul-2014 14:01:09

Iteration 11 at 16-Jul-2014 14:01:09

Iteration 18 at 16-Jul-2014 14:01:09

Iteration 7 at 16-Jul-2014 14:01:09

Iteration 15 at 16-Jul-2014 14:01:09

Iteration 12 at 16-Jul-2014 14:01:09

Iteration 16 at 16-Jul-2014 14:01:09

Iteration 3 at 16-Jul-2014 14:01:09

Iteration 14 at 16-Jul-2014 14:01:09

Iteration 9 at 16-Jul-2014 14:01:09

Iteration 13 at 16-Jul-2014 14:01:09

Iteration 8 at 16-Jul-2014 14:01:09

Iteration 2 at 16-Jul-2014 14:01:09

Iteration 1 at 16-Jul-2014 14:01:09

Iteration 20 at 16-Jul-2014 14:01:14

Note how the iterations are (a) NOT sequential and (b) almost all executed at the same time. With this batch/'matlabpool' style of submission, if you request X CPUs the DCE uses X-1 of them as parfor workers and one as a sort of master that runs the script itself, which is why we passed 'matlabpool',19 after requesting -n 20. In this case X=20, and the timestamps show 19 iterations running at once, with the final iteration starting only after the 5-second pause freed up a worker. Compare this with the output using either of the other two methods.
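
If you want to confirm the pool size from inside the job, you could add a line like the following near the top of dce_example.m (a sketch using the R2013a-era matlabpool call; it is not part of the original script):

% Print how many parfor workers this batch job actually has
fprintf('Pool size inside the job: %d\n', matlabpool('size'));    % 19 here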