qlaunching dependent jobs

Hello,

I've just gone through the queue tutorial and am thinking about how to apply it to my use cases. I'd like to launch dependent jobs, similar to the "Fibonacci Adder" example (...but more computationally expensive...), but using a queue system. To properly run these kinds of jobs, do I need to be running some variation of:

qlaunch rapidfire --nlaunches infinite

If so, do you generally have a screen/tmux session open on a login node running this command? Or do you simply background the task? Is there another way? It just seems a little weird to be continually running this kind of administration-like task as a normal user on a login node (the particular login node that qlaunch is running on could go down, for example, while the rest of the system is fine), so I feel like I'm missing something.

Thank you,

Derek

Hi Derek,

At our computing center we use a cron job to run qlaunch periodically. However, you can also try the remote qlaunch option (search for it below):

http://pythonhosted.org//FireWorks/queue_tutorial.html

That option will (in most cases) let you execute the qlaunch command from a local machine rather than staying logged into a remote machine. For example, if you can keep your local machine alive, you can execute the commands remotely from there.
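To illustrate the cron approach, an entry along these lines (all paths are hypothetical; adjust the interval and config directory to your setup) would fire qlaunch periodically:

```shell
# Hypothetical crontab entry (edit with `crontab -e`):
# every 15 minutes, cd to the directory holding my_launchpad.yaml /
# my_qadapter.yaml and run qlaunch, appending output to a log file.
*/15 * * * * cd /path/to/fw_config && /path/to/bin/qlaunch rapidfire >> /path/to/qlaunch_cron.log 2>&1
```

Since cron jobs run with a minimal environment, using absolute paths to the qlaunch binary and the config directory avoids most PATH-related surprises.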

Note that at this point, there is no background service that runs at the computing center nor are there plans to include one. However, there are things in development regarding better managing jobs across several clusters from a centralized location, in a similar spirit to remote qlaunch.

Best,

Anubhav

···

On Thu, Feb 5, 2015 at 7:05 PM, Derek [email protected] wrote:


Hi Anubhav,

Thanks. I'll try cron jobs for now. If that becomes problematic, I'll try remote qlaunch.

Again, thanks for the pointers,

Derek

···

On Thu, 5 Feb 2015, Anubhav Jain wrote:


Hello,

One of the ways I've been building up and testing out a pipeline in Fireworks is to submit an entire workflow, then add a new FireWork to the workflow (for the next step in the pipeline), then re-submit the entire workflow. With each FireWork specifying DupeFinderExact as a dupefinder, this works pretty well--only the most recent addition to the pipeline gets executed. However, if I have to go back and tweak an earlier portion of the pipeline, none of the dependent steps get executed.

Is there a way to create a dupefinder that executes dependent FireWorks, even if those FireWorks appear to have specs that haven't changed (e.g., the filenames might not have changed, but the file contents may have)? Alternatively, is there a better way to go about building up a pipeline?
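For reference, the kind of thing I have in mind is embedding a checksum of the input file contents in each spec, so that an exact-match dupefinder would see changed contents as a distinct job. A minimal sketch (this is my own helper, not an existing FireWorks utility):

```python
import hashlib

def content_fingerprint(path):
    """Return a SHA-256 hex digest of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage: include the digest in the Firework spec so that
# DupeFinderExact treats changed file contents as a new, non-duplicate job:
# spec = {"input_file": path, "input_sha256": content_fingerprint(path)}
```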

Thanks,

Derek

Hi Derek,

Before getting too far: have you considered just executing the “lpad rerun_fws” command starting with the changed job? If you rerun a firework, it and all of its children will get rerun, regardless of what dupefinder was there before. See more details here:

http://pythonhosted.org//FireWorks/rerun_tutorial.html

Let me know whether that provides a solution for you. Reruns can also be done programmatically as long as you have an instance of the LaunchPad object.
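To make the rerun semantics concrete, here is a toy sketch in plain Python (not the FireWorks internals, and the dependency graph is hypothetical) of how marking one firework for rerun propagates to every descendant:

```python
def fws_to_rerun(changed_fw, children):
    """Toy model of 'lpad rerun_fws': rerunning a firework also marks
    all of its descendants, regardless of dupefinder settings.
    `children` maps fw_id -> list of child fw_ids."""
    marked, stack = set(), [changed_fw]
    while stack:
        fw = stack.pop()
        if fw in marked:
            continue
        marked.add(fw)
        stack.extend(children.get(fw, []))
    return marked

# Hypothetical pipeline: 1 feeds 2, which feeds both 3 and 4.
graph = {1: [2], 2: [3, 4]}
```

Rerunning the middle step (fw_id 2) would mark 2, 3, and 4, while the unchanged ancestor 1 is left alone.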

Best,

Anubhav

···

On Wed, Feb 11, 2015 at 3:01 PM, Derek [email protected] wrote:


Oh good point, that is pretty much exactly what I want to do in this case. Sorry for overlooking it.

Thanks,

Derek

···

On Wed, 11 Feb 2015, Anubhav Jain wrote:


Hello,

Over the weekend I submitted just over 500 jobs to Fireworks (this is the largest pipeline I've tried to date) and executed them using:

qlaunch -r rapidfire --nlaunches infinite --sleep 60 --maxjobs_queue 50

All but 6 of them completed successfully, and I'm trying to figure out what happened with those 6. If I try "qlaunch" or "rlaunch", neither command recognizes that there are a few jobs left to complete. For example:

$ qlaunch -r singleshot
2015-09-14 10:47:45,974 INFO No jobs exist in the LaunchPad for submission to queue!

Here are some (hopefully) relevant details. I'm happy to provide more.

$ lpad get_fws -d count
514
$ lpad get_fws -s COMPLETED -d count
508
$ lpad get_fws -s WAITING -d count
6
$ lpad get_fws -s FIZZLED -d count
0

Which Fireworks are waiting?

$ lpad get_fws -s WAITING | grep fw_id | sort
"fw_id": 32,
"fw_id": 33,
"fw_id": 34,
"fw_id": 35,
"fw_id": 36,
"fw_id": 37,

What was the Firework with fw_id 31, and what happened to it?

$ lpad get_fws -i 31 -d more
[See https://gist.github.com/anonymous/10ea08044f574d190625]

Looking in the launch directory for fw_id 31 (and looking at the yaml file I used to submit my workflow), I know that the Firework with fw_id 31 should be (as far as I can tell) the only dependency for the Firework with fw_id 32.
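For context, my understanding of the dependency rule (a toy sketch in plain Python, not the FireWorks internals, with hypothetical ids and states) is that a firework should become runnable as soon as all of its parents are COMPLETED:

```python
def is_ready(fw_id, parents, states):
    """Toy version of the readiness rule: a firework can run only
    when every one of its parents is COMPLETED.
    `parents` maps fw_id -> list of parent fw_ids (hypothetical data)."""
    return all(states[p] == "COMPLETED" for p in parents.get(fw_id, []))

# Hypothetical graph matching my situation: fw 32 depends only on fw 31.
parents = {32: [31]}
```

By that rule, once fw_id 31 is COMPLETED, fw_id 32 should no longer be WAITING, which is why its state confuses me.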

Was a launch directory ever created for the Firework with fw_id 32? It appears not:

$ grep -rl '"fw_id": 32,' ./*

(the same is true for the other "waiting" Fireworks)

If I try to rerun these Fireworks, still no luck:

$ lpad rerun_fws -i 32
2015-09-14 10:50:56,652 INFO Finished setting 1 FWs to rerun

$ qlaunch -r singleshot
2015-09-14 10:52:37,092 INFO No jobs exist in the LaunchPad for submission to queue!

Is there anything else I should try/check/examine?

Thank you,

Derek