I have two solutions for you.
The one requiring the least change from your current setup is to establish an exterior directory (outside of the launch dirs) to hold all the data for a set of runs (i.e., one complete MD simulation), and store these directory paths somewhere in the Fireworks’ specs so you can look them up later if needed. In your bash script, after a checkpoint is made, you could make a directory specific to this set of jobs (if it doesn’t already exist), copy the checkpoint data there, make a queue submission, and so on. Then, when your MD sim finishes completely, have your bash script consolidate the data in this exterior directory into a format you can easily read.
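The consolidation step at the end could be a small script. Here is a rough Python sketch, assuming one subdirectory per checkpoint (checkpoint_1, checkpoint_2, ...) in the exterior directory, each holding an XDATCAR fragment - the directory layout, names, and `consolidate_run` function are just placeholders for whatever your setup actually uses:

```python
import os

def consolidate_run(run_dir, output_file="XDATCAR_full"):
    """Concatenate per-checkpoint trajectory fragments, in checkpoint order.

    Assumes run_dir contains subdirectories named checkpoint_<N>, each with
    an XDATCAR fragment from one segment of the MD run (hypothetical layout).
    """
    # sort numerically so checkpoint_10 comes after checkpoint_2
    checkpoints = sorted(
        (d for d in os.listdir(run_dir) if d.startswith("checkpoint_")),
        key=lambda d: int(d.split("_")[-1]),
    )
    with open(os.path.join(run_dir, output_file), "w") as out:
        for cp in checkpoints:
            with open(os.path.join(run_dir, cp, "XDATCAR")) as frag:
                out.write(frag.read())
```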
The more maintainable solution is using dynamic workflows.
One way to implement this is with a larger workflow. If your runs right now are each just one Firework (let’s call it VASP_FW), your dynamic workflow might look like this:
VASP_FW1 - Runs, realizes job won’t finish in time. Checkpoints, dynamically adds new FW (VASP_FW2)
VASP_FW2 - Runs, realizes job won’t finish in time. Checkpoints, dynamically adds new FW (VASP_FW3)
… (process repeats)
VASP_FW_N - Runs, job finishes. Consolidates all the data from Fireworks VASP_FW(1 thru N) into the launch_dir for this Firework, so you have all the checkpoint data in one place (the launch_dir of the final FW).
This scheme will probably require you to write custom Firetasks (see here and here for more info), if you are not already doing so. The main con is some added complexity, but the pro is that once it is figured out you will have much more flexibility. You can add new Fireworks to the workflow (through the “additions” argument to the FWAction object returned at the end of run_task in whatever Firetask you use to run your MD), and you can pass information to subsequent Fireworks, e.g. the directories of past checkpoints (either through the new FW’s spec, through the file-passing interface (files_in and files_out), or through the “mod_spec” or “update_spec” arguments to FWAction). Another added perk is that you will have one workflow for an entire MD run, rather than a bunch of separate Fireworks.
The Python pseudocode for your Firetask and Firework(s) could look something like:
from fireworks import Firework, FWAction, LaunchPad, Workflow
from fireworks.core.firework import FireTaskBase
from fireworks.utilities.fw_utilities import explicit_serialize

@explicit_serialize
class RunMDDynamicTask(FireTaskBase):
    def run_task(self, fw_spec):
        prev_checkpoint_dirs = fw_spec.get("checkpoint_dirs", [])
        # run commands for VASP MD, checking walltime, creating checkpoint, etc.
        ...
        if job_finished:
            consolidate_checkpoints_to_this_dir(prev_checkpoint_dirs)
            return FWAction()
        else:
            # append this run's checkpoint dir to prev_checkpoint_dirs before
            # passing it on; any other params the next FW needs go in spec too
            new_fw = Firework(RunMDDynamicTask(),
                              spec={"checkpoint_dirs": prev_checkpoint_dirs})
            return FWAction(additions=[new_fw])

if __name__ == "__main__":
    launchpad = LaunchPad.auto_load()
    vasp_fw1 = Firework(RunMDDynamicTask())
    wf = Workflow([vasp_fw1], name="MD Run for System Z")
    launchpad.add_wf(wf)
You’ll notice there is no queue submission in the above workflow description. This is because I’d recommend having a cron job make queue submissions for you automatically (e.g., every 12 hours), completely separate from the operation of the workflow above - mixing workflow execution and queue submission tends to be confusing, for me at least. By having crontab submit your jobs automatically, as soon as one of your Fireworks finishes and the next one is “READY”, a queue submission you made previously will pull and run the next job. While much faster than waiting around for old jobs to finish before making queue submissions for new jobs, it will not preserve the job ID AFAIK (I’m not sure why that would be needed, though?)
If you prefer not to do that, I guess you could just add a command for submitting to the queue inside the else block of the above Firetask - i.e., “if the job is not finished, submit to the queue with job ID X and add another FW to the workflow”; I’ve never done this though, so it could wind up causing some goofy behavior.
Thanks,
Alex
On Sunday, March 31, 2019 at 4:35:17 PM UTC-7, n…@berkeley.edu wrote:
Hello all,
I often run long jobs, sometimes exceeding the max queue time, or close to it. The best way for me to deal with this in a pure bash (with SLURM) setting is to use checkpointing:
(1) Submit job for short time (e.g. 5 hours)
(2) Near the end of the job, detect that the first walltime is almost reached. Gracefully stop the job, and tell SLURM to requeue the job **with the same job ID.**
(3) Allow this checkpoint, requeue, run, checkpoint… process to proceed until an overall time is reached (e.g. 100 hours)
In bash this is useful for very long jobs that cannot be submitted on a single submission, especially for higher throughput as the shorter jobs can be queued up faster.
The question is: how can this process best be integrated with FireWorks? My current solution is to have the bash script for requeueing the job inside my my_qadapter.yaml config file. This is a decent solution, but it fails because the checkpointing will either overwrite the files from the old checkpoint when a new checkpoint starts (which loses data about the run), or it will have to store the data from each checkpoint in a separate folder (e.g., launcher/checkpoint_1). If they are sent to a different folder, then the data is preserved, but you have to spend time assembling all that data into a single location (e.g., in VASP MD runs, 10 checkpoint folders could contain XDATCAR files that need to be assembled in the main launch directory, so FireWorks has a link to a file with the full trajectory, not just the trajectory of the most recent checkpoint).
Any thoughts on how to deal with this type of issue?
-Nick