qlaunch not handling qsub errors?

We’re using qlaunch in reserve offline mode to poll for tasks and submit PBS jobs to OLCF systems. Our setup works when everything is “correct”, but if a user specifies an invalid PBS parameter (such as an expired or wrong account), qsub fails and returns a non-zero exit code. When that happens, qlaunch catches the error and logs it, but keeps trying to schedule the job at each polling interval. Since this is a non-recoverable error, it seems that qlaunch should mark the task as “fizzled”. Is there an easy way to make qlaunch behave this way?

Also, I’ve noticed that when an error of this type occurs, the “lpad recover_offline” command reports more and more jobs each time it’s run - rather than just one job that has not yet finished. This may be related to the fact that even though qlaunch is trying to submit the job, it is still shown as “READY” in the MongoDB (instead of the expected “RESERVED”).

Thanks,

Dale

Hi Dale

If the qlaunch process itself fails, then the job never “launched”. In this case, I would not mark the job as FIZZLED - that is what happens when a job actually launches, but fails to complete successfully.

Instead, what I would suggest is that if qlaunch has an error:

  • to stop polling and exit the qlaunch process

  • to make sure that FWS doesn’t think the job was submitted, i.e., the recover_offline command should not see this (unsubmitted) job

What do you think?

Best,

Anubhav

(In reply to Dale’s message of Friday, August 19, 2016 at 10:28:45 AM UTC-7.)

Hey Anubhav,

I think our main need in this sort of event is to get some kind of error information to users, especially if the error is due to bad job params. Not having the job resubmit would make sense, but we wouldn’t want to stop qlaunch, as there may be other (unrelated) jobs that could be run. (It’s a bit of a pain for our users to start the qlaunch daemon.) Instead of marking the job fizzled, would it be possible to annotate the Firework task description somehow (with error info)? We’re using the category field to direct tasks to specific machines (titan, rhea, etc.), so if we could change that field, qlaunch would no longer “see” the ready task that failed. Just some ideas… we could also just validate all of the users’ PBS parameters beforehand to try to avoid these kinds of errors in the first place.

Thanks,

Dale

Hi Dale,

I think I understand your use case, but I don’t think we can support that at this time. Keeping track of where and why a queue submission failed for a job would add a lot of complexity to FWS (which would then have to be maintained indefinitely), all to work around a problem with the user’s setup.

I would be supportive of some code to validate some PBS parameters before trying to submit the script if you have any ideas on how to go about that.
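As a rough illustration of what such pre-submission validation could look like, here is a minimal Python sketch. It is not part of FireWorks: the function name `validate_pbs_params`, the parameter keys (`account`, `walltime`, `nodes`), and the idea of passing in a set of currently valid accounts are all assumptions made for the example, and a real check would need to follow the site's own PBS rules.

```python
import re

# Hypothetical walltime format (HH:MM:SS); adjust to your site's PBS rules.
WALLTIME_RE = re.compile(r"^\d{1,3}:[0-5]\d:[0-5]\d$")


def validate_pbs_params(params, valid_accounts):
    """Return a list of human-readable problems; an empty list means OK.

    `params` is a plain dict with assumed keys "account", "walltime"
    and "nodes". `valid_accounts` is a set of account names the caller
    believes are currently usable.
    """
    problems = []

    account = params.get("account")
    if not account:
        problems.append("missing account")
    elif account not in valid_accounts:
        problems.append(f"unknown or expired account: {account!r}")

    walltime = params.get("walltime")
    if walltime is None or not WALLTIME_RE.match(str(walltime)):
        problems.append(f"walltime must look like HH:MM:SS, got {walltime!r}")

    nodes = params.get("nodes")
    # bool is a subclass of int, so reject it explicitly.
    if not isinstance(nodes, int) or isinstance(nodes, bool) or nodes < 1:
        problems.append(f"nodes must be a positive integer, got {nodes!r}")

    return problems
```

Returning a list of messages rather than raising on the first problem means all of the issues can be reported to the user at once, before any job is handed to qsub.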

Finally, I tried to examine the code because I wanted to adjust things so that (i) qlaunch would exit if it has trouble submitting a job, rather than continuing to poll and try to submit things and (ii) to not have a job show up in “recover_offline” if the queue submission failed. But, when inspecting the code, it looks like these things should already be taken care of:

(i) if the job submission failed, the code should have raised a RuntimeError saying “Launch unsuccessful” which should have quit out of the qlaunch script. Could you let me know exactly (a) what command you are using for qlaunch and (b) what is the content of your error log? The only problem I can see is if you are running in remote / daemon mode (neither of which I use personally).
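The general pattern described in (i), turning a non-zero exit code from the submit command into a RuntimeError, can be sketched in Python as follows. This is an illustration only and not the actual FireWorks code: the helper name `submit_or_raise` is hypothetical, and the example uses the Python interpreter as a stand-in for qsub so that it runs anywhere.

```python
import subprocess
import sys


def submit_or_raise(cmd):
    """Run a submit command and return its stdout (e.g. a job id).

    Raises RuntimeError if the command exits with a non-zero code,
    including the command's stderr so the user can see why it failed.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"Launch unsuccessful (exit code {result.returncode}): "
            f"{result.stderr.strip()}"
        )
    return result.stdout.strip()


if __name__ == "__main__":
    # Stand-in for a failing qsub call: exits with code 1 and an error message.
    failing = [sys.executable, "-c",
               "import sys; sys.stderr.write('invalid account'); sys.exit(1)"]
    try:
        submit_or_raise(failing)
    except RuntimeError as err:
        print(err)
```

Whether the caller lets this exception end the polling loop or catches it and moves on to the next job is a separate design decision, which is the point under discussion in this thread.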

(ii) The job state of READY is the correct state for a queue submission error. The job itself is still ready to go, and I purposely rolled back the job state to reflect this. However, the job’s entry in the list of offline runs in the database needed to be removed in order to prevent the “recover_offline” command from searching for these jobs. I just pushed a patch for this in FW1.3.5. Note that for older runs, you will need to use the “lpad forget_offline” command to manually forget the affected FWS. Sorry about that.
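The behavior described in (ii) can be pictured with a toy in-memory model. This is only a sketch of the idea and not how FireWorks stores anything: the function `reserve_and_submit`, the `states` dict, and the `offline_runs` set are all hypothetical stand-ins for the real database records.

```python
def reserve_and_submit(fw_id, states, offline_runs, submit):
    """Reserve a job, try to submit it, and undo everything on failure.

    `states` maps job id -> state string, `offline_runs` is a set of job
    ids tracked for offline recovery, and `submit` is any callable that
    raises RuntimeError when the queue submission fails.
    """
    states[fw_id] = "RESERVED"
    offline_runs.add(fw_id)
    try:
        return submit(fw_id)
    except RuntimeError:
        # The job never launched: roll back to READY and drop the
        # offline-run entry so recovery does not keep looking for it.
        states[fw_id] = "READY"
        offline_runs.discard(fw_id)
        raise
```

In this model, forgetting to remove the id from `offline_runs` in the failure branch would leave one stale entry per failed attempt, which matches the symptom of recover_offline reporting more and more jobs.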

Best,

Anubhav

(In reply to Dale’s message of Thursday, August 25, 2016 at 12:34:34 PM UTC-7.)