Termination of some fireworks is not detected

Hi,

I have a problem with certain FWs that have completed but still appear as RUNNING. It happens when the tasks run on the nodes of our HPC facility, but not when I run them in an interactive bash session.

I was wondering how this can happen. Could it be a connection problem between the nodes and the MongoDB server? (But in that case the task could not have been launched in the first place.)

Furthermore, it happens very often with a single PyTask that returns an FWAction object: return FWAction(update_spec={'bands': bands}). Could this be the cause?

Is there a way to explicitly tell the LaunchPad that the FW has completed, through the LaunchPad object? And possibly raise an exception if the LaunchPad does not answer?

Best regards,
David

Hi David,

To be honest I am not sure what might be happening here.

The only edge case I can think of is that your job completes, but hits the walltime while FWS is communicating with the database to update the state to COMPLETED (which happens after the job itself finishes). Typically, that database communication would only take a few seconds at most, so the chances of your job completing but then hitting the walltime during the FWS update would be quite small.

One thing you could do to try to debug would be to see how long the database communication might be taking. For example, pick a COMPLETED job and examine its Launch object in the database, particularly the timestamps that show when the job started RUNNING and when it was tagged as COMPLETED. Then compare that interval to the actual or expected runtime of your job (if you have that somewhere). If there is a big discrepancy, it could be an indicator that the database is taking way too long to update for your job, and that it is hitting the walltime in the middle of the update.
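Something like the following rough sketch should print those timestamps (I'm assuming your standard my_launchpad.yaml setup; the launch_id value is a placeholder you would look up yourself):

from fireworks import LaunchPad

lpad = LaunchPad.auto_load()           # reads your my_launchpad.yaml
launch = lpad.get_launch_by_id(123)    # placeholder launch_id

print("state:   ", launch.state)
print("started: ", launch.time_start)  # when the launch went RUNNING
print("ended:   ", launch.time_end)    # when it was marked COMPLETED/FIZZLED
# Compare (time_end - time_start) with the runtime your queue system reports;
# a large gap points at a slow database update.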

Do you happen to have very large workflows (e.g. 1000 FWS or more)? I could see this perhaps being a bigger problem as the workflows get larger, although I think we have done a lot recently to speed up database updates of large workflows.

Note that there is no supported way to explicitly mark a FW as completed. If you are desperate and willing to take the risk, you could try manually calling the LaunchPad.complete_launch() method, but I wouldn't really recommend this and suggest you fix the underlying problem instead.
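For reference only, and with the caveat above, a manual call would look roughly like this (the launch_id is a placeholder, and the empty FWAction means no downstream updates are applied):

from fireworks import FWAction, LaunchPad

lpad = LaunchPad.auto_load()
# Forcibly mark a launch as COMPLETED -- use only as a last resort
lpad.complete_launch(123, action=FWAction(), state='COMPLETED')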

Also, if you have the Launch object (e.g. the JSON document from MongoDB) for one of the launches that is stuck in such a state, perhaps you could attach that JSON here.

Hi Anubhav,

Thanks for your answer. I finally found what was responsible for this behaviour.

After inlining all the code in the PyTask, I removed lines of code until the end of the PyTask was correctly detected...

The problem was caused by:

import warnings

# transform user warnings into errors (so that they can be caught)
warnings.simplefilter('error', UserWarning)

I used this to catch numpy warnings:

for ifeat in range(nb_feat):
    try:
        mean_hrl = float(zs_hrl[ifeat][0][0])
    except:  # in case of a warning (i.e. NaNs), catch it and do not take this segment into account
        continue

The trick was handy, but since warnings.simplefilter changes the warning filters globally, it has side effects well beyond the task itself...

Hoping this can be useful to someone.

Best regards,
David

I solved the problem using a context manager:

with warnings.catch_warnings():
    warnings.simplefilter('error', UserWarning)
    # ... the code that may warn goes inside this block; the original
    # warning filters are restored when the block exits
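For completeness, a minimal sketch of how the pieces fit together (nb_feat and zs_hrl are only placeholders standing in for the real data used in my task):

import warnings

nb_feat = 1                      # placeholder: number of segments in the real task
zs_hrl = [[[float('nan')]]]      # placeholder: nested zonal-stats-like results

with warnings.catch_warnings():
    # the 'error' filter is only active inside this block, so FireWorks'
    # own bookkeeping afterwards is unaffected
    warnings.simplefilter('error', UserWarning)
    for ifeat in range(nb_feat):
        try:
            mean_hrl = float(zs_hrl[ifeat][0][0])
        except UserWarning:
            # warnings promoted to errors (e.g. bad segments) are caught
            # and the segment is skipped
            continue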

Sorry for the noise.

Thanks for updating us! Glad the problem is solved.

Best,
Anubhav