Problem of lost runs

Hi,

First a bit of context:

I run dynamic workflows that grow over time: some fireworks are dedicated to building new sub-workflows and appending them, via a task that returns FWAction(additions=new_wf) (a minimal sketch is given below).
In this way I produce a large number of fireworks (several tens of thousands) for each initial workflow.
I schedule these fireworks on a large number of cores of an HPC cluster (several hundred).
These tasks execute quite quickly.
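
Roughly, the kind of builder task I mean looks like this (a simplified sketch; ShortTask and the “pending_params” spec key are placeholders for the real thing):

    from fireworks import FiretaskBase, FWAction, Firework, Workflow, explicit_serialize

    @explicit_serialize
    class BuildAndAppendTask(FiretaskBase):
        """Builds a batch of new fireworks and appends them to the running workflow."""

        def run_task(self, fw_spec):
            # build a new sub-workflow of short tasks from the inputs carried in the spec
            new_fws = [Firework(ShortTask(params=p), name="short_task")
                       for p in fw_spec["pending_params"]]
            new_wf = Workflow(new_fws)
            # append the new sub-workflow as children of this firework
            return FWAction(additions=new_wf)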

At the beginning (with a freshly emptied MongoDB), everything runs fine. But after a while, when the number of fireworks gets higher, the system seems unable to properly detect the completion of fireworks: a lot of tasks stay in the RUNNING state and therefore never allow the next WAITING tasks to pass to the READY state, which freezes the workflow execution.

I never experienced this in my previous runs, maybe because I was using far fewer tasks that each took more time to complete, or because I was generating fewer fireworks, or because I was deleting completed workflows before the number of fireworks in the DB became high …

I have noticed that as the number of fireworks in MongoDB grows, the server-side cost of each operation grows very quickly …

I tried to overcome the problem with a process running:

until lpad detect_lostruns --refresh ; do echo "retry"; done

(I used until because it often failed with messages like “2018-01-08 15:37:27,876 INFO fw_id 38807 locked. Can’t refresh!” or “pymongo.errors.CursorNotFound: Cursor not found, cursor id: 22434284420”.)

The same problem occurs when I try to delete completed workflows in an attempt to lower the MongoDB workload: it is very (!) slow.

So I wonder about the scalability of FireWorks, and whether there are guidelines I should follow to get good performance and scalability (we plan to increase the load in the near future: more fireworks, more compute nodes).

Furthermore, is there a way for FireWorks to detect lost runs automatically? These lost runs stay lost for several days, whereas the jobs should be marked as FIZZLED since they have not pinged the server every half hour:

in fw_config.py:

PING_TIME_SECS = 1800 # while Running a job, how often to ping back the server that we’re still alive

RUN_EXPIRATION_SECS = PING_TIME_SECS * 2 # mark job as FIZZLED if not pinged in this time

Do you have any advice for overcoming this problem?

I use FireWorks v1.4.1 with Python 3, and MongoDB v3.4.10.

Best regards,
David

PS: for the server that hosts MongoDB, we use a VM with 4 GB RAM & 4 cores (increased from the previous configuration of 2 cores & 2 GB, but that didn’t solve the problem, just delayed it a bit).

Hi David,

We have never tried running in the same style as you (dynamically adding tens of thousands of jobs to many workflows), so we haven’t encountered this problem.

You might try updating your version of FWS. There were some performance improvements for large workflows contributed by another user, although I think the majority of these focused on the initial add of large workflows. Still, it will not hurt to upgrade and give it a try (I am not super hopeful here, though).

Probably what is happening is the following:

  • Let’s say that two Fireworks in your workflow finish simultaneously. Both of these Fireworks want to update the workflow with new information, e.g., by adding new Fireworks. Here, there is a danger of encountering a race condition in which a delay between read and write causes the update of the second FW to overwrite the update of the first FW.

  • In order to prevent this, we only allow one Firework to update the workflow at a time. The second one needs to wait for the first one to completely finish updating the workflow before it starts its own update. Practically, this is achieved with the “WFLock” object in FireWorks. You’ll notice that a lot of statements in “launchpad.py” say “with WFLock:”, which means the operation requires a lock.

  • Unfortunately, such locks can be problematic. For example, if the first FW never releases its lock, the workflow is essentially unable to update itself, plus the second FW might just be wasting your CPU time waiting around for that first FW to release its lock (which it never will).

  • The behavior is to wait a total of 5 minutes to obtain a lock; after that, we just give up and throw an error. This is again to prevent using up CPU time waiting for a lock release that will never happen. In our experience, 5 minutes was a comfortable margin for any workflow update. But if you have very large workflows or your MongoDB is running slowly, maybe this is not enough.

  • You can try increasing the parameter WFLOCK_EXPIRATION_SECS in your FW config if you want FireWorks to wait longer to obtain a lock (see the config sketch after this list). But even if this works, it seems like an awfully long time for a job to be sitting around waiting for a database update.

  • There is also a parameter called WFLOCK_EXPIRATION_KILL which, if set to True, will forcibly acquire the lock if it can’t get one after WFLOCK_EXPIRATION_SECS. The problem here is that you become susceptible to the concurrent-update error if you try this.

  • Note that if you encounter the error “INFO fw_id 38807 locked. Can’t refresh!” and are confident that fw_id 38807 is not currently running and trying to update the workflow, you can manually unlock that workflow using “lpad admin unlock -i 38807”

  • You can also manually refresh a workflow (without using detect_lostruns) using “lpad admin refresh -i <fw_id>”
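
If you decide to adjust these, I believe both can be set in your FW_config.yaml (please double-check the key names against the fw_config.py of your FWS version); something along these lines:

    # FW_config.yaml -- values below are just examples
    WFLOCK_EXPIRATION_SECS: 900     # wait up to 15 minutes for a workflow lock instead of 5
    WFLOCK_EXPIRATION_KILL: false   # if true, forcibly take the lock once the timeout expires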

I am not sure about your question on detecting lost runs automatically. The “lpad detect_lostruns” command should already detect jobs that are lost. If it is not finding your runs, it is probably because the launch data for those runs indeed shows COMPLETED while the workflow still lists the Firework as RUNNING. If you look up the corresponding launch data (in the launches collection, or using “lpad get_fws -i <fw_id> -d all”), it will perhaps show as COMPLETED even though the state of the Firework in the workflow is RUNNING. This would also be a sign of the locking problem.
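
If you want to scan for such cases directly, a rough pymongo sketch along these lines should work (I am assuming the standard fireworks/launches collections, with each FW document listing its launch ids under “launches”):

    from fireworks import LaunchPad

    lp = LaunchPad.auto_load()

    # find fireworks marked RUNNING whose launch data actually says COMPLETED
    for fw in lp.fireworks.find({"state": "RUNNING"}, {"fw_id": 1, "launches": 1}):
        done = lp.launches.find_one({"launch_id": {"$in": fw.get("launches", [])},
                                     "state": "COMPLETED"})
        if done:
            print("fw_id %d: launch %d is COMPLETED but the FW is still RUNNING"
                  % (fw["fw_id"], done["launch_id"]))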

Hopefully this is enough to get you started. If you can report more on what you find I can try to provide some more guidance.

Best,
Anubhav

Hi Anubhav,

Sorry for the delay in my response.

I have updated FWS; it changed nothing for my problem, but I have figured out how to deal with it.

When a lot of short tasks are scheduled at the same time on a large number of workers, and there are at the same time a lot of fireworks and workflows in MongoDB, the high stream of queries (combined with the growing cost of answering each one, caused by the growing size of the firework objects) tends to overload the LaunchPad.

Furthermore, I have written a kind of “rocket scheduler”: a script that is submitted to SLURM (instead of rlaunch…) and is in charge of monitoring the READY/RUNNING fireworks per category (it is able to dynamically submit new jobs to the queue if needed, and to shut down when the load drops). This scheduler launches rockets as subprocesses and lets me fill all the nodes with the READY fireworks from all categories, since they can have different needs in terms of parallelism, for example.

To achieve this, each rocket_scheduler instance (one per compute node) asks the LaunchPad for information. This is done in a loop, with a sleep() inside to avoid overloading the LaunchPad (a rough sketch of the idea is given below).
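
The core of each instance is roughly the following loop (heavily simplified; the category names and the per-category fworker files are placeholders):

    import subprocess
    import time

    from fireworks import LaunchPad

    lp = LaunchPad.auto_load()
    CATEGORIES = ["short_serial", "long_parallel"]   # placeholder category names
    SLEEP_SECS = 30                                  # pause between LaunchPad queries

    while True:
        for cat in CATEGORIES:
            n_ready = lp.get_fw_ids({"state": "READY", "spec._category": cat},
                                    count_only=True)
            if n_ready:
                # launch a rocket for this category as a subprocess; the real script
                # also tracks how many rockets are already running on the node
                subprocess.Popen(["rlaunch", "-w", "fworker_%s.yaml" % cat, "singleshot"])
        time.sleep(SLEEP_SECS)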

I think the conjunction of these three factors led to the problems I described above.

I have solved them by:

  • merging several parallel short tasks into a single longer task (in a for loop), thereby reducing the total number of tasks and the number of queries to the LaunchPad (see the sketch after this list);
  • adding a task to the workflow that periodically deletes COMPLETED workflows (thereby reducing the MongoDB load);
  • increasing the sleep time inside the infinite loop of my rocket_scheduler instances: the system is less responsive, but since the tasks now take longer, this is less critical.

With these improvements, it runs fine.
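
For the first point, instead of one firework per short computation, each firework now loops over a whole chunk of inputs, something like this (run_short_computation and the “chunk” spec key are placeholders):

    from fireworks import FiretaskBase, FWAction, explicit_serialize

    @explicit_serialize
    class RunChunkTask(FiretaskBase):
        """Runs a whole chunk of short computations inside a single firework."""

        def run_task(self, fw_spec):
            results = []
            for params in fw_spec["chunk"]:                     # e.g. ~100 inputs per firework
                results.append(run_short_computation(params))   # placeholder for the real work
            return FWAction(stored_data={"results": results})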

I can improve it further if I manage to have all the queries to the LaunchPad made by a single rocket_scheduler instance, which would pass the results to the other instances in a master/slave pattern.

So I think we can scale up with FireWorks to the next step (we plan to double our compute capacity soon).

Thanks a lot for your good work,
David

Hi David,

I’m glad to hear you were able to figure out a solution. I am a little surprised that deleting COMPLETED workflows helps, but perhaps a smaller database size just makes the queries go faster (especially if you are also deleting the associated launches and fireworks). It’s possible that running a db compact() (either directly via Mongo or using the “lpad admin tuneup” command in FWS) will help a little.

One other suggestion: see if you can inspect the Mongo queries to find out whether particular queries are taking a long time. There is some documentation on profiling here:

https://docs.mongodb.com/manual/tutorial/manage-the-database-profiler/

and there is also the currentOp() command that is a somewhat easier way to see what’s going on:

https://docs.mongodb.com/manual/reference/method/db.currentOp/
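
If it is more convenient, you can also drive the profiler from pymongo; a rough sketch (the connection details are placeholders and 200 ms is an arbitrary threshold):

    from pymongo import DESCENDING, MongoClient

    # point this at the database used by your LaunchPad
    db = MongoClient("mongodb://launchpad-host:27017")["fireworks"]

    db.command("profile", 1, slowms=200)   # record operations slower than 200 ms

    # later: show the 5 slowest recorded operations
    for op in db["system.profile"].find().sort("millis", DESCENDING).limit(5):
        print(op["op"], op.get("ns"), op.get("millis"), "ms")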

If you find that certain queries are taking a long time (e.g., querying a specific key in the FW spec), you might be able to speed them up by adding appropriate indexes to your LaunchPad. You can do this directly via MongoDB or by using the “user_indices” and “wf_user_indices” keys in the my_launchpad.yaml file.
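
If I remember correctly, those keys take a list of field names to index (please double-check the exact format against your FWS version), e.g.:

    # excerpt of my_launchpad.yaml -- the spec key below is only an example
    user_indices:
      - spec._category
    wf_user_indices:
      - updated_on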

Best,

Anubhav
