Fireworks scalability

I have been doing some testing with FireWorks and scalability as part of a workflow framework. I have been having issues while testing larger-scale workflows with 1000 and 10000 independent Fireworks. As I debug, I notice my workers stall after printing the following messages while retrieving their first Firework to execute:

2015-04-21 14:41:36,133 - launchpad - DEBUG - FW with id: 45 is unique!

2015-04-21 14:41:36,137 - launchpad - DEBUG - Created/updated Launch with launch_id: 11

Any ideas where the issue may be arising? Is it just the sheer number of jobs, or is there something wrong in the code? I don’t seem to have these issues when there are far fewer Fireworks in the workflow (10-100).

To provide a bit more information, I am using a compute node on the NERSC Hopper system as the FireWorker; the worker itself is running the “rapidfire” script to retrieve and execute jobs. The worker seems to stall (i.e. take an indefinite amount of time) while setting up the Rocket and/or retrieving the first job.

···


Hi,

It is difficult for me to know what could be happening. To provide some more information on the debug messages:

DEBUG - FW with id: 45 is unique!

This just means that FWS did not find any existing job that was a duplicate of this Firework, so this job is ready to run. If you’ve enabled duplicate checking and have entered duplicate jobs, this message will sometimes instead show that a duplicate was detected, and the job will not be launched per the FWS duplicate handling procedure. See the FWS documentation on duplicate checking for details.
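
For context, duplicate checking is opt-in per Firework. A minimal sketch of turning it on (the DupeFinderExact import path and the “_dupefinder” spec key are quoted from memory here, so double-check them against your FWS version):

```python
from fireworks import Firework, ScriptTask
# Import path below is an assumption; verify against your FireWorks install.
from fireworks.user_objects.dupefinders.dupefinder_exact import DupeFinderExact

# The '_dupefinder' spec key asks FWS to look for duplicates of this Firework
# before launching it.
fw = Firework(ScriptTask.from_str('echo "hello"'),
              spec={'_dupefinder': DupeFinderExact()})
```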

DEBUG - Created/updated Launch with launch_id: 11

This means that FWS checked out your FW to be launched and created a launch object to store information like time started, etc.

The next thing that should have happened is that the system prints a message saying “Checked out FW with id: {}”. If that is not the case, something is tripping up between lines 790 and 810 of launchpad.py. Those lines try to update the Firework with the launch information (lines 799/800; perhaps the root of the issue is the time needed to update a large workflow?), and then look for any other relevant Fireworks to update (there should be none in your case, so this should be quick).

We have done scaling tests up to 1000 Fireworks in a workflow in the past, and have not seen this issue or any indication that it would occur. Is it possible for you to share your FW.json so I can try testing your exact workflow?

Best,

Anubhav

···


Btw, if you have access to the source code, you can debug by inserting further debug lines after each step, i.e. locate these lines in launchpad.py (around lines 799/800):

```python
self._upsert_fws([m_fw])
self._refresh_wf(m_fw.fw_id)
```

and replace with:

```python
self._upsert_fws([m_fw])
self.m_logger.debug('Upsert successful')
self._refresh_wf(m_fw.fw_id)
self.m_logger.debug('Refresh successful')
```

By adding those debug messages, you will know whether those lines are the culprit or not.
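
As an alternative to editing the source, you could also try turning up the LaunchPad’s own stream logging. A minimal sketch, assuming strm_lvl and logdir are accepted as LaunchPad constructor keywords in your FWS version (they mirror the my_launchpad.yaml keys):

```python
from fireworks import FWorker, LaunchPad
from fireworks.core.rocket_launcher import rapidfire

# Placeholders below stand in for your own connection details.
lp = LaunchPad(host='<db host>', port=27017, name='<db name>',
               username='<user>', password='<password>',
               strm_lvl='DEBUG', logdir='.')
rapidfire(lp, FWorker())
```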

···


Hey Anubhav,

Going off of your advice, I’ve been looking into those functions and trying to track down the source of the issue using pdb. So far, it looks like the worker node is having trouble locking the workflow while retrieving the first Firework. Specifically, while running the launchpad code for “self._refresh_wf(m_fw.fw_id)”, the worker node gets trapped in the lock’s __enter__ function, particularly in the while loop defined at lines 55-73. It looks like the “links_dict” variable remains None indefinitely, which causes the worker to stall in this while loop. I can’t say much about why the self.lp.workflows.find_and_modify call at line 72 is not returning a value, but I am surprised that the code never reaches the “waiting_time > self.expire_secs” limit.
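
For reference, the loop I’m stuck in has roughly this shape (my own reconstruction for illustration; the class name, queries, and field names are approximate rather than the verbatim FWS code):

```python
import random
import time


class WFLock(object):
    """Approximate shape of the workflow-lock __enter__ loop described above
    (reconstruction for illustration; not the verbatim FWS source)."""

    def __init__(self, lp, fw_id, expire_secs=300, kill=False):
        self.lp = lp            # LaunchPad; lp.workflows is the Mongo collection
        self.fw_id = fw_id
        self.expire_secs = expire_secs   # cf. the 300 s expire time
        self.kill = kill                 # cf. WFLOCK_EXPIRATION_KILL

    def __enter__(self):
        ctr = 0
        waiting_time = 0
        # atomically claim the workflow containing this fw_id if nobody holds it
        links_dict = self.lp.workflows.find_and_modify(
            {'nodes': self.fw_id, 'locked': {'$exists': False}},
            {'$set': {'locked': True}})
        while not links_dict:            # None -> someone else holds the lock
            ctr += 1
            time_incr = ctr / 10.0 + random.random() / 100.0
            time.sleep(time_incr)        # back off, then retry
            waiting_time += time_incr
            if waiting_time > self.expire_secs:
                if self.kill:
                    # forcibly acquire the lock regardless of its current state
                    links_dict = self.lp.workflows.find_and_modify(
                        {'nodes': self.fw_id}, {'$set': {'locked': True}})
                    if not links_dict:
                        raise ValueError('Could not find workflow in database: '
                                         '{}'.format(self.fw_id))
                else:
                    raise ValueError(
                        'Could not get workflow - LOCKED: {}'.format(self.fw_id))
            else:
                links_dict = self.lp.workflows.find_and_modify(
                    {'nodes': self.fw_id, 'locked': {'$exists': False}},
                    {'$set': {'locked': True}})
        return links_dict

    def __exit__(self, exc_type, exc_val, exc_tb):
        # release the lock
        self.lp.workflows.find_and_modify(
            {'nodes': self.fw_id}, {'$unset': {'locked': True}})
```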

I am still testing to see why the time-limit “break” is never reached, even though the worker runs well past the 300 sec expire time. As I mentioned but didn’t explain fully, the worker node is merely a single Hopper compute node, instantiated using a qsub script which contains the command:

aprun -n 1 python ../WorkerCode

This WorkerCode is merely:

```python
from fireworks import FWorker, LaunchPad
from fireworks.core.rocket_launcher import rapidfire, launch_rocket

# set up the LaunchPad, get jobs
lp = LaunchPad('mongodb03.nersc.gov', 27017, 'crd_fireworks_launchpad',
               'crd_fw_admin', 'E82RqyUZ')

# print("DEBUG: Running launch")
rapidfire(lp, FWorker())
```

There is only one FireWorker, which also confuses me as to why there would be any workflow lock conflicts. Any tips or clues as to where the issue lies would help. If you want any additional information (like the FW.json, which you mentioned before), I can provide it.

···


Hi,

It looks like you’re at LBL - if you want, we can meet in person to try and debug it. Here are some notes:

i) Your workflow definitely has a lock stuck on it. Even with one FireWorker, a lock is still created. A lock can get stuck on a workflow if there is an unclean “stop” to the workflow. However, even in those cases FWS should be able to recover. (A quick way to check for a stuck lock directly in the database is sketched after these notes.)

ii) I am not sure if you modified your FW_config.yaml to edit FWS settings. If you have, please confirm that the parameter “WFLOCK_EXPIRATION_KILL” is either set to True or not set at all (the default value is True). An example FW_config.yaml snippet is also shown after these notes.

iii) Another place to look is your LaunchPad log file. Your my_launchpad.yaml file can specify a “logdir” parameter that gives the directory in which to store the logs. You also want to set “strm_lvl” to DEBUG to catch everything, i.e. in your my_launchpad.yaml put:

```yaml
logdir: <path to your log directory>
strm_lvl: DEBUG
```

You should then see a message in the log that looks like one of these:

  • Could not find workflow in database

  • FORCIBLY ACQUIRING LOCK, WF

  • Could not get workflow - LOCKED: {}

If you see the last one of those, it means that item (ii) is incorrectly set.
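
To make notes (i) and (ii) concrete, here are two quick sketches. First, a direct check for a stuck lock on the workflow containing a given fw_id; this assumes the lock is stored as a “locked” field on the workflow document (as described earlier in the thread), so the field and collection names are worth verifying against your FWS version:

```python
from pymongo import MongoClient

# Connection details are placeholders; use the same host/db/credentials as
# your LaunchPad. fw_id 45 is the Firework from the log messages above.
client = MongoClient('mongodb://<user>:<password>@<db host>:27017/<db name>')
db = client['<db name>']
print(db.workflows.find_one({'nodes': 45}, {'locked': 1}))
```

Second, the FW_config.yaml settings from note (ii); the WFLOCK_EXPIRATION_SECS name (matching the 300 second timeout mentioned above) is an assumption:

```yaml
# FW_config.yaml
WFLOCK_EXPIRATION_KILL: True   # allow FWS to forcibly take over an expired lock
WFLOCK_EXPIRATION_SECS: 300    # assumed name for the 300 s lock timeout
```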

Hope that helps get you further; again, please feel free to contact me if needed.

Best,

Anubhav
