Hey Anubhav,
Going off of your advice, I've been looking into those functions and tracking down the source of the issue with pdb. So far, it looks like the worker node is having trouble locking the workflow while retrieving the first Firework. Specifically, while running the launchpad code for `self._refresh_wf(m_fw.fw_id)`, the worker gets trapped in the lock's enter function, in the while loop defined at lines 55-73. The `links_dict` variable remains None indefinitely, which causes the worker to stall inside that loop. I can't say why the `self.lp.workflows.find_and_modify` call at line 72 is not returning a value, but I am surprised that the code never reaches the `waiting_time > self.expire_secs` limit.
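For anyone following along, the poll-with-timeout pattern described above boils down to something like the sketch below. This is a simplified illustration, not the actual FireWorks source; `acquire_lock` and `try_lock` are my own names, with `try_lock` standing in for the `find_and_modify` call. If `try_lock` keeps returning None, the accumulated `waiting_time` should eventually exceed `expire_secs` and break out:

```python
import time

def acquire_lock(try_lock, expire_secs=300.0, wait_per_iter=0.5):
    """Poll try_lock() until it returns a non-None lock document,
    or raise once the cumulative waiting time exceeds expire_secs.
    try_lock stands in for the workflows.find_and_modify call."""
    waiting_time = 0.0
    links_dict = try_lock()
    while links_dict is None:
        time.sleep(wait_per_iter)
        waiting_time += wait_per_iter
        if waiting_time > expire_secs:
            # This is the exit path that appears never to trigger
            # in the behavior described above.
            raise RuntimeError(
                "Could not acquire workflow lock within "
                "{} s".format(expire_secs))
        links_dict = try_lock()
    return links_dict
```

If the real loop never hits the equivalent of this RuntimeError, either `waiting_time` is not being accumulated the way the sketch assumes, or the loop is blocking inside the database call itself rather than spinning.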
I am still testing to see why the time-limit `break` is never reached, even though the worker runs well past the 300 sec expire time. As I mentioned but didn't explain fully, the worker node is merely a single Hopper compute node, instantiated using a qsub script which contains the command:
aprun -n 1 python ../WorkerCode
This WorkerCode is merely:
from fireworks import FWorker, LaunchPad
from fireworks.core.rocket_launcher import rapidfire, launch_rocket

# set up the LaunchPad, get jobs
lp = LaunchPad('mongodb03.nersc.gov', 27017, 'crd_fireworks_launchpad', 'crd_fw_admin', 'E82RqyUZ')

# print("DEBUG: Running launch")
rapidfire(lp, FWorker())
There is only one FireWorker, which also confuses me as to why there would be any workflow lock conflicts. Any tips or clues as to where the issue lies would help. If you want any additional information (like the FW.json, which you mentioned before), I can provide it.
On Wednesday, April 22, 2015 at 3:33:47 PM UTC-7, Anubhav Jain wrote:
Btw, if you have access to the source code you can debug by inserting further debug lines after each step. i.e. locate these lines in my_launchpad.py (around 799/800):
self._upsert_fws([m_fw])
self._refresh_wf(m_fw.fw_id)
and replace with:
self._upsert_fws([m_fw])
self.m_logger.debug('Upsert successful')
self._refresh_wf(m_fw.fw_id)
self.m_logger.debug('Refresh successful')
By adding those debug messages, you will know if those lines are the culprit or not
On Wednesday, April 22, 2015 at 3:31:28 PM UTC-7, Anubhav Jain wrote:
Hi,
It is difficult for me to know what could be happening. To provide some more information on the debug messages:
DEBUG - FW with id: 45 is unique!
This just means that FWS did not find any existing job that was a duplicate of this Firework, so this job is ready to run. If you've enabled duplicate checking and have entered duplicated jobs, this message will sometimes instead show that a duplicate was detected, and the job will not be launched, per the FWS duplicate-handling procedure.
DEBUG - Created/updated Launch with launch_id: 11
This means that FWS checked out your FW to be launched and created a launch object to store information like time started, etc.
The next thing that should have happened is that the system should print out a message saying “Checked out FW with id: {}”. If that is not the case, something is tripping up between lines 790 and 810 of launchpad.py. Those lines are trying to update the Firework with the launch information (lines 799/800 - perhaps the root of the issue is time needed to update a large workflow?), and then also look for any other relevant Fireworks to update (there should be none in your case, so this should be quick).
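If the workflow update at lines 799/800 really is the slow step, a quick way to confirm is to bracket each call with a timer. The `timed` helper below is my own sketch, not part of FWS; it can be dropped into launchpad.py and wrapped around the two calls:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the wrapped block takes; handy for bracketing
    the _upsert_fws / _refresh_wf calls in launchpad.py."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print("{}: {:.3f} s".format(label, time.perf_counter() - start))

# Usage inside launchpad.py, around the suspected lines:
# with timed("upsert"):
#     self._upsert_fws([m_fw])
# with timed("refresh"):
#     self._refresh_wf(m_fw.fw_id)
```

If "refresh" dominates and grows with workflow size, that would point to the large-workflow hypothesis; if neither call ever returns, the stall is inside the lock acquisition rather than the update itself.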
We have done scaling tests up to 1000 Fireworks in a workflow in the past, and have not seen this issue or any indication that it would occur. Is it possible for you to share your FW.json so I can try testing your exact workflow?
Best,
Anubhav
On Wednesday, April 22, 2015 at 2:42:01 PM UTC-7, [email protected] wrote:
To provide a bit more information, I am using a compute node on the NERSC Hopper system as the FireWorker; the worker itself is running the "rapidfire" script to retrieve and execute jobs. The worker seems to stall (i.e., hang for an indefinite amount of time) while setting up the Rocket and/or retrieving the first job.
On Tuesday, April 21, 2015 at 2:48:15 PM UTC-7, [email protected] wrote:
I have been doing some testing with FireWorks and scalability as part of a workflow framework. I have been having issues while testing larger-scale workflows with 1000 and 10000 independent Fireworks. As I debug, I notice my workers stall after printing the following messages while retrieving their first Firework to execute:
2015-04-21 14:41:36,133 - launchpad - DEBUG - FW with id: 45 is unique!
2015-04-21 14:41:36,137 - launchpad - DEBUG - Created/updated Launch with launch_id: 11
Any ideas where the issue may be arising? Is it just the sheer number of jobs, or is there something wrong in the code? I don't seem to have these issues when there are far fewer Fireworks in the workflow (10-100).