Also, another way to confirm that the caching is the problem is to inspect the MongoDB.
i) Connect to MongoDB
ii) Look at the workflows collection inside your fireworks database
iii) Examine the "fw_states" key, and see if you find any discrepancies between that and the fireworks data.
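A minimal pymongo sketch of that check (a rough, untested illustration; the host/port and database name are assumptions -- use whatever is in your my_launchpad.yaml):
----------
from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # adjust host/port/credentials as needed
db = client["fireworks"]

# compare the per-FW states cached on each workflow doc against the
# authoritative states stored in the fireworks collection
for wf in db.workflows.find({"state": "RUNNING"}):
    for fw_id_str, cached_state in wf.get("fw_states", {}).items():
        fw = db.fireworks.find_one({"fw_id": int(fw_id_str)}, {"state": 1})
        if fw and fw["state"] != cached_state:
            print("fw_id %s: workflow doc says %s, firework doc says %s"
                  % (fw_id_str, cached_state, fw["state"]))
----------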
But again, sending an example is the only way to really debug this.
On Fri, Sep 18, 2015 at 11:21 AM, Anubhav Jain <[email protected]> wrote:
Also, it would be helpful if you can indicate some details of your workflow. e.g., are you using any dynamic
FWActions?
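(For reference, "dynamic" here means a FireTask whose returned FWAction changes the workflow at runtime. A minimal sketch -- the task name and spec key are made up -- would be something like:)
----------
from fireworks import Firework, FWAction, ScriptTask
from fireworks.core.firework import FireTaskBase
from fireworks.utilities.fw_utilities import explicit_serialize

@explicit_serialize
class MaybeAddFollowup(FireTaskBase):     # hypothetical task
    def run_task(self, fw_spec):
        if fw_spec.get("needs_followup"):
            new_fw = Firework(ScriptTask.from_str("echo followup"))
            # "additions" (or "detours") is what makes the action dynamic: the
            # workflow gains nodes that were not in the originally submitted YAML
            return FWAction(additions=[new_fw])
        return FWAction()
----------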
Best,
Anubhav
On Fri, Sep 18, 2015 at 11:15 AM, Anubhav Jain <[email protected]> wrote:
Hi Derek,
I think it is related to the same issue I mentioned before, i.e. the caching. You can disable it by modifying
the code - there is a get_wf_by_fw_id_lzyfw() method. Change the contents of the method to be the same as
get_wf_by_fw_id(). Then it won't use the "LazyFirework" class, which has the caching.
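(If you drive things from your own Python script, a rough, untested alternative to editing the installed package is to alias the method at runtime. Note this only affects code running in that same process, so editing launchpad.py as described above is the more general route:)
----------
from fireworks.core.launchpad import LaunchPad

# debugging hack: make the lazy/cached accessor fall back to the full,
# non-cached workflow loader
LaunchPad.get_wf_by_fw_id_lzyfw = LaunchPad.get_wf_by_fw_id

lp = LaunchPad.auto_load()
----------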
*But* I think it is best if you can send a runnable example. It doesn't need to be 100% reproducible, even if it
fails 25% of the time it is enough that we can at least try to debug. Otherwise we are stuck.
Best,
Anubhav
On Thu, Sep 17, 2015 at 12:52 PM, Derek <[email protected]> wrote:
Hi Anubhav,
Well, my workflow didn't fail in the same spot, so reproducibility isn't deterministic. Some
interesting things I've noticed:
----------
lpad get_fws -d count
514
lpad get_fws -s COMPLETED -d count
514
# So all jobs completed
lpad get_wflows -s RUNNING
{
"name": "unnamed WF--1",
"state": "RUNNING",
"states_list":
"C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-
C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-
C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C",
"created_on": "2015-09-17T02:08:13.284000"
}
# Assuming C means COMPLETED, it looks like everything is done.
# However:
lpad get_wflows -s RUNNING -d all | grep RUNNING
"20": "RUNNING",
"29": "RUNNING",
"56": "RUNNING",
"110": "RUNNING",
"83": "RUNNING",
"65": "RUNNING",
"94": "RUNNING",
"11": "RUNNING",
"state": "RUNNING",
----------
So it looks like maybe there's some inconsistency between "lpad get_wflows" calls, depending on the
options passed? Maybe that narrows down the culprit?
If we think that caching is a problem, is there a way to disable it in order to verify that?
Thanks,
Derek
On Wed, 16 Sep 2015, Anubhav Jain wrote:
Hi Derek,
Examining the changelog (https://pythonhosted.org/FireWorks/changelog.html), I don't see anything past 1.04 that would have affected this issue one way or the other. So if you are seeing it in 1.04, chances are that you will still see it in 1.1.3.
Best,
Anubhav
On Wed, Sep 16, 2015 at 9:36 AM, Derek <[email protected]> wrote:
I experienced this bug using fireworks version 1.04. Has any of the caching code changed since then? I'm happy to go through the git logs myself if I know which files to look in.
I'm going to first rerun my entire workflow (using version 1.04) and see how reproducible the behavior is. If it is perfectly reproducible, I'll try it again in 1.1.3. If it reproduces there, I'll make a bunch of test/dummy jobs and share that code with you.
Regards,
Derek
On Tue, 15 Sep 2015, Anubhav Jain wrote:
Hi Derek,
Indeed, having a dummy workflow (with test tasks or with sleep tasks) that
we can reproduce in its entirety would be ideal.
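For instance, something along these lines would already be useful -- a minimal, untested sketch where every task is just a sleep; the chain length and sleep time are arbitrary:
----------
from fireworks import Firework, Workflow, LaunchPad, ScriptTask

# a linear chain of dummy FWs: each one sleeps, each depends on the previous one
fws = [Firework(ScriptTask.from_str("sleep 5"), name="dummy_%d" % i)
       for i in range(150)]
links = {fws[i]: [fws[i + 1]] for i in range(len(fws) - 1)}
wf = Workflow(fws, links_dict=links, name="dummy sleep workflow")

lp = LaunchPad.auto_load()
lp.add_wf(wf)
# then launch as usual, e.g.: qlaunch -r rapidfire --nlaunches infinite --sleep 60
----------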
Best,
Anubhav
On Tue, Sep 15, 2015 at 3:13 PM, Derek <[email protected]> wrote:
Hi Anubhav,
Here is a gist for the output of "lpad get_wflows -i 32 -d all":
https://gist.github.com/anonymous/df811ff4f1aec422e44d
The workflow that I submitted uses a bunch of external (to python) dependencies, so maybe not the easiest to replicate on another system.
I could try to make a dummy workflow where the firetasks are sleep commands--that would be portable. What exactly do you have in mind? I'm guessing this could be a timing-related issue, but beyond that I'm not sure.
Regards,
Derek
On Tue, 15 Sep 2015, Anubhav Jain wrote:
Hi Derek,
Thanks for reporting this - there is a potential for the FW and WFlow states to not match; this is because a contributor had pushed a change to cache the FW states inside the WFlow to speed up the overall operation of FWS. This causes the storage of FW and WFlow states to be separate. Unfortunately, if there is a bug in this code, it would cause the problems that you see.
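Concretely, the two state stores live in different documents, so a mismatch like the one you see for fw_id 30 would look roughly like this (a hand-written illustration, with most fields omitted):
----------
# workflows collection -- one doc per workflow, with a cached copy of each FW's state
{ "nodes": [..., 30, ...], "state": "RUNNING", "fw_states": { ..., "30": "RUNNING", ... } }

# fireworks collection -- one doc per FW, holding the authoritative state
{ "fw_id": 30, "state": "COMPLETED", ... }
----------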
Do you happen to have any code example I can use to reproduce this problem? Then I can include it as a unit test and ask the contributor to fix it. We already have a basic unit test for this functionality and it passes, so I need an example of how to force a failure.
Also, if the "-d all" option is too long, you can also use "lpad get_wflows -i 31 -d more" which will provide more information but not nearly as much as "all".
Best,
Anubhav
On Tue, Sep 15, 2015 at 10:26 AM, Derek <[email protected]> wrote:
Hi Anubhav,
Using lpad get_wflows, I see that one of the other dependencies for fw_id 32, fw_id 30, is still RUNNING according to:
$ lpad get_wflows -i 32 -d all | grep '"30"'
"30": "RUNNING",
However, lpad get_fws shows it as COMPLETED:
$ lpad get_fws -i 30
{
"name": "Unnamed FW",
"fw_id": 30,
"state": "COMPLETED",
"created_on": "2015-09-10T16:45:38.431775",
"updated_on": "2015-09-10T16:52:00.453871"
}
The associated launcher_* directory for fw_id 30 has an empty .error file and a .out file indicating that the Rocket finished and completed. The output that the job produced also looks correct (I ran the command manually and the results match).
It seems odd that get_wflows and get_fws show different states for the same job.
Happy to provide more info as necessary. The "-d all" option for lpad get_wflows produces a lot of output (it's a big workflow), but I can include it if it's helpful.
Regards,
Derek
On Mon, 14 Sep 2015, Anubhav Jain wrote:
Hi Derek,
The "WAITING" state indicates that there
is a job dependency, i.e. a parent job that the
current job is waiting to be COMPLETED. Those FWs
will not be picked up by rlaunch or qlaunch. Only
"READY"
jobs are set to be run.
See a description of states here:
https://pythonhosted.org/FireWorks/reference.html#interpretation-of-state-o
f-fws-and-wfs
Next, you probably want to try to debug
why the FW is not READY yet. One thing you can try
is querying the state of all the jobs within the
workflow of the affected FW, e.g.:
lpad get_wflows -i 31
where "31" is the id of one of your
affected FWS. This should print out a report of all
states in your workflow and you can probably see
immediately where things got stuck.
If things are still not clear after that
and you are confident that something might be wrong,
type the command:
lpad get_wflows -i 31 -d all
and paste the result back into this list
so I can take a look at all the links in the
workflow and what is happening.
Best,
Anubhav
On Monday, September 14, 2015 at 11:05:01 AM UTC-7, Derek wrote:
Hello,
Over the weekend I submitted just over 500 jobs to Fireworks (this is the largest pipeline I've tried to date) and executed them using:
qlaunch -r rapidfire --nlaunches infinite --sleep 60 --maxjobs_queue 50
All but 6 of them completed successfully and I'm trying to figure out what's happened with those 6. If I try "qlaunch" or "rlaunch", neither command recognizes that there are a few jobs left to complete. For example:
$ qlaunch -r singleshot
2015-09-14 10:47:45,974 INFO No jobs exist in the LaunchPad for submission to queue!
Here are some (hopefully) relevant details. I'm happy to provide more.
lpad get_fws -d count
514
lpad get_fws -s COMPLETED -d count
508
lpad get_fws -s WAITING -d count
6
lpad get_fws -s FIZZLED -d count
0
Which Fireworks are waiting?
$ lpad get_fws -s WAITING | grep fw_id | sort
"fw_id": 32,
"fw_id": 33,
"fw_id": 34,
"fw_id": 35,
"fw_id": 36,
"fw_id": 37,
What was the Firework with fw_id 31, and what happened to it?
$ lpad get_fws -i 31 -d more
[See https://gist.github.com/anonymous/10ea08044f574d190625]
Looking in the launch directory for fw_id 31 (and looking at the yaml file I used to submit my workflow), I know that the Firework with fw_id 31 should be (as far as I can tell) the only dependency for the Firework with fw_id 32.
Was a launch directory ever created for the Firework with fw_id 32? It appears not:
grep -rl '"fw_id": 32,' ./*
(the same is true for the other "waiting" Fireworks)
If I try to rerun these Fireworks, still no luck:
lpad rerun_fws -i 32
2015-09-14 10:50:56,652 INFO Finished setting 1 FWs to rerun
$ qlaunch -r singleshot
2015-09-14 10:52:37,092 INFO No jobs exist in the LaunchPad for submission to queue!
Is there anything else I should try/check/examine?
Thank you,
Derek
PS: Apologies that this post also got appended to my previous "qlaunching dependent jobs" thread