Fireworks With "Waiting" Status Not Launched

Hello,

Over the weekend I submitted just over 500 jobs to Fireworks
(this is the largest pipeline I've tried to date) and executed
them using:

qlaunch -r rapidfire --nlaunches infinite --sleep 60 --maxjobs_queue 50

All but 6 of them completed successfully and I'm trying to figure
out what's happened with those 6. If I try "qlaunch" or
"rlaunch", neither command recognizes that there are a few jobs
left to complete. For example:

$ qlaunch -r singleshot
2015-09-14 10:47:45,974 INFO No jobs exist in the LaunchPad for submission to queue!

Here are some (hopefully) relevant details. I'm happy to provide
more.

$ lpad get_fws -d count
514
$ lpad get_fws -s COMPLETED -d count
508
$ lpad get_fws -s WAITING -d count
6
$ lpad get_fws -s FIZZLED -d count
0

Which Fireworks are waiting?

$ lpad get_fws -s WAITING | grep fw_id | sort
"fw_id": 32,
"fw_id": 33,
"fw_id": 34,
"fw_id": 35,
"fw_id": 36,
"fw_id": 37,

What was the Firework with fw_id 31, and what happened to it?

$ lpad get_fws -i 31 -d more
[See https://gist.github.com/anonymous/10ea08044f574d190625]

Looking in the launch directory for fw_id 31 (and looking at the
yaml file I used to submit my workflow), I know that the Firework
with fw_id 31 should be (as far as I can tell) the only
dependency for the Firework with fw_id 32.

Was a launch directory ever created for the Firework with fw_id
32? It appears not:

$ grep -rl '"fw_id": 32,' ./*
$

(the same is true for the other "waiting" Fireworks)

If I try to rerun these Fireworks, still no luck:

$ lpad rerun_fws -i 32
2015-09-14 10:50:56,652 INFO Finished setting 1 FWs to rerun

$ qlaunch -r singleshot
2015-09-14 10:52:37,092 INFO No jobs exist in the LaunchPad for submission to queue!

Is there anything else I should try/check/examine?

Thank you,

Derek

PS: Apologies that this post also got appended to my previous "qluanching dependent jobs" thread

Hi Derek,

The “WAITING” state indicates that there is a job dependency, i.e. the current job is waiting for a parent job to be COMPLETED. Those FWs will not be picked up by rlaunch or qlaunch; only “READY” jobs are eligible to run. See a description of states here:

https://pythonhosted.org/FireWorks/reference.html#interpretation-of-state-of-fws-and-wfs
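
The rule described above can be illustrated with a minimal sketch. This is not FireWorks code: the function names `promote_ready` and `launchable`, the fw_ids, and the plain-dict data model are all hypothetical, chosen only to show why a WAITING job is invisible to the launchers until every parent is COMPLETED.

```python
def promote_ready(states, parents):
    """Return a copy of `states` in which every WAITING job whose
    parents are all COMPLETED has been promoted to READY."""
    updated = dict(states)
    for fw_id, state in states.items():
        if state != "WAITING":
            continue
        if all(states[p] == "COMPLETED" for p in parents.get(fw_id, [])):
            updated[fw_id] = "READY"
    return updated


def launchable(states):
    """Only READY jobs are candidates for a launcher."""
    return sorted(fw_id for fw_id, s in states.items() if s == "READY")


# Hypothetical workflow: job 3 depends on jobs 1 and 2.
parents = {3: [1, 2]}
stuck = {1: "RUNNING", 2: "COMPLETED", 3: "WAITING"}
done = {1: "COMPLETED", 2: "COMPLETED", 3: "WAITING"}

print(launchable(promote_ready(stuck, parents)))  # []
print(launchable(promote_ready(done, parents)))   # [3]
```

In the first case a single unfinished parent keeps job 3 in WAITING, so a launcher finds nothing to run, which matches the "No jobs exist in the LaunchPad" message.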

Next, you probably want to try to debug why the FW is not READY yet. One thing you can try is querying the state of all the jobs within the workflow of the affected FW, e.g.:

lpad get_wflows -i 31

where “31” is the id of one of your affected FWs. This should print out a report of all states in your workflow, and you can probably see immediately where things got stuck.

If things are still not clear after that and you are confident that something might be wrong, type the command:

lpad get_wflows -i 31 -d all

and paste the result back into this list so I can take a look at all the links in the workflow and what is happening.

Best,

Anubhav

Hi Anubhav,

Using lpad get_wflows, I see that one of the other dependencies for fw_id 32, fw_id 30, is still RUNNING according to:

$ lpad get_wflows -i 32 -d all | grep '"30"'
"30": "RUNNING",

However, lpad get_fws shows it as COMPLETED:

$ lpad get_fws -i 30
{
     "name": "Unnamed FW",
     "fw_id": 30,
     "state": "COMPLETED",
     "created_on": "2015-09-10T16:45:38.431775",
     "updated_on": "2015-09-10T16:52:00.453871"
}

The associated launcher_* directory for fw_id 30 has an empty .error file and a .out file indicating that the Rocket finished and completed. The output that the job produced also looks correct (I ran the command manually and the results match).

It seems odd that get_wflows and get_fws show different states for the same job.
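
A comparison like this can be automated once both sets of states are available as dictionaries. The sketch below is only an illustration under stated assumptions: `find_state_mismatches` is a hypothetical helper, the workflow-side states are assumed to be keyed by fw_id strings (as in the `"30": "RUNNING"` line above), and the entry for fw_id 31 on the workflow side is made up for the example.

```python
def find_state_mismatches(wf_states, fw_states):
    """Compare the states recorded on the workflow with the states of
    the individual Fireworks.

    wf_states: {"<fw_id>": state}, keyed by strings
    fw_states: {fw_id: state}, keyed by integers
    Returns a sorted list of (fw_id, workflow_state, firework_state).
    """
    mismatches = []
    for key, cached in wf_states.items():
        fw_id = int(key)
        actual = fw_states.get(fw_id)
        if actual is not None and actual != cached:
            mismatches.append((fw_id, cached, actual))
    return sorted(mismatches)


# States as reported in this thread (the workflow-side value for 31 is assumed).
wf_states = {"30": "RUNNING", "31": "COMPLETED", "32": "WAITING"}
fw_states = {30: "COMPLETED", 31: "COMPLETED", 32: "WAITING"}

for fw_id, cached, actual in find_state_mismatches(wf_states, fw_states):
    print(f"fw_id {fw_id}: workflow says {cached}, Firework says {actual}")
```

Any fw_id this reports is a candidate for the parent that is holding its children in WAITING.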

Happy to provide more info as necessary. The "-d all" option for lpad get_wflows produces a lot of output (it's a big workflow), but I can include it if it's helpful.

Regards,

Derek

Hi Derek,

Thanks for reporting this - there is a potential for the FW and WFlow states to not match; this is because a contributor had pushed a change to cache the FW states inside the WFlow to speed up the overall operation of FWS. This causes the storage of FW and WFlow states to be separate. Unfortunately, if there is a bug in this code, it would cause the problems that you see.
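
To make the failure mode concrete, here is a toy model of two separately stored copies of the same state. It is not the FireWorks implementation: `ToyLaunchPad`, its methods, and in particular the assumption that children are promoted based on the cached copy are all hypothetical, used only to show how one lost cache update could leave a child stuck.

```python
class ToyLaunchPad:
    """Toy model only: per-job states plus a copy cached on the workflow."""

    def __init__(self, parents):
        # parents maps a child fw_id to the fw_ids it depends on.
        self.parents = parents
        ids = set(parents) | {p for ps in parents.values() for p in ps}
        self.fw_states = {i: "WAITING" if parents.get(i) else "READY" for i in ids}
        self.wf_cache = dict(self.fw_states)

    def _set(self, fw_id, state, refresh_cache=True):
        self.fw_states[fw_id] = state
        if refresh_cache:
            self.wf_cache[fw_id] = state

    def run(self, fw_id):
        self._set(fw_id, "RUNNING")

    def complete(self, fw_id, refresh_cache=True):
        self._set(fw_id, "COMPLETED", refresh_cache)
        # Assumption: children are promoted based on the cached copy.
        for child, ps in self.parents.items():
            waiting = self.fw_states[child] == "WAITING"
            if waiting and all(self.wf_cache[p] == "COMPLETED" for p in ps):
                self._set(child, "READY")


lp = ToyLaunchPad({3: [1, 2]})
lp.run(1)
lp.run(2)
lp.complete(1)
lp.complete(2, refresh_cache=False)  # simulate a lost cache update
print(lp.fw_states)
print(lp.wf_cache)
```

After the last call, job 2 is COMPLETED in its own record but still RUNNING in the cached copy, and job 3 never leaves WAITING, which is the same pattern reported for fw_id 30 and fw_id 32.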

Do you happen to have any code example I can use to reproduce this problem? Then I can include it as a unit test and ask the contributor to fix it. We already have a basic unit test for this functionality and it passes, so I need an example of how to force a failure.

Also, if the output of the “-d all” option is too long, you can use “lpad get_wflows -i 31 -d more”, which will provide more information but not nearly as much as “all”.

Best,

Anubhav

Hi Anubhav,

Here is a gist for the output of "lpad get_wflows -i 32 -d all":

https://gist.github.com/anonymous/df811ff4f1aec422e44d

The workflow that I submitted uses a bunch of external (to python) dependencies, so maybe not the easiest to replicate on another system.

I could try to make a dummy workflow where the firetasks are sleep commands--that would be portable. What exactly do you have in mind? I'm guessing this could be a timing-related issue, but beyond that I'm not sure.

Regards,

Derek

Hi Derek,

Indeed, having a dummy workflow (with test tasks or with sleep tasks) that we can reproduce in its entirety would be ideal.
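
One possible shape for such a dummy workflow is sketched below. It only builds a plain dictionary and prints it as JSON; nothing here calls FireWorks. The helper name `build_sleep_workflow` and its parameters are hypothetical, and the layout (`fws`, `links`, `metadata`, `_tasks`, `_fw_name: ScriptTask`) is assumed to follow the workflow file format shown in the FireWorks tutorials. JSON requires string keys in `links`, and the sketch assumes FireWorks converts them back to integers on load.

```python
import json


def build_sleep_workflow(n_parents=2, sleep_seconds=1):
    """Build a workflow spec with `n_parents` independent sleep jobs
    followed by one child job that depends on all of them."""
    child_id = n_parents + 1
    fws = [
        {
            "fw_id": i,
            "name": f"sleep_{i}",
            "spec": {
                "_tasks": [
                    {"_fw_name": "ScriptTask", "script": f"sleep {sleep_seconds}"}
                ]
            },
        }
        for i in range(1, child_id + 1)
    ]
    # Every parent points at the single child.
    links = {str(i): [child_id] for i in range(1, child_id)}
    return {"fws": fws, "links": links, "metadata": {}}


print(json.dumps(build_sleep_workflow(), indent=2))
```

Saving the printed JSON to a file and adding it with "lpad add" would give a portable test case. Raising `n_parents` and varying the sleep time may help probe whether the mismatch is timing-related.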

Best,

Anubhav

I experienced this bug using fireworks version 1.04. Has any of the caching code changed since then?

I'm happy to go through the git logs myself if I know which files to look in.

I'm going to first rerun my entire workflow (using version 1.04) and see how reproducible the behavior is. If it is perfectly reproducible, I'll try it again in 1.1.3. If it reproduces there, I'll make a bunch of test/dummy jobs and share that code with you.

Regards,

Derek

            details. I'm happy to provide
             more.

             lpad get\_fws \-d count &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 514 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; lpad get_fws -s COMPLETED -d
            count
             508
             lpad get\_fws \-s WAITING \-d count &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 6 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; lpad get_fws -s FIZZLED -d count
             0

             Which Fireworks are waiting?

             $ lpad get_fws -s WAITING | grep
            fw_id | sort
             "fw_id": 32,
             "fw_id": 33,
             "fw_id": 34,
             "fw_id": 35,
             "fw_id": 36,
             "fw_id": 37,

             What was Firework with fw_id 31,
            and what happend to it?

             $ lpad get_fws -i 31 -d more
             [See
            https://gist.github.com/anonymous/10ea08044f574d190625]

             Looking in the launch directory
            for fw_id 31 (and looking at the
             yaml file I used to submit my
            workflow), I know that the Firework
             with fw_id 31 should be (as far as
            I can tell) the only
             dependency for the Firework with
            fw_id 32.

             Was a launch directory ever
            created for the Firework with fw_id
             32? It appears not:

             grep \-rl &#39;&quot;fw\_id&quot;: 32,&#39; \./\* &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

             (the same is true for the other
            "waiting" Fireworks)

             If I try to rerun these Firworks,
            still no luck:

             lpad rerun\_fws \-i 32 lpad
            rerun_fws -i 32
             2015-09-14 10:50:56,652 INFO
            Finished setting 1 FWs to rerun

             $ qlaunch -r singleshot
             2015-09-14 10:52:37,092 INFO No
            jobs exist in the LaunchPad for
             submission to queue!

             Is there anything else I should
            try/check/examine?

             Thank you,

             Derek

             PS: Apologies that this post also
            got appended to my previous
             "qluanching dependent jobs" thread

Hi Derek,

Examining the changelog (https://pythonhosted.org/FireWorks/changelog.html), I don’t see anything past 1.04 that would have affected this issue one way or the other. So if you are seeing it in 1.04, chances are that you will still see it in 1.1.3.

Best,

Anubhav


Hi Anubhav,

Well, my workflow didn't fail in the same spot, so reproducibility isn't deterministic. Some interesting things I've noticed:

----------

$ lpad get_fws -d count
514
$ lpad get_fws -s COMPLETED -d count
514
# So all jobs completed

$ lpad get_wflows -s RUNNING
{
     "name": "unnamed WF--1",
     "state": "RUNNING",
     "states_list": "C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-
C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-
C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C",
     "created_on": "2015-09-17T02:08:13.284000"
}
# Assuming C means COMPLETED, it looks like everything is done.
# However:

$ lpad get_wflows -s RUNNING -d all | grep RUNNING
         "20": "RUNNING",
         "29": "RUNNING",
         "56": "RUNNING",
         "110": "RUNNING",
         "83": "RUNNING",
         "65": "RUNNING",
         "94": "RUNNING",
         "11": "RUNNING",
     "state": "RUNNING",

----------

So it looks like maybe there's some inconsistency between "lpad get_wflows" calls, depending on the options passed? Maybe that narrows down the culprit?
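
For anyone who wants to check for this kind of disagreement without grepping by hand, here is a minimal sketch. It assumes you have already parsed the per-Firework states (from "lpad get_fws") and the per-workflow state mapping (from "lpad get_wflows -d all") into two Python dicts; the function name and the sample data are hypothetical and not part of FireWorks.

```python
import json


def find_state_mismatches(fw_states, wf_cached_states):
    """Return {fw_id: (fw_state, workflow_state)} for every disagreement.

    fw_states:        {int fw_id: state} as reported per Firework.
    wf_cached_states: {str fw_id: state} as reported inside the workflow.
    Both shapes are assumptions about how you parsed the lpad output.
    """
    mismatches = {}
    for fw_id, cached in wf_cached_states.items():
        actual = fw_states.get(int(fw_id))
        if actual is not None and actual != cached:
            mismatches[int(fw_id)] = (actual, cached)
    return mismatches


# fw_id 30 mirrors the disagreement reported earlier in this thread;
# the entry for fw_id 32 in the workflow mapping is made up for illustration.
fw_states = {30: "COMPLETED", 32: "WAITING"}
wf_cached_states = json.loads('{"30": "RUNNING", "32": "WAITING"}')

result = find_state_mismatches(fw_states, wf_cached_states)
print(result)
```

Any fw_id that shows up in the result is one where the two views disagree, which is the symptom described above.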

If we think that caching is a problem, is there a way to disable it in order to verify that?

Thanks,

Derek


Hi Derek,

I think it is related to the same issue I mentioned before, i.e. the caching. You can disable it by modifying the code - there is a get_wf_by_fw_id_lzyfw() method. Change the contents of the method to be the same as get_wf_by_fw_id(). Then it won’t use the “LazyFireWork” which has the caching.
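
As a rough illustration of that change without editing the installed source, the lazy method can be rebound to the non-lazy one at runtime. This is only a sketch: the helper name disable_lazy_lookup and the stand-in class are hypothetical, and it assumes both methods live on the same class (in FireWorks that would presumably be LaunchPad) and take the same arguments. It also only affects the Python process in which it runs, so jobs launched elsewhere would still use the original method.

```python
def disable_lazy_lookup(launchpad_cls):
    """Make get_wf_by_fw_id_lzyfw behave exactly like get_wf_by_fw_id."""
    launchpad_cls.get_wf_by_fw_id_lzyfw = launchpad_cls.get_wf_by_fw_id
    return launchpad_cls


# Stand-in class so the sketch runs without FireWorks installed.
class FakeLaunchPad:
    def get_wf_by_fw_id(self, fw_id):
        return ("eager", fw_id)

    def get_wf_by_fw_id_lzyfw(self, fw_id):
        return ("lazy", fw_id)


disable_lazy_lookup(FakeLaunchPad)
lookup = FakeLaunchPad().get_wf_by_fw_id_lzyfw(30)
print(lookup)

# With FireWorks installed, the equivalent call would presumably be:
#   from fireworks import LaunchPad
#   disable_lazy_lookup(LaunchPad)
```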

But I think it is best if you can send a runnable example. It doesn’t need to be 100% reproducible; even if it fails only 25% of the time, that is enough for us to at least try to debug. Otherwise we are stuck.
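
One possible starting point for such an example is a generated workflow file made only of sleep tasks. The sketch below builds a "fan-in" workflow (several parents, one child) and prints it as JSON. The file layout (fws / links / metadata, with ScriptTask entries under _tasks) follows the workflow file format as I understand it from the FireWorks tutorial, so please check it against your version; the function name and the fan-in shape are my own guesses at what might exercise the problem, not a confirmed reproducer.

```python
import json


def build_fan_in_workflow(n_parents, sleep_seconds=1):
    """Build a workflow dict: n_parents sleep jobs that all feed one child."""
    child_id = n_parents + 1
    sleep_task = {"_fw_name": "ScriptTask", "script": "sleep %d" % sleep_seconds}

    # Parents get fw_ids 1..n_parents; the child gets the next id.
    fws = [
        {"fw_id": fw_id, "spec": {"_tasks": [dict(sleep_task)]}}
        for fw_id in range(1, child_id + 1)
    ]
    # Every parent lists the child as its only dependent.
    links = {str(parent): [child_id] for parent in range(1, child_id)}
    return {"fws": fws, "links": links, "metadata": {}}


wf = build_fan_in_workflow(8)
print(json.dumps(wf, indent=2))
```

Redirecting the output to a file (for example sleep_wf.json) and adding it with "lpad add sleep_wf.json" should then give a portable workflow with no external dependencies.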

Best,

Anubhav

···

On Thu, Sep 17, 2015 at 12:52 PM, Derek [email protected] wrote:

Hi Anubhav,

Well, my workflow didn’t fail in the same spot, so reproducibility isn’t deterministic. Some interesting things I’ve noticed:


$ lpad get_fws -d count

514

$ lpad get_fws -s COMPLETED -d count

514

So all jobs completed

$ lpad get_wflows -s RUNNING

{

"name": "unnamed WF--1",

"state": "RUNNING",

"states_list": "C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-

C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-

C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C",

"created_on": "2015-09-17T02:08:13.284000"

}

Assuming C means COMPLETED, it looks like everything is done.

However:

$ lpad get_wflows -s RUNNING -d all | grep RUNNING

    "20": "RUNNING",

    "29": "RUNNING",

    "56": "RUNNING",

    "110": "RUNNING",

    "83": "RUNNING",

    "65": "RUNNING",

    "94": "RUNNING",

    "11": "RUNNING",

"state": "RUNNING",

So it looks like maybe there’s some inconsistency between “lpad get_wflows” calls, depending on the options passed? Maybe that narrows down the culprit?

If we think that caching is a problem, is there a way to disable it in order to verify that?

Thanks,

Derek

On Wed, 16 Sep 2015, Anubhav Jain wrote:

Hi Derek,

Examining the changelog (https://pythonhosted.org/FireWorks/changelog.html), I don’t see anything past 1.04 that would have affected this issue one way or the other. So if you are seeing it

in 1.04 chances are that you will still see it in 1.1.3.

Best,

Anubhav

On Wed, Sep 16, 2015 at 9:36 AM, Derek [email protected] wrote:

  I experienced this bug using fireworks version 1.04.  Has any of the caching code changed since then?



  I'm happy to go through the git logs myself if I know which files to look in.



  I'm going to first rerun my entire workflow (using version 1.04) and see how reproducible the behavior is.  If it is perfectly reproducible, I'll try it again in 1.1.3.  If it

  reproduces there, I'll make a bunch of test/dummy jobs and share that code with you.



  Regards,



  Derek



  On Tue, 15 Sep 2015, Anubhav Jain wrote:



        Hi Derek,

        Indeed, having a dummy workflow (with test tasks or with sleep tasks) that

        we can reproduce in its entirety would be ideal.



        Best,

        Anubhav



        On Tue, Sep 15, 2015 at 3:13 PM, Derek

        <[email protected]> wrote:

              Hi Anubhav,



              Here is a gist for the output of "lpad get_wflows -i 32 -d all":



              [https://gist.github.com/anonymous/df811ff4f1aec422e44d](https://gist.github.com/anonymous/df811ff4f1aec422e44d)



              The workflow that I submitted uses a bunch of external (to

              python) dependencies, so maybe not the easiest to replicate on

              another system.



              I could try to make a dummy workflow where the firetasks are

              sleep commands--that would be portable.  What exactly do you

              have in mind?  I'm guessing this could be a timing-related

              issue, but beyond that I'm not sure.



              Regards,



              Derek



              On Tue, 15 Sep 2015, Anubhav Jain wrote:



                    Hi Derek,

                    Thanks for reporting this - there is a potential for

                    the FW and WFlow states to not match; this is

                    because a contributor had pushed a change to cache

                    the FW states inside the WFlow to speed up the

                    overall operation of

                    FWS. This causes the storage of FW and WFlow states

                    to be separate. Unfortunately, if there is a bug in

                    this code, it would cause the problems that you see.



                    Do you happen to have any code example I can use to

                    reproduce this problem? Then I can include it as a

                    unit test and ask the contributor to fix it. We

                    already have a basic unit test for this

                    functionality and it passes,

                    so I need an example of how to force a failure.



                    Also, if the "-all" option is too long, you can also

                    use "lpad get_wflows -i 31 -d more" which will

                    provide more information but not nearly as much as

                    "all". 



                    Best,

                    Anubhav



                    On Tue, Sep 15, 2015 at 10:26 AM, Derek

                    <[email protected]> wrote:

                          Hi Anubhav,



                          Using lpad get_wflows, I see that one of the

                    other dependencies for fw_id 32, fw_id 30, is still

                    RUNNING according to:



                          $ lpad get_wflows -i 32 -d all | grep '"30"'

                          "30": "RUNNING",



                          However, lpad get_fws shows it as COMPLETED:



                          $ lpad get_fws -i 30

                          {

                              "name": "Unnamed FW",

                              "fw_id": 30,

                              "state": "COMPLETED",

                              "created_on":

                    "2015-09-10T16:45:38.431775",

                              "updated_on": "2015-09-10T16:52:00.453871"

                          }



                          The associated launcher_* for fw_id 30

                    directory has an empty .error file and a .out file

                    indicating that the Rocket finished and completed. 

                    The output that the job produced also looks correct

                    (I ran the

                          command manually and the results match).



                          It seems odd that get_wflows and get_fws show

                    different states for same job.



                          Happy to provide more info as necessary.  The

                    "-d all" option for lpad get_wflows produces a lot

                    of output (it's a big workflow), but I can include

                    it if it's helpful.



                          Regards,



                          Derek




Also, it would be helpful if you could share some details of your workflow, e.g., are you using any dynamic FWActions?

Best,

Anubhav


On Fri, Sep 18, 2015 at 11:15 AM, Anubhav Jain wrote:

Hi Derek,

I think it is related to the same issue I mentioned before, i.e. the caching. You can disable it by modifying the code - there is a get_wf_by_fw_id_lzyfw() method. Change the contents of the method to be the same as get_wf_by_fw_id(). Then it won’t use the “LazyFireWork” which has the caching.
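A minimal sketch of the same idea without editing the installed package: the monkey-patch below rebinds the lazy lookup to the regular one at runtime. This is a swapped-in technique, not the in-source edit described above. It assumes both methods accept the same arguments, and it only affects the Python process in which it runs, so lpad, rlaunch, and qlaunch would each need the patch applied in their own process. The helper name bypass_lazy_lookup is hypothetical.

```python
def bypass_lazy_lookup(launchpad_cls):
    """Rebind the lazy workflow lookup to the non-lazy one (hypothetical helper)."""
    # After this assignment, get_wf_by_fw_id_lzyfw() runs the body of
    # get_wf_by_fw_id(), so that call path no longer goes through the
    # lazy (cached) lookup.
    launchpad_cls.get_wf_by_fw_id_lzyfw = launchpad_cls.get_wf_by_fw_id
    return launchpad_cls


# Usage sketch (assumes FireWorks is installed; not executed here):
#
#   from fireworks.core.launchpad import LaunchPad
#   bypass_lazy_lookup(LaunchPad)
#   lp = LaunchPad.auto_load()
#   wf = lp.get_wf_by_fw_id_lzyfw(32)   # now behaves like get_wf_by_fw_id(32)
```

If the state mismatch disappears with the patch in place, that points at the caching path rather than at the job execution itself.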

But I think it is best if you can send a runnable example. It doesn’t need to be 100% reproducible; even if it fails only 25% of the time, that is enough for us to at least try to debug. Otherwise we are stuck.

Best,

Anubhav

On Thu, Sep 17, 2015 at 12:52 PM, Derek wrote:

Hi Anubhav,

Well, my workflow didn’t fail in the same spot, so the failure isn’t deterministic. Some interesting things I’ve noticed:


$ lpad get_fws -d count

514

$ lpad get_fws -s COMPLETED -d count

514

So all jobs completed

$ lpad get_wflows -s RUNNING

{

"name": "unnamed WF--1",

"state": "RUNNING",

"states_list": "C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-

C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-

C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C",

"created_on": "2015-09-17T02:08:13.284000"

}

Assuming C means COMPLETED, it looks like everything is done.

However:

$ lpad get_wflows -s RUNNING -d all | grep RUNNING

    "20": "RUNNING",

    "29": "RUNNING",

    "56": "RUNNING",

    "110": "RUNNING",

    "83": "RUNNING",

    "65": "RUNNING",

    "94": "RUNNING",

    "11": "RUNNING",

"state": "RUNNING",

So it looks like maybe there’s some inconsistency between “lpad get_wflows” calls, depending on the options passed? Maybe that narrows down the culprit?

If we think that caching is a problem, is there a way to disable it in order to verify that?

Thanks,

Derek

On Wed, 16 Sep 2015, Anubhav Jain wrote:

Hi Derek,

Examining the changelog (https://pythonhosted.org/FireWorks/changelog.html), I don’t see anything past 1.04 that would have affected this issue one way or the other. So if you are seeing it in 1.04, chances are that you will still see it in 1.1.3.

Best,

Anubhav

On Wed, Sep 16, 2015 at 9:36 AM, Derek wrote:

  I experienced this bug using fireworks version 1.04.  Has any of the caching code changed since then?



  I'm happy to go through the git logs myself if I know which files to look in.



  I'm going to first rerun my entire workflow (using version 1.04) and see how reproducible the behavior is.  If it is perfectly reproducible, I'll try it again in 1.1.3.  If it reproduces there, I'll make a bunch of test/dummy jobs and share that code with you.



  Regards,



  Derek



  On Tue, 15 Sep 2015, Anubhav Jain wrote:



        Hi Derek,

        Indeed, having a dummy workflow (with test tasks or with sleep tasks) that

        we can reproduce in its entirety would be ideal.



        Best,

        Anubhav



        On Tue, Sep 15, 2015 at 3:13 PM, Derek wrote:

              Hi Anubhav,



              Here is a gist for the output of "lpad get_wflows -i 32 -d all":



              [https://gist.github.com/anonymous/df811ff4f1aec422e44d](https://gist.github.com/anonymous/df811ff4f1aec422e44d)



              The workflow that I submitted uses a bunch of external (to

              python) dependencies, so maybe not the easiest to replicate on

              another system.



              I could try to make a dummy workflow where the firetasks are

              sleep commands--that would be portable.  What exactly do you

              have in mind?  I'm guessing this could be a timing-related

              issue, but beyond that I'm not sure.



              Regards,



              Derek



              On Tue, 15 Sep 2015, Anubhav Jain wrote:



                    Hi Derek,

                    Thanks for reporting this - there is a potential for

                    the FW and WFlow states to not match; this is

                    because a contributor had pushed a change to cache

                    the FW states inside the WFlow to speed up the

                    overall operation of

                    FWS. This causes the storage of FW and WFlow states

                    to be separate. Unfortunately, if there is a bug in

                    this code, it would cause the problems that you see.



                    Do you happen to have any code example I can use to

                    reproduce this problem? Then I can include it as a

                    unit test and ask the contributor to fix it. We

                    already have a basic unit test for this

                    functionality and it passes,

                    so I need an example of how to force a failure.



                    Also, if the "-d all" output is too long, you can
                    use "lpad get_wflows -i 31 -d more" instead, which
                    provides more information than the default but not
                    nearly as much as "all".



                    Best,

                    Anubhav




Also, another way to confirm that the caching is the problem is to inspect the MongoDB database directly.

i) Connect to MongoDB

ii) Look at the workflows collection inside your fireworks database

iii) Examine the “wf_states” key, and see if you find any discrepancies between that and the fireworks data.

But again, sending an example is the only way to really debug this.
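The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the key name is taken from this message ("wf_states") and passed as a parameter because it may be named differently in your version, and the collection names ("workflows", "fireworks") and the "nodes" field listing a workflow's fw_ids are assumptions about the database layout. The function names are hypothetical.

```python
def find_state_mismatches(workflow_doc, firework_docs, states_key="wf_states"):
    """Return {fw_id: (cached_state, stored_state)} for every disagreement."""
    # States cached inside the workflow document, keyed by fw_id.
    cached = workflow_doc.get(states_key, {})
    # States stored on the individual Firework documents.
    stored = {str(fw["fw_id"]): fw["state"] for fw in firework_docs}
    return {
        str(fw_id): (state, stored.get(str(fw_id)))
        for fw_id, state in cached.items()
        if stored.get(str(fw_id)) != state
    }


def check_workflow(db, fw_id, states_key="wf_states"):
    """Compare cached and stored states for the workflow containing fw_id.

    `db` is a pymongo Database object. The collection names and the
    "nodes" field are assumptions, not confirmed in this thread.
    """
    wf = db["workflows"].find_one({"nodes": fw_id})
    if wf is None:
        return {}
    fws = db["fireworks"].find(
        {"fw_id": {"$in": wf["nodes"]}}, {"fw_id": 1, "state": 1}
    )
    return find_state_mismatches(wf, fws, states_key)
```

With pymongo installed, something like check_workflow(MongoClient()["fireworks"], 32) would run the comparison, where the database name is again an assumption. An empty result means the cached and stored states agree; an entry such as "30": ("RUNNING", "COMPLETED") would match the symptom reported earlier in this thread.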

                                On Monday, September 14, 2015 at 11:05:01 AM UTC-7, Derek wrote:

                                      PS: Apologies that this post also

                    got appended to my previous

                                      "qluanching dependent jobs" thread

Ok, I'll try and produce a shareable version that reproduces the behavior.

I'm not doing anything fancy like dynamic FWActions. Data gets passed between Fireworks and their FireTasks through (predetermined) files.

Thanks,

Derek

Hi Derek,

One of the FWS developers noted a problem in WFLock that might be causing your issue. A fix is in v1.1.7, the latest version.

Some things you can try:

  • v1.1.7 should hopefully run your workflows reliably without problem

  • If there are problems, the “lpad detect_lostruns” command should be able to detect inconsistent workflows. Furthermore, you can fix them using “lpad detect_lostruns --refresh”. This should fix any workflows from previous FWS versions, but hopefully this should not be needed in v1.1.7.
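
To make "inconsistent" concrete, here is a minimal sketch of the kind of comparison involved. It assumes you have already read the workflow's cached per-Firework states and the states stored on the Fireworks themselves into two plain dicts. The function name, the dict shapes, and the sample data are hypothetical illustrations, not part of the FireWorks API, and this is not how `lpad detect_lostruns` is implemented.

```python
def find_state_mismatches(cached_states, firework_states):
    """Return {fw_id: (cached, actual)} for every Firework whose two states differ.

    cached_states:   {"<fw_id as str>": state} as cached on the workflow (assumed shape)
    firework_states: {fw_id as int: state} as stored on each Firework (assumed shape)
    """
    mismatches = {}
    for fw_id, cached in cached_states.items():
        actual = firework_states.get(int(fw_id))
        if actual != cached:
            mismatches[int(fw_id)] = (cached, actual)
    return mismatches


# Illustrative data only: it mirrors the symptom reported in this thread, where
# the workflow view showed fw_id 30 as RUNNING while the Firework was COMPLETED.
cached = {"30": "RUNNING", "31": "COMPLETED", "32": "WAITING"}
actual = {30: "COMPLETED", 31: "COMPLETED", 32: "WAITING"}

print(find_state_mismatches(cached, actual))  # {30: ('RUNNING', 'COMPLETED')}
```

Any fw_id this reports is one where the workflow's cached copy is stale, so its children can stay WAITING even though the parent has finished.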

Best,

Anubhav

Ok, I'll try this out and report back. Apologies that I haven't had time to pursue the debugging further.

Regards,

Derek

On Wed, 21 Oct 2015, Anubhav Jain wrote:

Hi Derek,
One of the FWS developers noted a problem in WFLock that might be causing your issue. A fix is in v1.1.7, the latest version.

Some things you can try:

* v1.1.7 should hopefully run your workflows reliably without problem
* If there are problems, the "lpad detect_lostruns" command should be able to detect inconsistent workflows. Furthermore, you can fix them using "lpad detect_lostruns --refresh". This should fix any workflows from previous FWS versions, but hopefully this should not be needed in v1.1.7.
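
For anyone who wants to see by hand what an "inconsistent workflow" looks like, here is a minimal sketch (not FireWorks code) that compares the per-Firework states cached on a workflow document with the states stored on the Fireworks themselves. It works on plain dictionaries, so it makes no assumptions about the database connection. The function name, the sample documents and the default key name "fw_states" are my own assumptions (elsewhere in this thread the key is referred to as "wf_states"), so check the key against your own workflows collection before relying on it.

```python
from typing import Dict, List, Tuple


def find_state_mismatches(
    workflow_doc: dict,
    firework_docs: List[dict],
    cache_key: str = "fw_states",  # assumed key name; verify in your database
) -> List[Tuple[int, str, str]]:
    """Return (fw_id, cached_state, actual_state) for every disagreement."""
    # The cached states are keyed by the fw_id as a string, e.g. {"30": "RUNNING"}
    cached: Dict[str, str] = workflow_doc.get(cache_key, {})
    mismatches = []
    for fw in firework_docs:
        fw_id = fw["fw_id"]
        cached_state = cached.get(str(fw_id))
        # Only report Fireworks that the workflow knows about and disagrees on
        if cached_state is not None and cached_state != fw["state"]:
            mismatches.append((fw_id, cached_state, fw["state"]))
    return sorted(mismatches)


# Hypothetical documents mirroring the situation reported in this thread:
# the workflow still believes fw_id 30 is RUNNING, the Firework says COMPLETED.
workflow_doc = {"fw_states": {"30": "RUNNING", "31": "COMPLETED", "32": "WAITING"}}
firework_docs = [
    {"fw_id": 30, "state": "COMPLETED"},
    {"fw_id": 31, "state": "COMPLETED"},
    {"fw_id": 32, "state": "WAITING"},
]

print(find_state_mismatches(workflow_doc, firework_docs))
# prints [(30, 'RUNNING', 'COMPLETED')]
```

In practice the two inputs would come from the workflows and fireworks collections of your FireWorks database (for example via a MongoDB client). Any Firework reported here is one whose children can stay stuck in WAITING even though the parent has finished.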

Best,
Anubhav

On Friday, September 18, 2015 at 10:12:48 AM UTC-7, Derek wrote:
      Ok, I'll try and produce a shareable version that reproduces the
      behavior.

      I'm not doing anything fancy like dynamic FWActions. Data gets
      passed between Fireworks and their FireTasks through
      (predetermined) files.

      Thanks,

      Derek

      On Fri, 18 Sep 2015, Anubhav Jain wrote:

      > Also, another way to confirm that the caching is the problem is to inspect the MongoDB.
      > i) Connect to MongoDB
      > ii) Look at the workflows collection inside your fireworks database
      > iii) Examine the "wf_states" key, and see if you find any discrepancies between that and the fireworks data.
      >
      > But again, sending an example is the only way to really debug this.
      >
      > On Fri, Sep 18, 2015 at 11:21 AM, Anubhav Jain <[email protected]> wrote:
      > Also, it would be helpful if you can indicate some details of your workflow. e.g., are you using any dynamic
      > FWActions?
      > Best,
      > Anubhav
      >
      > On Fri, Sep 18, 2015 at 11:15 AM, Anubhav Jain <[email protected]> wrote:
      > Hi Derek,
      >
      > I think it is related to the same issue I mentioned before, i.e. the caching. You can disable it by modifying
      > the code - there is a get_wf_by_fw_id_lzyfw() method. Change the contents of the method to be the same as
      > get_wf_by_fw_id(). Then it won't use the "LazyFireWork" which has the caching.
      >
      > *But* I think it is best if you can send a runnable example. It doesn't need to be 100% reproducible, even if it
      > fails 25% of the time it is enough that we can at least try to debug. Otherwise we are stuck.
      >
      > Best,
      > Anubhav
      >
      > On Thu, Sep 17, 2015 at 12:52 PM, Derek <[email protected]> wrote:
      > Hi Anubhav,
      >
      > Well, my workflow didn't fail in the same spot, so reproducibility isn't deterministic. Some
      > interesting things I've noticed:
      >
      > ----------
      >
      > $ lpad get_fws -d count
      > 514
      > $ lpad get_fws -s COMPLETED -d count
      > 514
      > # So all jobs completed
      >
      > $ lpad get_wflows -s RUNNING
      > {
      > "name": "unnamed WF--1",
      > "state": "RUNNING",
      > "states_list":
      > "C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-
      > C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-
      > C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C-C",
      > "created_on": "2015-09-17T02:08:13.284000"
      > }
      > # Assuming C means COMPLETED, it looks like everything is done.
      > # However:
      > $ lpad get_wflows -s RUNNING -d all | grep RUNNING
      > "20": "RUNNING",
      > "29": "RUNNING",
      > "56": "RUNNING",
      > "110": "RUNNING",
      > "83": "RUNNING",
      > "65": "RUNNING",
      > "94": "RUNNING",
      > "11": "RUNNING",
      > "state": "RUNNING",
      >
      > ----------
      >
      > So it looks like maybe there's some inconsistency between "lpad get_wflows" calls, depending on the
      > options passed? Maybe that narrows down the culprit?
      >
      > If we think that caching is a problem, is there a way to disable it in order to verify that?
      >
      > Thanks,
      >
      > Derek
      >
      > On Wed, 16 Sep 2015, Anubhav Jain wrote:
      >
      > Hi Derek,
      > Examining the changelog (https://pythonhosted.org/FireWorks/changelog.html), I don't see
      > anything past 1.04 that would have affected this issue one way or the other. So if you
      > are seeing it
      > in 1.04 chances are that you will still see it in 1.1.3.
      >
      > Best,
      > Anubhav
      >
      > On Wed, Sep 16, 2015 at 9:36 AM, Derek <[email protected]> wrote:
      > I experienced this bug using fireworks version 1.04. Has any of the caching code
      > changed since then?
      >
      > I'm happy to go through the git logs myself if I know which files to look in.
      >
      > I'm going to first rerun my entire workflow (using version 1.04) and see how
      > reproducible the behavior is. If it is perfectly reproducible, I'll try it again in
      > 1.1.3. If it
      > reproduces there, I'll make a bunch of test/dummy jobs and share that code with
      > you.
      >
      > Regards,
      >
      > Derek
      >
      > On Tue, 15 Sep 2015, Anubhav Jain wrote:
      >
      > Hi Derek,
      > Indeed, having a dummy workflow (with test tasks or with sleep tasks) that
      > we can reproduce in its entirety would be ideal.
      >
      > Best,
      > Anubhav
      >
      > On Tue, Sep 15, 2015 at 3:13 PM, Derek <[email protected]> wrote:
      > Hi Anubhav,
      >
      > Here is a gist for the output of "lpad get_wflows -i 32 -d all":
      >
      > https://gist.github.com/anonymous/df811ff4f1aec422e44d
      >
      > The workflow that I submitted uses a bunch of external (to
      > python) dependencies, so maybe not the easiest to replicate on
      > another system.
      >
      > I could try to make a dummy workflow where the firetasks are
      > sleep commands--that would be portable. What exactly do you
      > have in mind? I'm guessing this could be a timing-related
      > issue, but beyond that I'm not sure.
      >
      > Regards,
      >
      > Derek
      >
      > On Tue, 15 Sep 2015, Anubhav Jain wrote:
      >
      > Hi Derek,
      > Thanks for reporting this - there is a potential for
      > the FW and WFlow states to not match; this is
      > because a contributor had pushed a change to cache
      > the FW states inside the WFlow to speed up the
      > overall operation of
      > FWS. This causes the storage of FW and WFlow states
      > to be separate. Unfortunately, if there is a bug in
      > this code, it would cause the problems that you see.
      >
      > Do you happen to have any code example I can use to
      > reproduce this problem? Then I can include it as a
      > unit test and ask the contributor to fix it. We
      > already have a basic unit test for this
      > functionality and it passes,
      > so I need an example of how to force a failure.
      >
      > Also, if the "-all" option is too long, you can also
      > use "lpad get_wflows -i 31 -d more" which will
      > provide more information but not nearly as much as
      > "all".
      >
      > Best,
      > Anubhav
      >
      > On Tue, Sep 15, 2015 at 10:26 AM, Derek <[email protected]> wrote:
      > Hi Anubhav,
      >
      > Using lpad get_wflows, I see that one of the
      > other dependencies for fw_id 32, fw_id 30, is still
      > RUNNING according to:
      >
      > $ lpad get_wflows -i 32 -d all | grep '"30"'
      > "30": "RUNNING",
      >
      > However, lpad get_fws shows it as COMPLETED:
      >
      > $ lpad get_fws -i 30
      > {
      > "name": "Unnamed FW",
      > "fw_id": 30,
      > "state": "COMPLETED",
      > "created_on":
      > "2015-09-10T16:45:38.431775",
      > "updated_on": "2015-09-10T16:52:00.453871"
      > }
      >
      > The associated launcher_* directory for fw_id 30
      > has an empty .error file and a .out file
      > indicating that the Rocket finished and completed.
      > The output that the job produced also looks correct
      > (I ran the
      > command manually and the results match).
      >
      > It seems odd that get_wflows and get_fws show
      > different states for the same job.
      >
      > Happy to provide more info as necessary. The
      > "-d all" option for lpad get_wflows produces a lot
      > of output (it's a big workflow), but I can include
      > it if it's helpful.
      >
      > Regards,
      >
      > Derek
      >
      > On Mon, 14 Sep 2015, Anubhav Jain wrote:
      >
      > Hi Derek,
      > The "WAITING" state indicates that there
      > is a job dependency, i.e. a parent job that the
      > current job is waiting to be COMPLETED. Those FWs
      > will not be picked up by rlaunch or qlaunch. Only
      > "READY"
      > jobs are set to be run.
      > See a description of states here:
      >
      >
      > https://pythonhosted.org/FireWorks/reference.html#interpretation-of-state-of-fws-and-wfs
      >
      > Next, you probably want to try to debug
      > why the FW is not READY yet. One thing you can try
      > is querying the state of all the jobs within the
      > workflow of the affected FW, e.g.:
      >
      > lpad get_wflows -i 31
      >
      > where "31" is the id of one of your
      > affected FWS. This should print out a report of all
      > states in your workflow and you can probably see
      > immediately where things got stuck.
      >
      > If things are still not clear after that
      > and you are confident that something might be wrong,
      > type the command:
      >
      > lpad get_wflows -i 31 -d all
      >
      > and paste the result back into this list
      > so I can take a look at all the links in the
      > workflow and what is happening.
      >
      > Best,
      > Anubhav
      >
      > On Monday, September 14, 2015 at 11:05:01 AM UTC-7, Derek wrote:
      > Hello,
      >
      > Over the weekend I submitted just
      > over 500 jobs to Fireworks
      > (this is the largest pipeline I've
      > tried to date) and executed
      > them using:
      >
      > qlaunch -r rapidfire --nlaunches
      > infinite --sleep 60
      > --maxjobs_queue 50
      >
      > All but 6 of them completed
      > successfully and I'm trying to figure
      > out what's happened with those 6.
      > If I try "qlaunch" or
      > "rlaunch", neither command
      > recognizes that there are few jobs
      > left to complete. For example:
      >
      > $ qlaunch -r singleshot
      > 2015-09-14 10:47:45,974 INFO No
      > jobs exist in the LaunchPad for
      > submission to queue!
      >
      > Here are some (hopefully) relevant
      > details. I'm happy to provide
      > more.
      >
      > $ lpad get_fws -d count
      > 514
      > $ lpad get_fws -s COMPLETED -d count
      > 508
      > $ lpad get_fws -s WAITING -d count
      > 6
      > $ lpad get_fws -s FIZZLED -d count
      > 0
      >
      > Which Fireworks are waiting?
      >
      > $ lpad get_fws -s WAITING | grep fw_id | sort
      > "fw_id": 32,
      > "fw_id": 33,
      > "fw_id": 34,
      > "fw_id": 35,
      > "fw_id": 36,
      > "fw_id": 37,
      >
      > What was the Firework with fw_id 31,
      > and what happened to it?
      >
      > $ lpad get_fws -i 31 -d more
      > [See
      > https://gist.github.com/anonymous/10ea08044f574d190625]
      >
      > Looking in the launch directory
      > for fw_id 31 (and looking at the
      > yaml file I used to submit my
      > workflow), I know that the Firework
      > with fw_id 31 should be (as far as
      > I can tell) the only
      > dependency for the Firework with
      > fw_id 32.
      >
      > Was a launch directory ever
      > created for the Firework with fw_id
      > 32? It appears not:
      >
      > $ grep -rl '"fw_id": 32,' ./*
      > $
      >
      > (the same is true for the other
      > "waiting" Fireworks)
      >
      > If I try to rerun these Fireworks,
      > still no luck:
      >
      > $ lpad rerun_fws -i 32
      > 2015-09-14 10:50:56,652 INFO
      > Finished setting 1 FWs to rerun
      >
      > $ qlaunch -r singleshot
      > 2015-09-14 10:52:37,092 INFO No
      > jobs exist in the LaunchPad for
      > submission to queue!
      >
      > Is there anything else I should
      > try/check/examine?
      >
      > Thank you,
      >
      > Derek
      >
      > PS: Apologies that this post also
      > got appended to my previous
      > "qluanching dependent jobs" thread
      >