defuse_wflows and rerun_fws very slow on fireworks 1.2.5

Chris · April 4, 2016, 3:14pm

Hi,

Is this a known issue? issuing a command like “rerun_fws” with 19 items takes 5-10minutes to return. mongo shows very little activity via top. My database has about 12,000 entries I believe, so even a linear scan of the whole database would seem faster than I am experiencing in this data size range…

defuse on a 3,000 node workflow takes 20+ minutes

Anyone else experience this?

Any tips on where I might be going wrong? I’m mostly wondering if mongo is just set up poorly, I guess.

Thanks

Chris · April 4, 2016, 3:15pm

I should add that python is pegged at 100% cpu, so I guess maybe there is a code issue, or the way I am using fireworks is not recommended in some way

···

On Monday, April 4, 2016 at 11:14:18 AM UTC-4, Chris H wrote:

Hi,

Is this a known issue? issuing a command like “rerun_fws” with 19 items takes 5-10minutes to return. mongo shows very little activity via top. My database has about 12,000 entries I believe, so even a linear scan of the whole database would seem faster than I am experiencing in this data size range…

defuse on a 3,000 node workflow takes 20+ minutes

Anyone else experience this?

Any tips on where I might be going wrong? I’m mostly wondering if mongo is just set up poorly, I guess.

Thanks

Anubhav_Jain · April 5, 2016, 5:45pm

Hi Chris,

I haven’t done much testing on rerunning with 3000 node workflow. We though we had solved a lot of such performance issues prior to v1.0, but it might be that certain problems with large workflows are still there. Note that the rerun command recursively loops through your workflow to see what dependencies might need updating. We tried to minimize the amount of data fetched during this process before, but again there might still be problems.

My suggestion is:

During database downtime, run “lpad admin tuneup --full”. This will compact your Mongo database, set/update/refresh indices, etc. I doubt this is the problem, but worth checking. If downtime is not an option, I would remove the “–full” parameter.
If the problem still persists, perhaps serialize and send over one of your workflows (or an equivalent example large workflow that we can run) along with some steps to reproduce. I can check it on my system and try to offer advice or see if someone can work on this.

Best,

Anubhav

···

On Monday, April 4, 2016 at 8:15:47 AM UTC-7, Chris H wrote:

I should add that python is pegged at 100% cpu, so I guess maybe there is a code issue, or the way I am using fireworks is not recommended in some way

On Monday, April 4, 2016 at 11:14:18 AM UTC-4, Chris H wrote:

Hi,

Is this a known issue? issuing a command like “rerun_fws” with 19 items takes 5-10minutes to return. mongo shows very little activity via top. My database has about 12,000 entries I believe, so even a linear scan of the whole database would seem faster than I am experiencing in this data size range…

defuse on a 3,000 node workflow takes 20+ minutes

Anyone else experience this?

Any tips on where I might be going wrong? I’m mostly wondering if mongo is just set up poorly, I guess.

Thanks

Chris · April 21, 2016, 4:39pm

Thanks for the reply, i failed to check email updates so didn’t realize you had written back.

I cannot run tuneup while workflows are running, is that correct?

···

On Tuesday, April 5, 2016 at 1:45:16 PM UTC-4, Anubhav Jain wrote:

Hi Chris,

I haven’t done much testing on rerunning with 3000 node workflow. We though we had solved a lot of such performance issues prior to v1.0, but it might be that certain problems with large workflows are still there. Note that the rerun command recursively loops through your workflow to see what dependencies might need updating. We tried to minimize the amount of data fetched during this process before, but again there might still be problems.

My suggestion is:

During database downtime, run " lpad admin tuneup --full". This will compact your Mongo database, set/update/refresh indices, etc. I doubt this is the problem, but worth checking. If downtime is not an option, I would remove the “–full” parameter.

If the problem still persists, perhaps serialize and send over one of your workflows (or an equivalent example large workflow that we can run) along with some steps to reproduce. I can check it on my system and try to offer advice or see if someone can work on this.

Best,

Anubhav

On Monday, April 4, 2016 at 8:15:47 AM UTC-7, Chris H wrote:

I should add that python is pegged at 100% cpu, so I guess maybe there is a code issue, or the way I am using fireworks is not recommended in some way

On Monday, April 4, 2016 at 11:14:18 AM UTC-4, Chris H wrote:

Hi,

Is this a known issue? issuing a command like “rerun_fws” with 19 items takes 5-10minutes to return. mongo shows very little activity via top. My database has about 12,000 entries I believe, so even a linear scan of the whole database would seem faster than I am experiencing in this data size range…

defuse on a 3,000 node workflow takes 20+ minutes

Anyone else experience this?

Any tips on where I might be going wrong? I’m mostly wondering if mongo is just set up poorly, I guess.

Thanks

Chris · April 21, 2016, 4:42pm

Sorry, I see you said just don’t use --full. whats the difference?

···

On Thursday, April 21, 2016 at 12:39:55 PM UTC-4, Chris H wrote:

Thanks for the reply, i failed to check email updates so didn’t realize you had written back.

I cannot run tuneup while workflows are running, is that correct?

On Tuesday, April 5, 2016 at 1:45:16 PM UTC-4, Anubhav Jain wrote:

Hi Chris,

I haven’t done much testing on rerunning with 3000 node workflow. We though we had solved a lot of such performance issues prior to v1.0, but it might be that certain problems with large workflows are still there. Note that the rerun command recursively loops through your workflow to see what dependencies might need updating. We tried to minimize the amount of data fetched during this process before, but again there might still be problems.

My suggestion is:

During database downtime, run " lpad admin tuneup --full". This will compact your Mongo database, set/update/refresh indices, etc. I doubt this is the problem, but worth checking. If downtime is not an option, I would remove the “–full” parameter.

If the problem still persists, perhaps serialize and send over one of your workflows (or an equivalent example large workflow that we can run) along with some steps to reproduce. I can check it on my system and try to offer advice or see if someone can work on this.

Best,

Anubhav

On Monday, April 4, 2016 at 8:15:47 AM UTC-7, Chris H wrote:

I should add that python is pegged at 100% cpu, so I guess maybe there is a code issue, or the way I am using fireworks is not recommended in some way

On Monday, April 4, 2016 at 11:14:18 AM UTC-4, Chris H wrote:

Hi,

Is this a known issue? issuing a command like “rerun_fws” with 19 items takes 5-10minutes to return. mongo shows very little activity via top. My database has about 12,000 entries I believe, so even a linear scan of the whole database would seem faster than I am experiencing in this data size range…

defuse on a 3,000 node workflow takes 20+ minutes

Anyone else experience this?

Any tips on where I might be going wrong? I’m mostly wondering if mongo is just set up poorly, I guess.

Thanks

Anubhav_Jain · April 21, 2016, 4:54pm

The regular tuneup will make sure all the indices are set and updated. You can run this while workflows are running. It might slow things down a bit but nothing bad should happen.

The --full will also run a compact() command on Mongo

Honestly, I doubt this is your problem. But there is no harm in trying

Best,

Anubhav

···

On Thu, Apr 21, 2016 at 9:42 AM, Chris H [email protected] wrote:

Sorry, I see you said just don’t use --full. whats the difference?

On Thursday, April 21, 2016 at 12:39:55 PM UTC-4, Chris H wrote:

Thanks for the reply, i failed to check email updates so didn’t realize you had written back.

I cannot run tuneup while workflows are running, is that correct?

On Tuesday, April 5, 2016 at 1:45:16 PM UTC-4, Anubhav Jain wrote:

Hi Chris,

I haven’t done much testing on rerunning with 3000 node workflow. We though we had solved a lot of such performance issues prior to v1.0, but it might be that certain problems with large workflows are still there. Note that the rerun command recursively loops through your workflow to see what dependencies might need updating. We tried to minimize the amount of data fetched during this process before, but again there might still be problems.

My suggestion is:

During database downtime, run " lpad admin tuneup --full". This will compact your Mongo database, set/update/refresh indices, etc. I doubt this is the problem, but worth checking. If downtime is not an option, I would remove the “–full” parameter.

If the problem still persists, perhaps serialize and send over one of your workflows (or an equivalent example large workflow that we can run) along with some steps to reproduce. I can check it on my system and try to offer advice or see if someone can work on this.

Best,

Anubhav

On Monday, April 4, 2016 at 8:15:47 AM UTC-7, Chris H wrote:

I should add that python is pegged at 100% cpu, so I guess maybe there is a code issue, or the way I am using fireworks is not recommended in some way

On Monday, April 4, 2016 at 11:14:18 AM UTC-4, Chris H wrote:

Hi,

Is this a known issue? issuing a command like “rerun_fws” with 19 items takes 5-10minutes to return. mongo shows very little activity via top. My database has about 12,000 entries I believe, so even a linear scan of the whole database would seem faster than I am experiencing in this data size range…

defuse on a 3,000 node workflow takes 20+ minutes

Anyone else experience this?

Any tips on where I might be going wrong? I’m mostly wondering if mongo is just set up poorly, I guess.

Thanks

–

You received this message because you are subscribed to the Google Groups “fireworkflows” group.

To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].

To post to this group, send email to [email protected].

Visit this group at https://groups.google.com/group/fireworkflows.

To view this discussion on the web visit https://groups.google.com/d/msgid/fireworkflows/406c3676-60dc-43ad-81c4-9d1f14671f98%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Chris · April 29, 2016, 3:55pm

Thanks Anubhav.

I am running a 3,000 node workflow that represents a real workflow we have in terms of # nodes and # of links, but have replaced all the tasks with a call to “time”. I’ll verify that it has the same problem and forward it to you.

In the meantime, I have a related problem:

I was trying to restart 500 jobs because of an out of disk failure that killed them. I fixed the space issue and wanted to rerun them.

The restart took so long (8+hrs) that my ssh connection dropped overnight.

Now I have inconsistencies (I am guessing the command was holding a WFlock when it got killed due to terminal disconnect) that the detect_lostruns command line seems unable to recover from:

lpad -l charris.yaml detect_lostruns --fizzle

successfully loaded your custom FW_config.yaml!

2016-04-29 11:39:27,584 DEBUG Detected 1 lost launches: [5395]

2016-04-29 11:39:27,584 INFO Detected 1 lost FWs: [2934]

2016-04-29 11:39:27,585 INFO Detected 231 inconsistent FWs: [8042, 8040, 8038, 8036, 8034, 8032, 8030, 8028, 8026, 8024, 8016, 8014, 8012, 8010, 8008, 8006, 8004, 8002, 8000, 7997, 7996, 7994, 7992, 7990, 7981, 7979, 7977, 7975, 7973, 7971, 7969, 7967, 7965, 7963, 7961, 7959, 7957, 7955, 7953, 7945, 7943, 7941, 7939, 7937, 7935, 7933, 7931, 7929, 7927, 7925, 7923, 7921, 7919, 7917, 7915, 7907, 7905, 7903, 7901, 7899, 7897, 7895, 7893, 7891, 7889, 7887, 7885, 7883, 7881, 7879, 7871, 7869, 7867, 7865, 7863, 7861, 7859, 7857, 7855, 7853, 7851, 7849, 7847, 7845, 7843, 7835, 7833, 7831, 7828, 7825, 7823, 7817, 7815, 7813, 7811, 7809, 7807, 7799, 7797, 7793, 7791, 7789, 7787, 7785, 7783, 7781, 7779, 7777, 7775, 7773, 7771, 7763, 7761, 7759, 7757, 7755, 7753, 7751, 7749, 7747, 7745, 7741, 7739, 7737, 7735, 7727, 7725, 7723, 7721, 7719, 7717, 7713, 7711, 7709, 7707, 7705, 7703, 7701, 7699, 7691, 7687, 7685, 7683, 7681, 7679, 7677, 7675, 7673, 7671, 7669, 7667, 7653, 7651, 7649, 7647, 7645, 7643, 7641, 7637, 7635, 7633, 7631, 7629, 7627, 7619, 7617, 7615, 7613, 7611, 7609, 7607, 7605, 7603, 7601, 7599, 7597, 7595, 7593, 7591, 7547, 7545, 7543, 7541, 7539, 7537, 7533, 7531, 7527, 7525, 7523, 7521, 7519, 7292, 7290, 7288, 7282, 7280, 7278, 7276, 7274, 7272, 7270, 7268, 7226, 7224, 6888, 6884, 6882, 8735, 6405, 6403, 7512, 7479, 7443, 7440, 7407, 7404, 7371, 7368, 7335, 7262, 7198, 7195, 7166, 7134, 7102, 7068, 7065, 7036, 6972, 6937]

You can fix inconsistent FWs using the --refresh argument to the detect_lostruns command

charris@s01 ~/code>lpad -l charris.yaml detect_lostruns --fizzle --refresh

successfully loaded your custom FW_config.yaml!

2016-04-29 11:44:45,397 INFO fw_id 8042 locked. Can’t refresh!

…

Do you have any advice on a manual mongo repair at this point? Or what are my options?

···

On Thursday, April 21, 2016 at 12:55:34 PM UTC-4, ajain wrote:

The regular tuneup will make sure all the indices are set and updated. You can run this while workflows are running. It might slow things down a bit but nothing bad should happen.

The --full will also run a compact() command on Mongo

Honestly, I doubt this is your problem. But there is no harm in trying

Best,

Anubhav

On Thu, Apr 21, 2016 at 9:42 AM, Chris H [email protected] wrote:

Sorry, I see you said just don’t use --full. whats the difference?

On Thursday, April 21, 2016 at 12:39:55 PM UTC-4, Chris H wrote:

Thanks for the reply, i failed to check email updates so didn’t realize you had written back.

I cannot run tuneup while workflows are running, is that correct?

On Tuesday, April 5, 2016 at 1:45:16 PM UTC-4, Anubhav Jain wrote:

Hi Chris,

I haven’t done much testing on rerunning with 3000 node workflow. We though we had solved a lot of such performance issues prior to v1.0, but it might be that certain problems with large workflows are still there. Note that the rerun command recursively loops through your workflow to see what dependencies might need updating. We tried to minimize the amount of data fetched during this process before, but again there might still be problems.

My suggestion is:

During database downtime, run " lpad admin tuneup --full". This will compact your Mongo database, set/update/refresh indices, etc. I doubt this is the problem, but worth checking. If downtime is not an option, I would remove the “–full” parameter.

If the problem still persists, perhaps serialize and send over one of your workflows (or an equivalent example large workflow that we can run) along with some steps to reproduce. I can check it on my system and try to offer advice or see if someone can work on this.

Best,

Anubhav

On Monday, April 4, 2016 at 8:15:47 AM UTC-7, Chris H wrote:

I should add that python is pegged at 100% cpu, so I guess maybe there is a code issue, or the way I am using fireworks is not recommended in some way

On Monday, April 4, 2016 at 11:14:18 AM UTC-4, Chris H wrote:

Hi,

Is this a known issue? issuing a command like “rerun_fws” with 19 items takes 5-10minutes to return. mongo shows very little activity via top. My database has about 12,000 entries I believe, so even a linear scan of the whole database would seem faster than I am experiencing in this data size range…

defuse on a 3,000 node workflow takes 20+ minutes

Anyone else experience this?

Any tips on where I might be going wrong? I’m mostly wondering if mongo is just set up poorly, I guess.

Thanks

–

You received this message because you are subscribed to the Google Groups “fireworkflows” group.

To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].

To post to this group, send email to [email protected].

Visit this group at https://groups.google.com/group/fireworkflows.

To view this discussion on the web visit https://groups.google.com/d/msgid/fireworkflows/406c3676-60dc-43ad-81c4-9d1f14671f98%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Anubhav_Jain · April 29, 2016, 4:28pm

Hi Chris,

Ok, I will wait for your large workflow example. I hope I can eventually help out in terms of all the timing issues with large workflows.

As for the workflow lock, you have two options. For both these options, please make sure you don’t have any jobs running and things are static. Also, back up your database just in case using mongodump prior to doing anything.

Option 1: pull the latest master branch of FireWorks from github and run “python setup.py install” or “python setup.py develop” (or whatever you use to install). There is a new command called “lpad admin unlock --help” that will help you unlock Workflows. Let me know if you find any issues with it. I can push this as a pip release if it helps, although I haven’t tried it yet.

Option 2: You can do a manual fix by connecting to your fireworks database via Mongo shell and running the following command:

db.workflows.find_and_modify({“nodes”: <FW_ID>}, {$unset: {“locked”: true}})

Hope it helps

Anubhav

···

On Fri, Apr 29, 2016 at 8:55 AM, Chris H [email protected] wrote:

Thanks Anubhav.

I am running a 3,000 node workflow that represents a real workflow we have in terms of # nodes and # of links, but have replaced all the tasks with a call to “time”. I’ll verify that it has the same problem and forward it to you.

In the meantime, I have a related problem:

I was trying to restart 500 jobs because of an out of disk failure that killed them. I fixed the space issue and wanted to rerun them.

The restart took so long (8+hrs) that my ssh connection dropped overnight.

Now I have inconsistencies (I am guessing the command was holding a WFlock when it got killed due to terminal disconnect) that the detect_lostruns command line seems unable to recover from:

lpad -l charris.yaml detect_lostruns --fizzle

successfully loaded your custom FW_config.yaml!

2016-04-29 11:39:27,584 DEBUG Detected 1 lost launches: [5395]

2016-04-29 11:39:27,584 INFO Detected 1 lost FWs: [2934]

2016-04-29 11:39:27,585 INFO Detected 231 inconsistent FWs: [8042, 8040, 8038, 8036, 8034, 8032, 8030, 8028, 8026, 8024, 8016, 8014, 8012, 8010, 8008, 8006, 8004, 8002, 8000, 7997, 7996, 7994, 7992, 7990, 7981, 7979, 7977, 7975, 7973, 7971, 7969, 7967, 7965, 7963, 7961, 7959, 7957, 7955, 7953, 7945, 7943, 7941, 7939, 7937, 7935, 7933, 7931, 7929, 7927, 7925, 7923, 7921, 7919, 7917, 7915, 7907, 7905, 7903, 7901, 7899, 7897, 7895, 7893, 7891, 7889, 7887, 7885, 7883, 7881, 7879, 7871, 7869, 7867, 7865, 7863, 7861, 7859, 7857, 7855, 7853, 7851, 7849, 7847, 7845, 7843, 7835, 7833, 7831, 7828, 7825, 7823, 7817, 7815, 7813, 7811, 7809, 7807, 7799, 7797, 7793, 7791, 7789, 7787, 7785, 7783, 7781, 7779, 7777, 7775, 7773, 7771, 7763, 7761, 7759, 7757, 7755, 7753, 7751, 7749, 7747, 7745, 7741, 7739, 7737, 7735, 7727, 7725, 7723, 7721, 7719, 7717, 7713, 7711, 7709, 7707, 7705, 7703, 7701, 7699, 7691, 7687, 7685, 7683, 7681, 7679, 7677, 7675, 7673, 7671, 7669, 7667, 7653, 7651, 7649, 7647, 7645, 7643, 7641, 7637, 7635, 7633, 7631, 7629, 7627, 7619, 7617, 7615, 7613, 7611, 7609, 7607, 7605, 7603, 7601, 7599, 7597, 7595, 7593, 7591, 7547, 7545, 7543, 7541, 7539, 7537, 7533, 7531, 7527, 7525, 7523, 7521, 7519, 7292, 7290, 7288, 7282, 7280, 7278, 7276, 7274, 7272, 7270, 7268, 7226, 7224, 6888, 6884, 6882, 8735, 6405, 6403, 7512, 7479, 7443, 7440, 7407, 7404, 7371, 7368, 7335, 7262, 7198, 7195, 7166, 7134, 7102, 7068, 7065, 7036, 6972, 6937]

You can fix inconsistent FWs using the --refresh argument to the detect_lostruns command

charris@s01 ~/code>lpad -l charris.yaml detect_lostruns --fizzle --refresh

successfully loaded your custom FW_config.yaml!

2016-04-29 11:44:45,397 INFO fw_id 8042 locked. Can’t refresh!

…

Do you have any advice on a manual mongo repair at this point? Or what are my options?

On Thursday, April 21, 2016 at 12:55:34 PM UTC-4, ajain wrote:

The regular tuneup will make sure all the indices are set and updated. You can run this while workflows are running. It might slow things down a bit but nothing bad should happen.

The --full will also run a compact() command on Mongo

Honestly, I doubt this is your problem. But there is no harm in trying

Best,

Anubhav

On Thu, Apr 21, 2016 at 9:42 AM, Chris H [email protected] wrote:

Sorry, I see you said just don’t use --full. whats the difference?

On Thursday, April 21, 2016 at 12:39:55 PM UTC-4, Chris H wrote:

Thanks for the reply, i failed to check email updates so didn’t realize you had written back.

I cannot run tuneup while workflows are running, is that correct?

On Tuesday, April 5, 2016 at 1:45:16 PM UTC-4, Anubhav Jain wrote:

Hi Chris,

I haven’t done much testing on rerunning with 3000 node workflow. We though we had solved a lot of such performance issues prior to v1.0, but it might be that certain problems with large workflows are still there. Note that the rerun command recursively loops through your workflow to see what dependencies might need updating. We tried to minimize the amount of data fetched during this process before, but again there might still be problems.

My suggestion is:

During database downtime, run " lpad admin tuneup --full". This will compact your Mongo database, set/update/refresh indices, etc. I doubt this is the problem, but worth checking. If downtime is not an option, I would remove the “–full” parameter.

If the problem still persists, perhaps serialize and send over one of your workflows (or an equivalent example large workflow that we can run) along with some steps to reproduce. I can check it on my system and try to offer advice or see if someone can work on this.

Best,

Anubhav

On Monday, April 4, 2016 at 8:15:47 AM UTC-7, Chris H wrote:

I should add that python is pegged at 100% cpu, so I guess maybe there is a code issue, or the way I am using fireworks is not recommended in some way

On Monday, April 4, 2016 at 11:14:18 AM UTC-4, Chris H wrote:

Hi,

Is this a known issue? issuing a command like “rerun_fws” with 19 items takes 5-10minutes to return. mongo shows very little activity via top. My database has about 12,000 entries I believe, so even a linear scan of the whole database would seem faster than I am experiencing in this data size range…

defuse on a 3,000 node workflow takes 20+ minutes

Anyone else experience this?

Any tips on where I might be going wrong? I’m mostly wondering if mongo is just set up poorly, I guess.

Thanks

–

You received this message because you are subscribed to the Google Groups “fireworkflows” group.

To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].

To post to this group, send email to [email protected].

Visit this group at https://groups.google.com/group/fireworkflows.

To view this discussion on the web visit https://groups.google.com/d/msgid/fireworkflows/406c3676-60dc-43ad-81c4-9d1f14671f98%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

–

You received this message because you are subscribed to the Google Groups “fireworkflows” group.

To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].

To post to this group, send email to [email protected].

Visit this group at https://groups.google.com/group/fireworkflows.

To view this discussion on the web visit https://groups.google.com/d/msgid/fireworkflows/9950dbc4-427d-4ad4-8aef-a53f58ca1b29%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Chris · May 10, 2016, 4:02pm

Thanks Anubhav,

The manual workflow unlock followed by a detect_lostruns --refresh worked well!

I am unable to duplicate the long rerun behaviour with my test workflow at the moment, not sure what I may have accidentally changed. Is there anything different at a database level when constructing them in python vs yaml? My normal workflows are built in python, but I added a method to produce the yaml after they are generated.

So for the test I am trying to make for you, I started with that yaml.

Can you talk a little more about why everything needs to be at a standstill before I modify the db? Is that caution, or necessity? If a workflow gets into this state, I’d really like to be able to unlock it and refresh it without stopping other workflows. We’re looking at using fireworks for production. Since many of our jobs are very long running (recently, one took 12 days), no downtime is a highly desirable feature for us.

CCH

Anubhav_Jain · May 10, 2016, 4:22pm

Is there anything different at a database level when constructing them in python vs yaml?

No - Python to YAML will just convert back to Python in the same way. You can verify by comparing them if you want, but there should be no difference.

Can you talk a little more about why everything needs to be at a standstill before I modify the db? Is that caution, or necessity?

That is just an abundance of caution, especially for the first time you tried it. If you are confident that there are no other Fireworks in that particular workflow that you are trying to unlock that are currently running, then you can unlock that workflow without stopping everything else. So you should definitely confirm that nothing in that particular workflow is running, but don’t need to care about other workflows running at the same time when you force an unlock.

I am hoping that the “unlock” thing is not needed too often. If it is, you might consider turning on the WFLOCK_EXPIRATION_KILL parameter in the FW config (see docs). This is a bit more dangerous since it will automatically kill the lock if it can’t acquire it for 3 minutes (WFLOCK_EXPIRATION_SECS). This assumes that the database update will take 3 minutes or less, otherwise there will be problems like inconsistent workflows.

Best,

Anubhav

···

On Tue, May 10, 2016 at 9:02 AM, Chris H [email protected] wrote:

Thanks Anubhav,

The manual workflow unlock followed by a detect_lostruns --refresh worked well!

I am unable to duplicate the long rerun behaviour with my test workflow at the moment, not sure what I may have accidentally changed. Is there anything different at a database level when constructing them in python vs yaml? My normal workflows are built in python, but I added a method to produce the yaml after they are generated.

So for the test I am trying to make for you, I started with that yaml.

Can you talk a little more about why everything needs to be at a standstill before I modify the db? Is that caution, or necessity? If a workflow gets into this state, I’d really like to be able to unlock it and refresh it without stopping other workflows. We’re looking at using fireworks for production. Since many of our jobs are very long running (recently, one took 12 days), no downtime is a highly desirable feature for us.

CCH

–

You received this message because you are subscribed to the Google Groups “fireworkflows” group.

To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].

To post to this group, send email to [email protected].

Visit this group at https://groups.google.com/group/fireworkflows.

To view this discussion on the web visit https://groups.google.com/d/msgid/fireworkflows/da5aaa06-70e1-480b-ad3f-af0bfa8df19e%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.