Thanks Anubhav.
I am running a 3,000-node workflow that mirrors a real workflow of ours in terms of the number of nodes and links, but with every task replaced by a call to "time". I'll verify that it shows the same problem and forward it to you.
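For reference, the synthetic workflow is generated roughly along these lines (simplified: the real link structure mirrors our production workflow, so the simple chain and the names below are just for illustration):

from fireworks import Firework, Workflow, ScriptTask, LaunchPad

# 3,000 Fireworks whose only task is a timed no-op, standing in for real tasks
fws = [Firework(ScriptTask.from_str("time sleep 1"), name="fw_%d" % i)
       for i in range(3000)]
# parent -> children links; a plain chain here, the real topology differs
links = {fws[i]: [fws[i + 1]] for i in range(len(fws) - 1)}
wf = Workflow(fws, links_dict=links, name="synthetic_3000")

lpad = LaunchPad.from_file("charris.yaml")
lpad.add_wf(wf)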
In the meantime, I have a related problem:
I was trying to restart 500 jobs that were killed by an out-of-disk failure. I fixed the space issue and wanted to rerun them.
The restart took so long (8+ hrs) that my SSH connection dropped overnight.
Now I have inconsistencies (I'm guessing the command was holding a WFLock when it was killed by the terminal disconnect) that the detect_lostruns command seems unable to recover from:
lpad -l charris.yaml detect_lostruns --fizzle
successfully loaded your custom FW_config.yaml!
2016-04-29 11:39:27,584 DEBUG Detected 1 lost launches: [5395]
2016-04-29 11:39:27,584 INFO Detected 1 lost FWs: [2934]
2016-04-29 11:39:27,585 INFO Detected 231 inconsistent FWs: [8042, 8040, 8038, 8036, 8034, 8032, 8030, 8028, 8026, 8024, 8016, 8014, 8012, 8010, 8008, 8006, 8004, 8002, 8000, 7997, 7996, 7994, 7992, 7990, 7981, 7979, 7977, 7975, 7973, 7971, 7969, 7967, 7965, 7963, 7961, 7959, 7957, 7955, 7953, 7945, 7943, 7941, 7939, 7937, 7935, 7933, 7931, 7929, 7927, 7925, 7923, 7921, 7919, 7917, 7915, 7907, 7905, 7903, 7901, 7899, 7897, 7895, 7893, 7891, 7889, 7887, 7885, 7883, 7881, 7879, 7871, 7869, 7867, 7865, 7863, 7861, 7859, 7857, 7855, 7853, 7851, 7849, 7847, 7845, 7843, 7835, 7833, 7831, 7828, 7825, 7823, 7817, 7815, 7813, 7811, 7809, 7807, 7799, 7797, 7793, 7791, 7789, 7787, 7785, 7783, 7781, 7779, 7777, 7775, 7773, 7771, 7763, 7761, 7759, 7757, 7755, 7753, 7751, 7749, 7747, 7745, 7741, 7739, 7737, 7735, 7727, 7725, 7723, 7721, 7719, 7717, 7713, 7711, 7709, 7707, 7705, 7703, 7701, 7699, 7691, 7687, 7685, 7683, 7681, 7679, 7677, 7675, 7673, 7671, 7669, 7667, 7653, 7651, 7649, 7647, 7645, 7643, 7641, 7637, 7635, 7633, 7631, 7629, 7627, 7619, 7617, 7615, 7613, 7611, 7609, 7607, 7605, 7603, 7601, 7599, 7597, 7595, 7593, 7591, 7547, 7545, 7543, 7541, 7539, 7537, 7533, 7531, 7527, 7525, 7523, 7521, 7519, 7292, 7290, 7288, 7282, 7280, 7278, 7276, 7274, 7272, 7270, 7268, 7226, 7224, 6888, 6884, 6882, 8735, 6405, 6403, 7512, 7479, 7443, 7440, 7407, 7404, 7371, 7368, 7335, 7262, 7198, 7195, 7166, 7134, 7102, 7068, 7065, 7036, 6972, 6937]
You can fix inconsistent FWs using the --refresh argument to the detect_lostruns command
charris@s01 ~/code>lpad -l charris.yaml detect_lostruns --fizzle --refresh
successfully loaded your custom FW_config.yaml!
2016-04-29 11:44:45,397 INFO fw_id 8042 locked. Can’t refresh!
…
Do you have any advice on a manual mongo repair at this point? Or what are my options?
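To frame the question, this is the kind of manual fix I had in mind. I'm only guessing that WFLock works by setting a "locked" field on the workflow document in the workflows collection, so the query below is an assumption on my part, not something I've verified against the code:

from pymongo import MongoClient

# Assumption: the stale lock is a 'locked' field on the workflow document
# whose 'nodes' list contains the stuck fw_id (8042 from the log above).
db = MongoClient("mongodb://s01:27017")["fireworks"]   # host/db illustrative
result = db.workflows.update_many(
    {"nodes": 8042, "locked": {"$exists": True}},
    {"$unset": {"locked": ""}})
print("cleared %d lock(s)" % result.modified_count)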
On Thursday, April 21, 2016 at 12:55:34 PM UTC-4, ajain wrote:
The regular tuneup will make sure all the indices are set and updated. You can run this while workflows are running. It might slow things down a bit but nothing bad should happen.
The --full will also run a compact() command on Mongo
Honestly, I doubt this is your problem. But there is no harm in trying
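For context, the --full pass is roughly equivalent to something like this on the Mongo side (database and collection names here are illustrative; the tuneup handles it for you):

from pymongo import MongoClient

db = MongoClient()["fireworks"]
for coll in ("fireworks", "workflows", "launches"):
    db.command("compact", coll)   # reclaims space; best done during downtime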
Best,
Anubhav
On Thu, Apr 21, 2016 at 9:42 AM, Chris H [email protected] wrote:
Sorry, I see you said to just not use --full. What's the difference?
On Thursday, April 21, 2016 at 12:39:55 PM UTC-4, Chris H wrote:
Thanks for the reply; I hadn't checked email updates, so I didn't realize you had written back.
I cannot run tuneup while workflows are running, is that correct?
On Tuesday, April 5, 2016 at 1:45:16 PM UTC-4, Anubhav Jain wrote:
Hi Chris,
I haven't done much testing on rerunning a 3,000-node workflow. We thought we had solved a lot of these performance issues prior to v1.0, but certain problems with large workflows may still remain. Note that the rerun command recursively loops through your workflow to see which dependencies might need updating. We previously tried to minimize the amount of data fetched during this process, but again there might still be problems.
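To illustrate the kind of traversal involved (this is not the actual FireWorks code, just a sketch of why the cost can grow with workflow size):

# Sketch only: a rerun walks every descendant of the rerun fw_id, and each
# visited node may trigger a state refresh and extra database round trips.
def descendants(links, fw_id, seen=None):
    seen = set() if seen is None else seen
    for child in links.get(fw_id, []):
        if child not in seen:
            seen.add(child)
            descendants(links, child, seen)
    return seen

links = {1: [2, 3], 2: [4], 3: [4], 4: []}   # toy parent -> children map
print(descendants(links, 1))                  # {2, 3, 4}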
My suggestion is:
- During database downtime, run "lpad admin tuneup --full". This will compact your Mongo database and set/update/refresh indices, etc. I doubt this is the problem, but it is worth checking. If downtime is not an option, I would remove the "--full" parameter (see the sketch after this list).
- If the problem still persists, perhaps serialize and send over one of your workflows (or an equivalent example large workflow that we can run) along with some steps to reproduce. I can check it on my system and try to offer advice or see if someone can work on this.
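As for the sketch mentioned above: the non-"--full" part of the tuneup is essentially index maintenance, along the lines below (the exact index set is from memory and only illustrative):

from pymongo import MongoClient

db = MongoClient()["fireworks"]
for field in ("fw_id", "state", "updated_on"):
    db.fireworks.create_index(field)
for field in ("nodes", "state", "updated_on"):
    db.workflows.create_index(field)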
Best,
Anubhav
On Monday, April 4, 2016 at 8:15:47 AM UTC-7, Chris H wrote:
I should add that Python is pegged at 100% CPU, so I guess there may be a code issue, or the way I am using FireWorks is not recommended in some way.
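If it would help, I can profile where the time goes; I was planning something like this through the Python API (the fw_ids are placeholders, and charris.yaml is my launchpad file):

import cProfile
import pstats
from fireworks import LaunchPad

lpad = LaunchPad.from_file("charris.yaml")

# profile a few reruns, then print the 20 most expensive calls by cumulative time
cProfile.run("for i in (101, 102, 103): lpad.rerun_fw(i)", "rerun.prof")
pstats.Stats("rerun.prof").sort_stats("cumulative").print_stats(20)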
On Monday, April 4, 2016 at 11:14:18 AM UTC-4, Chris H wrote:
Hi,
Is this a known issue? Issuing a command like "rerun_fws" with 19 items takes 5-10 minutes to return. Mongo shows very little activity via top. My database has about 12,000 entries, I believe, so even a linear scan of the whole database should be faster than what I am experiencing at this data size…
Running defuse on a 3,000-node workflow takes 20+ minutes.
Anyone else experience this?
Any tips on where I might be going wrong? I’m mostly wondering if mongo is just set up poorly, I guess.
Thanks