Hello Anubhav,
Thanks for the answer. I finally found the time to do as suggested, using a job that was killed a few days ago after exceeding the maximum walltime of 4 days.
Issue 1:
Here is the MOAB job log (/work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/block_2019-06-30-13-07-21-802466/launcher_2019-07-13-22-54-14-628683/NEMO_AU_111_r__25_An.e6012657):
- cd /work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/block_2019-06-30-13-07-21-802466/launcher_2019-07-13-22-54-14-628683
- rlaunch -w /home/fr/fr_fr/fr_jh1130/.fireworks/nemo_queue_worker.yaml -l /home/fr/fr_fr/fr_jh1130/.fireworks/fireworks_mongodb_auth.yaml singleshot --offline --fw_id 15514
=>> PBS: job killed: walltime 345642 exceeded limit 345600
The FW ID is 15514, and the content of /work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/block_2019-06-30-13-07-21-802466/launcher_2019-07-13-22-54-14-628683/FW_ping.json is:
{"ping_time": "2019-07-17T22:54:51.000760"}
This last update agrees very well with the maximum walltime. /work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/block_2019-06-30-13-07-21-802466/launcher_2019-07-13-22-54-14-628683/FW_offline.json shows that the run started exactly four days earlier:
{"launch_id": 11789, "started_on": "2019-07-13T22:54:49.124427", "checkpoint": {"_task_n": 0, "_all_stored_data": {}, "_all_update_spec": {}, "_all_mod_spec": []}}
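As a quick sanity check, the two timestamps can be compared against the walltime limit. This is only a sketch that uses the values quoted above; the variable names are made up for illustration:

```python
from datetime import datetime

# Values copied from FW_offline.json, FW_ping.json and the PBS kill message.
started_on = datetime.fromisoformat("2019-07-13T22:54:49.124427")
last_ping = datetime.fromisoformat("2019-07-17T22:54:51.000760")
walltime_limit_s = 345600    # "exceeded limit 345600"
walltime_at_kill_s = 345642  # "walltime 345642"

# Time between the recorded start of the launch and its last ping.
elapsed_s = (last_ping - started_on).total_seconds()

print(f"limit in days:          {walltime_limit_s / 86400:.1f}")
print(f"start -> last ping [s]: {elapsed_s:.3f}")
print(f"overshoot at kill [s]:  {walltime_at_kill_s - walltime_limit_s}")
```

The last ping falls within a few seconds of the four-day mark, counted from started_on, so the job was evidently still pinging right up to the point where PBS killed it.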
A manual check shows no other files in this launchdir have been touched afterwards:
$ ls -lht
total 8,0G
-rw------- 1 fr_jh1130 fr_fr 28K 18. Jul 00:55 NEMO_AU_111_r__25_An.e6012657
-rw------- 1 fr_jh1130 fr_fr 43 18. Jul 00:54 FW_ping.json
-rw------- 1 fr_jh1130 fr_fr 663K 18. Jul 00:52 log.lammps
-rw------- 1 fr_jh1130 fr_fr 83M 18. Jul 00:52 default.mpiio.restart1
…
However, the “updated_on” entry in the “launches” collection simply corresponds to the current time (see updated_on in state_history[1]):
Am I correct in assuming that the repeatedly running lpad recover_offline sets this time whenever it reads FW_offline.json?
That is how I read the recover_offline code: https://github.com/materialsproject/fireworks/blob/df8374bc3358a826eaa258de333ff6a46d4f54fa/fireworks/core/launchpad.py#L1728-L1730
As you can see, the type is “String”, not a datetime type.
Is that the expected behavior? Or should lpad recover_offline leave the updated_on key untouched if no new ping has been recorded in FW_ping.json?
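The suspected mechanism can be sketched as follows, assuming that detect_lostruns essentially compares the age of updated_on against the expiration time. The function is_lost and the thresholds are made up for illustration and are not the FireWorks implementation:

```python
from datetime import datetime, timedelta


def is_lost(updated_on, now, expiration_secs):
    """Hypothetical lost-run test: no update for longer than expiration_secs."""
    return (now - updated_on).total_seconds() > expiration_secs


# Timestamps taken from this thread.
now = datetime.fromisoformat("2019-07-24T12:01:26.363237")
last_real_ping = datetime.fromisoformat("2019-07-17T22:54:51.000760")
expiration_secs = 4 * 3600  # illustrative threshold

# Case A: updated_on still holds the last real ping -> the run is detected.
lost_if_untouched = is_lost(last_real_ping, now, expiration_secs)

# Case B: the recovery loop rewrote updated_on five minutes ago -> not detected.
refreshed = now - timedelta(minutes=5)
lost_if_refreshed = is_lost(refreshed, now, expiration_secs)

# Case C: only a threshold of a few seconds catches the refreshed run,
# and such a threshold would flag healthy runs as well.
lost_with_tiny_threshold = is_lost(refreshed, now, 10)

print(lost_if_untouched, lost_if_refreshed, lost_with_tiny_threshold)
```

If this picture is right, a dead run can never exceed an expiration time longer than the interval of the recovery loop.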
Issue 2
Wouldn’t a quick solution be to always “forget” the offline run internally, via the already existing “lpad.forget_offline” method, whenever “lpad detect_lostruns --fizzle / --rerun” is called?
I don’t see any situation where one would want to keep an offline run that has already been explicitly identified as “dead” available to the “recover_offline” functionality.
Best regards,
Johannes
For completeness, the corresponding lpad get_fws output:
lpad get_fws -i 15514 -d all
{
"spec": {
"_category": "nemo_queue_offline",
"_files_in": {
"coeff_file": "coeff.input",
"data_file": "datafile.lammps",
"input_header": "lmp_header.input",
"input_production": "lmp_production.input"
},
"_files_out": {
"ave_file": "thermo_ave.out",
"data_file": "default.lammps",
"log_file": "log.lammps",
"ndx_file": "groups.ndx",
"traj_file": "default.nc"
},
"_queueadapter": {
"nodes": 16,
"ppn": 20,
"queue": null,
"walltime": "96:00:00"
},
"_tasks": [
{
"_fw_name": "CmdTask",
"cmd": "lmp",
"fizzle_bad_rc": true,
"opt": [
"-in lmp_production.input",
"-v coeffInfile coeff.input",
"-v coeffOutfile coeff.input.transient",
"-v compute_group_properties 1",
"-v compute_interactions 0",
"-v dataFile datafile.lammps",
"-v dilate_solution_only 1",
"-v freeze_substrate 0",
"-v freeze_substrate_layer 14.0",
"-v has_indenter 1",
"-v rigid_indenter_core_radius 12.0",
"-v constant_indenter_velocity -1e-06",
"-v mpiio 1",
"-v netcdf_frequency 50000",
"-v productionSteps 17500000",
"-v pressureP 1.0",
"-v pressurize_z_only 1",
"-v pressurize_solution_only 0",
"-v reinitialize_velocities 0",
"-v read_groups_from_file 0",
"-v rigid_indenter 0",
"-v restrained_indenter 0",
"-v restart_frequency 50000",
"-v store_forces 1",
"-v surfactant_name SDS",
"-v temperatureT 298.0",
"-v temper_solid_only 1",
"-v temper_substrate_only 0",
"-v thermo_frequency 5000",
"-v thermo_average_frequency 5000",
"-v use_barostat 0",
"-v use_berendsen_bstat 0",
"-v use_dpd_tstat 1",
"-v use_eam 1",
"-v use_ewald 1",
"-v write_coeff 1",
"-v write_coeff_to_datafile 0",
"-v write_groups_to_file 1",
"-v coulomb_cutoff 8.0",
"-v ewald_accuracy 0.0001",
"-v neigh_delay 2",
"-v neigh_every 1",
"-v neigh_check 1",
"-v skin_distance 3.0"
],
"stderr_file": "std.err",
"stdout_file": "std.out",
"store_stderr": true,
"store_stdout": true,
"use_shell": true
}
],
"_trackers": [
{
"filename": "log.lammps",
"nlines": 25
}
],
"metadata": {
"barostat_damping": 10000.0,
"ci_preassembly": "at polar heads",
"compute_group_properties": 1,
"constant_indenter_velocity": -1e-06,
"constant_indenter_velocity_unit": "Ang_per_fs",
"coulomb_cutoff": 8.0,
"coulomb_cutoff_unit": "Ang",
"counterion": "NA",
"ewald_accuracy": 0.0001,
"force_field": {
"solution_solution": "charmm36-jul2017",
"substrate_solution": "interface_ff_1_5",
"substrate_substrate": "Au-Grochola-JCP05-units-real.eam.alloy"
},
"frozen_sb_layer_thickness": 14.0,
"frozen_sb_layer_thickness_unit": "Ang",
"indenter": {
"crystal_plane": 111,
"equilibration_time_span": 50,
"equilibration_time_span_unit": "ps",
"initial_radius": 25,
"initial_radius_unit": "Ang",
"initial_shape": "sphere",
"lammps_units": "real",
"melting_final_temperature": 1800,
"melting_time_span": 10,
"melting_time_span_unit": "ns",
"minimization_ftol": 1e-05,
"minimization_ftol_unit": "kcal",
"natoms": 3873,
"orientation": "111 facet facing negative z",
"potential": "Au-Grochola-JCP05-units-real.eam.alloy",
"quenching_time_span": 100,
"quenching_time_span_unit": "ns",
"quenching_time_step": 5,
"quenching_time_step_unit": "fs",
"substrate": "AU",
"temperature": 298,
"temperature_unit": "K",
"time_step": 2,
"time_step_unit": "fs",
"type": "AFM tip"
},
"langevin_damping": 1000.0,
"machine": "NEMO",
"mode": "TRIAL",
"mpiio": 1,
"neigh_check": 1,
"neigh_delay": 2,
"neigh_every": 1,
"netcdf_frequency": 50000,
"pbc": 111,
"pressure": 1,
"pressure_unit": "atm",
"production_steps": 17500000,
"restrained_sb_layer_thickness": null,
"restrained_sb_layer_thickness_unit": null,
"sb_area": 2.25e-16,
"sb_area_unit": "m^2",
"sb_base_length": 150,
"sb_base_length_unit": "Ang",
"sb_crystal_plane": 111,
"sb_crystal_plane_multiples": [
52,
90,
63
],
"sb_in_dist": 30.0,
"sb_in_dist_unit": "Ang",
"sb_lattice_constant": 4.075,
"sb_lattice_constant_unit": "Ang",
"sb_measures": [
1.49836e-08,
1.49725e-08,
1.47828e-08
],
"sb_measures_unit": "m",
"sb_multiples": [
52,
30,
21
],
"sb_name": "AU_111_150Ang_cube",
"sb_natoms": 196560,
"sb_normal": 2,
"sb_shape": "cube",
"sb_thickness": 1.5e-08,
"sb_thickness_unit": "m",
"sb_volume": 3.375e-23,
"sb_volume_unit": "m^3",
"sf_concentration": 0.0068,
"sf_concentration_unit": "M",
"sf_nmolecules": 646,
"sf_preassembly": "monolayer",
"skin_distance": 3.0,
"skin_distance_unit": "Ang",
"solvent": "H2O",
"state": "production",
"step": "production_nemo_trial_with_dpd_tstat",
"substrate": "AU",
"surfactant": "SDS",
"sv_density": 997,
"sv_density_unit": "kg m^-3",
"sv_preassembly": "random",
"system_name": "646_SDS_monolayer_on_AU_111_150Ang_cube_with_AU_111_r_25Ang_indenter_at_-1e-06_Ang_per_fs_approach_velocity",
"temperature": 298,
"temperature_unit": "K",
"thermo_average_frequency": 5000,
"thermo_frequency": 5000,
"type": "AFM",
"use_barostat": 0,
"use_dpd_tstat": 1,
"use_eam": 1,
"use_ewald": 1,
"workflow_creation_date": "2019-07-13-22:53"
},
"_files_prev": {
"coeff_file": "/work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/launcher_2019-07-13-22-53-59-844042/coeff_hybrid.input",
"input_header": "/work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/launcher_2019-07-13-22-53-59-844042/lmp_header.input",
"input_production": "/work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/launcher_2019-07-13-22-53-59-844042/lmp_production.input",
"data_file": "/work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/launcher_2019-07-13-22-54-00-115840/default.lammps",
"ndx_file": "/work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/launcher_2019-07-13-22-54-00-115840/groups.ndx"
}
},
"fw_id": 15514,
"created_on": "2019-07-13T22:53:09.213733",
"updated_on": "2019-07-24T12:01:26.321000",
"launches": [
{
"fworker": {
"name": "nemo_queue_worker",
"category": [
"nemo_queue_offline"
],
"query": "{}",
"env": {
"lmp": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load lammps/16Mar18-gnu-7.3-openmpi-3.1-colvars-09Feb19; mpirun {MPIRUN_OPTIONS} lmp",
"exchange_substrate.py": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools ovitos; exchange_substrate.py",
"extract_bb.py": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools/12Mar19-python-2.7; extract_bb.py",
"extract_indenter_nonindenter_forces_from_netcdf.py": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools/11Jul19; extract_indenter_nonindenter_forces_from_netcdf.py",
"extract_property.py": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools ovitos; extract_property.py",
"extract_thermo.sh": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools; extract_thermo.sh",
"join_thermo.py": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools; join_thermo.py",
"merge.py": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools/12Mar19-python-2.7; merge.py",
"ncfilter.py": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools/11Jul19; mpirun ${MPIRUN_OPTIONS} ncfilter.py",
"ncjoin.py": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools; ncjoin.py",
"pizza.py": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools/12Mar19-python-2.7; pizza.py",
"strip_comments.py": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools/12Mar19-python-2.7; strip_comments.py",
"to_hybrid.py": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools; to_hybrid.py",
"vmd": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load vmd/1.9.3-text; vmd",
"smbsync.py": "module purge; module use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools; smbsync.py"
}
},
"fw_id": 15514,
"launch_dir": "/work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/block_2019-06-30-13-07-21-802466/launcher_2019-07-13-22-54-14-628683",
"host": "login2.nemo.privat",
"ip": "10.16.44.2",
"trackers": [
{
"filename": "log.lammps",
"nlines": 25,
"allow_zipped": false
}
],
"action": null,
"state": "RUNNING",
"state_history": [
{
"state": "RESERVED",
"created_on": "2019-07-13T22:54:14.596648",
"updated_on": "2019-07-13T22:54:14.596655",
"reservation_id": "6012657"
},
{
"state": "RUNNING",
"created_on": "2019-07-13T22:54:49.124427",
"updated_on": "2019-07-24T12:01:26.363237",
"checkpoint": {
"_task_n": 0,
"_all_stored_data": {},
"_all_update_spec": {},
"_all_mod_spec": []
}
}
],
"launch_id": 11789
}
],
"state": "RUNNING",
"name": "NEMO, AU 111 r = 25 Ang indenter at -1e-06 Ang_per_fs approach velocity on 646 SDS monolayer on AU 111 150 Ang cube substrate, LAMMPS production"
}
On Wednesday, June 5, 2019 at 03:02:51 UTC+2, Anubhav Jain wrote:
Hi Johannes,
Thanks for reporting these issues. We do not run offline mode ourselves, so sometimes there are issues that we are unaware of.
Regarding issue 1:
For jobs that are stuck in the RUNNING state, the crucial thing that needs to be correct in order for “detect_lostruns” to work properly is the timestamp on the last ping of the launch. Could you try to check the following (let me know if you need help with this process):
- Identify a job that has this problem, and where you’ve already run the recover_offline() command on it
- Go to the directory where that job ran
- There should be a file called FW_ping.json. Look inside and note down the “ping_time” of that file
- There should also be a file called FW_offline.json. Look inside and note down the “launch_id” in that file
- Next, we want to check the database for consistency. You want to search your “launches” collection (either through MongoDB itself, or through pymongo, or through the “launches” collection in the LaunchPad object) for the launch id that you noted in the previous step. In the document for that launch id, you should see a key called “state_history”. In there should be an entry where you see “updated_on”. See screenshot for example …
- Now the two things for you to confirm:
A: does the updated_on timestamp match the FW_ping.json “ping_time” that you noted earlier? If not, is the timestamp later or earlier?
B: is the type of the updated_on timestamp a String type (as opposed to a datetime type)?
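The two checks above can be sketched in a few lines of Python. This is only an illustration: the helper compare_ping_to_db is hypothetical, and the two dictionaries are filled with the values reported in this thread. In practice the launch document would come from the “launches” collection (for example via pymongo's find_one with a launch_id filter) and the ping from FW_ping.json:

```python
from datetime import datetime


def compare_ping_to_db(ping_doc, launch_doc):
    """Answer checks A and B for one launch (hypothetical helper)."""
    ping_time = datetime.fromisoformat(ping_doc["ping_time"])

    # Take the most recent RUNNING entry of the state history.
    running = [h for h in launch_doc["state_history"] if h["state"] == "RUNNING"]
    raw = running[-1]["updated_on"]

    # Check B: is the stored value a string or a datetime?
    db_time = datetime.fromisoformat(raw) if isinstance(raw, str) else raw

    # Check A: does it match the ping, or is it later / earlier?
    if db_time > ping_time:
        relation = "later"
    elif db_time < ping_time:
        relation = "earlier"
    else:
        relation = "equal"
    return {
        "updated_on_type": type(raw).__name__,
        "relation": relation,
        "offset": db_time - ping_time,
    }


# Example data copied from this thread.
ping_doc = {"ping_time": "2019-07-17T22:54:51.000760"}
launch_doc = {
    "launch_id": 11789,
    "state_history": [
        {"state": "RESERVED", "updated_on": "2019-07-13T22:54:14.596655"},
        {"state": "RUNNING", "updated_on": "2019-07-24T12:01:26.363237"},
    ],
}

result = compare_ping_to_db(ping_doc, launch_doc)
print(result)
```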
Regarding issue 2:
I think this is a separate issue. When you run “lpad detect_lostruns --fizzle” the database knows that the job is FIZZLED, but the filesystem information in FW_offline.json still thinks the job is running / completed / etc. Thus when running recover_offline() again, the file system information overrides the DB information and you end up forgetting that you decided to fizzle the job.
Unfortunately, this does mean that at the current stage you need to manually “forget” about the information on the filesystem any time you want to change the state of an offline Firework using one of the Launchpad commands. I’ve added an issue about this on Github (https://github.com/materialsproject/fireworks/issues/326), but unfortunately don’t have a quick fix at the moment.
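Until then, the manual workaround could be scripted roughly as follows. This is a minimal sketch, not a tested recipe: it assumes that LaunchPad.detect_lostruns accepts expiration_secs, fizzle and query and returns the lost Firework ids as the second element of its result, and that forget_offline takes a Firework id when launch_mode=False. The stub class is a hypothetical stand-in so that the example runs without a database:

```python
def fizzle_and_forget(lp, expiration_secs, query=None):
    """Fizzle lost runs, then drop their offline records.

    lp is expected to behave like a FireWorks LaunchPad (assumption).
    """
    result = lp.detect_lostruns(
        expiration_secs=expiration_secs, fizzle=True, query=query
    )
    lost_fw_ids = list(result[1])  # assumed: second element holds the FW ids
    for fw_id in lost_fw_ids:
        # Without this, the next recover_offline would mark the run RUNNING again.
        lp.forget_offline(fw_id, launch_mode=False)
    return lost_fw_ids


class StubLaunchPad:
    """Hypothetical stand-in for a LaunchPad, for demonstration only."""

    def __init__(self):
        self.forgotten = []

    def detect_lostruns(self, expiration_secs, fizzle=False, query=None):
        # Pretend that launch 11789 of Firework 15514 was found to be lost.
        return [11789], [15514], []

    def forget_offline(self, launchid_or_fwid, launch_mode=True):
        self.forgotten.append((launchid_or_fwid, launch_mode))


stub = StubLaunchPad()
lost = fizzle_and_forget(stub, expiration_secs=4 * 3600)
print(lost, stub.forgotten)
```

With a real LaunchPad, any loop that repeatedly calls recover_offline should be paused while this runs, so that it cannot re-read FW_offline.json between the two steps.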
On Friday, May 31, 2019 at 5:30:07 AM UTC-7, Johannes Hörmann wrote:
Dear Fireworks Team,
In the course of my PhD, I have been using FireWorks for about a year to manage workflows on different computing resources, most importantly on the supercomputers NEMO in Freiburg and JUWELS in Jülich. While NEMO uses the queueing system MOAB/Torque, JUWELS employs SLURM. On both machines, I submit jobs via FireWorks’ offline mode in order to be independent of a stable connection between the compute nodes and MongoDB (which would have to be tunneled via the login nodes and is not reliable). On the login nodes, I usually have an infinite loop running the command
lpad -l "{FW_CONFIG_PREFIX}/fireworks_mongodb_auth.yaml" recover_offline -w "{QLAUNCH_FWORKER_FILE}"
every couple of minutes, checking for job state updates.
What I became aware of over time is that on the JUWELS/SLURM machine, offline jobs fizzle properly, even when they are cancelled because the walltime runs out. I assume that SLURM sends a proper signal to rlaunch and allows some clean-up work to be done before forcefully killing the job.
On the NEMO/MOAB machine, however, it seems the job is killed immediately when the walltime expires, and it stays marked as “running” indefinitely. I have to manually use “lpad detect_lostruns” to fizzle the Firework, and here I want to point out two issues:
The first issue is that selecting the “dead” runs via the “--time” option of “lpad detect_lostruns” often does not work as expected. Even if a run has been “dead” for days, “detect_lostruns” may not recognize it as “lost”, and I have to go down to an expiration time of a few seconds to have the lost run(s) show up. But then, of course, other healthy runs appear in the list as well. Could this behavior be related to the “recover” loop running continuously in the background, as described above?
The second, related issue is that even if I mark a lost run on the NEMO/MOAB machine as “fizzled” with “lpad detect_lostruns --fizzle” (and perhaps a suitable --query to narrow the selection), it gets marked as “running” again by the next call of “lpad recover_offline” as shown above. The only way I can avoid that behavior is to stop the automated recovery loop and execute the Python command “lp.forget_offline(accordingFireWorksID,launch_mode=False)”. Only then does the next “recover_offline” leave the run in question marked as “fizzled”.
I have observed these issues mostly for Fireworks 1.8.7, but a few days ago I updated to 1.9.1 and I believe they still persist. Would you have an idea about the source of those two (probably related?) issues?
Best regards,
Johannes Hörmann