Running FireWorks on Google Cloud Platform?

cosmo · January 3, 2020, 12:49am

Does anyone have experience running FireWorks workflows on Google Cloud Platform (GCP)?

The only thing I see online about FireWorks in any cloud is the MongoDB server. [It’s straightforward to install MongoDB on Google Compute Engine (GCE) or use Google’s managed MongoDB service.]

I’m helping a team port a FireWorks scientific workflow from a SLURM cluster to Google Compute Engine. So far, FireWorks seems quite adaptable and it partitions responsibilities very neatly for this.

Key differences about running in GCE:

GCE doesn’t seem to have a suitable job queue. (There’s the App Engine task queue, the Cloud Run platform for load balancing among HTTP workers using up to 2GB RAM, Kubernetes Engine, and such, but nothing seems suitable.)
I don’t think we need a job queue. You can create as many workers as you want as GCE VMs (virtual machines). Each worker can rapidfire launch rockets, then shut down when idle for long enough.
While you can set up an NFS shared file service, it costs literally 1/10 as much to use Google Cloud Storage (GCS) and GCS scales up better. But GCS is a blob store, not a file service, so e.g. you read or write an entire file atomically, with no support for multiple simultaneous accessors.
I think it’s best to fetch inputs from GCS, run a rocket, then store its outputs back to GCS. We’ve worked out our Firework inputs & outputs.
- The alternative is to use gcsfuse to mount a storage directory, but it has a bunch of caveats. For one thing, GCS doesn’t have directory nodes. If you create fake directory entries, which are empty files with names ending in /, that greatly speeds up gcsfuse.
Rather than trying to load each developer’s experiment application code and environment (pip packages, Linux apt packages, config files, environment variables) onto a worker disk image before creating the workers, or mounting some of that via NFS or gcsfuse, a clean approach is to package the payload environment and application code into a Docker image. Each rocket would then run a Firetask that does this:
- pull a Docker image and start a container
- fetch needed input files from GCS
- map those files into the container’s file space
- run a command line in the container that runs the payload Firetask
- store output files to GCS
- delete the container
- delete the input & output files.
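
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the eventual Firetask: the helper names (`payload_commands`, `run_payload`), the `/inputs` and `/outputs` mount points, and the choice to shell out to the `docker` and `gsutil` command line tools are all hypothetical, and it assumes Python 3. The container is deleted by `docker run --rm`, and the local input and output files disappear with the temporary directory.

```python
import subprocess
import tempfile
from pathlib import Path


def payload_commands(image, command, gcs_inputs, gcs_output_prefix, workdir):
    """Return the external commands for one payload run, in order.

    Hypothetical helper: gcs_inputs maps a local file name to a gs:// URL,
    and gcs_output_prefix is the gs:// destination for the outputs.
    """
    inputs = Path(workdir) / "inputs"
    outputs = Path(workdir) / "outputs"
    cmds = [["docker", "pull", image]]
    # Fetch each input object from GCS into the local inputs directory.
    for name, url in gcs_inputs.items():
        cmds.append(["gsutil", "cp", url, str(inputs / name)])
    # Map both directories into the container and run the payload there.
    # --rm deletes the container when the command exits.
    cmds.append([
        "docker", "run", "--rm",
        "-v", f"{inputs}:/inputs:ro",
        "-v", f"{outputs}:/outputs",
        image, *command,
    ])
    # Copy the outputs directory recursively back to GCS.
    cmds.append(["gsutil", "cp", "-r", str(outputs), gcs_output_prefix])
    return cmds


def run_payload(image, command, gcs_inputs, gcs_output_prefix,
                run=subprocess.check_call):
    """Run one payload; `run` is injectable so the flow can be dry-run."""
    # The temporary directory (inputs and outputs) is deleted on exit.
    with tempfile.TemporaryDirectory() as workdir:
        (Path(workdir) / "inputs").mkdir()
        (Path(workdir) / "outputs").mkdir()
        for cmd in payload_commands(image, command, gcs_inputs,
                                    gcs_output_prefix, workdir):
            run(cmd)
```

Passing `run=print` (or a list’s `append`) shows the commands without needing Docker or GCS access, which is handy for checking the plan before running it on a worker.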

I’m building pieces and have yet to run into any problems, just a bunch to learn and implement. The code so far will create workers that launch rockets rapidfire from a requested DB then shut themselves down. It merrily runs tutorial Firetasks. Soon I’ll implement the Firetask that runs Docker payloads.
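
For anyone curious, the “launch rockets, then shut down when idle” behavior boils down to a loop like the one below. It is a minimal sketch, not the actual worker code: `run_until_idle` and `launch_ready` are hypothetical names, and `launch_ready` stands in for whatever launches the currently ready rockets (for example, a wrapper around FireWorks’ rapidfire launcher) and reports how many it ran. The clock and sleep functions are parameters only so the loop can be exercised without waiting.

```python
import time


def run_until_idle(launch_ready, idle_timeout=600.0, poll_interval=30.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Call launch_ready() until no work appears for idle_timeout seconds.

    launch_ready is a hypothetical callable that launches every rocket that
    is ready right now and returns how many it launched.
    Returns the total number of launches.
    """
    total = 0
    last_busy = clock()
    while True:
        launched = launch_ready()
        if launched:
            total += launched
            last_busy = clock()  # reset the idle timer after real work
        elif clock() - last_busy >= idle_timeout:
            return total  # idle long enough; the caller can stop the VM
        else:
            sleep(poll_interval)  # wait before polling the DB again
```

After the function returns, the worker would shut down its own VM; how to do that (e.g. via the Compute Engine API or an OS shutdown) is left out here.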

Questions:

Is there any experience to build on?
Is anyone interested in using this code? May I contribute it to the FireWorks community or to the FireWorks project?
What’s the last FireWorks release fully tested on Python 2.7? (My aim is to complete cloud migration before Python 3 migration.)
In the docs, does “FireServer” mean “the server running MongoDB”? Ditto for “LaunchPad”?

Happy fireworking!

Anubhav_Jain · January 3, 2020, 6:26pm

Hi,

I think it’s great that you’re considering running FireWorks on Google Cloud Platform. In response to your questions:

Unfortunately, no. We have never run on GCP or even AWS and there is nothing planned for the foreseeable future.
I think it would be a great contribution to the FireWorks codebase. However, I would unfortunately not really be able to help with testing / etc. as we don’t have anything with GCP planned.
The last py2 test that seems to have passed is this one: https://circleci.com/gh/materialsproject/fireworks/2263
So you could either use FWS v1.9.4 or check out commit: 6505cd52ca3a7f34b549e6a18c7e2d0d336aa066

I think most incompatibilities with Py3 are due to the underlying libraries and not due to Py3-specific code in FireWorks itself.

Yes, the terminology used in the docs is unclear. I think I initially meant “FireServer” to refer to the physical server that hosts your MongoDB install / LaunchPad, and “LaunchPad” to be the more abstract object (e.g., one could imagine a LaunchPad with no external server at all, just a flat file), but I might have interchanged the terms.

Let me know how it goes! Happy to answer any more questions that come up along the way as I am able.

cosmo · July 16, 2020, 7:21am

I’ve been busy with things like moving our team to Python 3, but FireWorks on Google Cloud Platform works fine and people are welcome to try it!

See the repo borealis and PyPI page borealis-fireworks.

It could use slicker docs (or at least feedback iterations), unit tests, and beta testing.

The parts:

gce: an API and CLI for launching a group of worker nodes. The workers start up from a disk image, and the docs explain how to construct that. (Shell scripts could make that easier.)
Fireworker: The main program for the workers. It sets up logging, gets job parameters, launches rockets in rapidfire mode, and eventually shuts down.
DockerTask: A Firetask that runs a payload task within a Docker container. It copies task inputs and outputs to/from Google Cloud Storage. It’s optional but containers are great to distribute your task code, libraries, and data files to worker nodes.

There’s some related code in the application repo that might be useful to generalize and bring in, e.g. a command line tool that builds our workflow, uploads it to the LaunchPad, and launches the workers.

I had no trouble with FireWorks in building this, although it’s unclear whether qadapters would fit in.