Porting a bioinformatics tool to the web using WebAssembly, React and JavaScript
We recently released a beta version of PopPUNK-web (https://web.poppunk.net). It uses a WebAssembly (WASM) build of pp-sketchlib to sketch a user-supplied genome assembly in the browser. The sketch is sent as JSON to a server running PopPUNK behind gunicorn and Flask, which runs query assignment against a large database of genomes from the GPS project and returns JSON containing the strain assignment, a tree and a network. These are then displayed by a React app.
This was my first attempt at front-end development, and there were a few parts I found quite tough to get right. In this post I will cover the main problems and how we solved them.
First, links to the documentation and source code:
- Docs: https://poppunk.readthedocs.io/en/latest/online.html
- PopPUNK source: https://github.com/johnlees/PopPUNK
- PopPUNK-web source: https://github.com/johnlees/PopPUNK-web
- pp-sketchlib source: https://github.com/johnlees/pp-sketchlib
WASM
I used emscripten to compile the sketching code. Initially this just required a target ‘wrapper’ around the sketching class, and a single embind function to expose it to JavaScript. However, the following all helped in various ways (mostly compiler options):
- Start with a non-optimised compile until you get everything working.
- Add -O3 or -Os as a first optimisation step.
- Make sure to export the filesystem (FS) module, and node.js/workerfs.js extensions.
- Add options to allow memory growth/malloc.
- Add the modularize option.
- Turn exceptions off (h/t to https://twitter.com/bitmagicio for pointing this out). This required some code changes to replace throw with abort (based on a preprocessor flag), and should obviously only be done as part of optimisation. Of all the optimisation steps, this was most effective, and reduced the size of the WASM and JS files considerably.
- Run the closure compiler as a last step. This improves the JS but mangles some names (especially if not modularised), and sorting this out can be tricky.
The compiler flags I ended up using can be found in the web target of src/Makefile: https://github.com/johnlees/pp-sketchlib/blob/master/src/Makefile#L51-L69
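As a rough illustration of what the modularize option changes on the JavaScript side: in recent Emscripten releases the generated JS exports a factory function that returns a promise for the module, rather than creating a global. The sketch below assumes that behaviour. `factory` stands for whatever the generated file exports, and `sketch` is a hypothetical name for the embind function, not the real pp-sketchlib export.

```javascript
// Minimal sketch, not the PopPUNK-web code.
// `factory`: the function exported by the Emscripten-generated JS when built
//            with the modularize option (assumed to return a promise).
// `sketch`:  hypothetical name for the function exposed through embind.
async function loadSketcher(factory) {
  // Waiting on the factory means nothing is called before the WASM
  // binary has been instantiated and the runtime is ready.
  const module = await factory();
  // A missing export is the typical symptom of a build problem, such as
  // names being changed by the closure compiler.
  if (typeof module.sketch !== 'function') {
    throw new Error('embind export "sketch" not found on the module');
  }
  return module;
}
```

Checking for the export immediately after loading makes name-mangling problems show up at start-up rather than on the first user upload.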
React
Integrating the WASM into a Create React App project was also tricky. We needed to do the following:
- Store your wasm and js from emscripten in the static/ folder.
- Run the WASM in a web worker, so the page doesn’t hang while it’s computing.
- Add a line to turn off ESLint on the emscripten JS (I did this as part of the make step).
- You can access user files via module.FS, for example to mount a file:
    module.FS.mount(module.FS.filesystems.WORKERFS, { files: [f] }, '/working');
  For security, this only gives access to files that have gone through the ‘upload’ interface of the browser, but mounting them allows them to be read by programs running in the browser.
The filesystem was the most difficult part to work out here!
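To make the mounting step more concrete, here is a minimal sketch of how a web worker might mount an uploaded file, run the sketch and clean up afterwards. It assumes the module was built with the FS and WORKERFS exports described above. `sketch` is a hypothetical name for the embind function, and the function and constant names are made up for illustration.

```javascript
// Minimal sketch, not the PopPUNK-web code.
// The generated Emscripten JS itself would start with /* eslint-disable */.
const MOUNT_POINT = '/working';

// `module`: an instantiated Emscripten module exposing FS and WORKERFS.
// `file`:   a File object from the browser's upload interface.
// `module.sketch` is a hypothetical embind function taking a file path.
function sketchUploadedFile(module, file) {
  const { FS } = module;
  FS.mkdir(MOUNT_POINT); // the mount point must exist before mounting
  FS.mount(FS.filesystems.WORKERFS, { files: [file] }, MOUNT_POINT);
  try {
    // WORKERFS exposes each file under the mount point by its name.
    return module.sketch(`${MOUNT_POINT}/${file.name}`);
  } finally {
    // Unmount even if sketching fails, so the next upload can reuse the path.
    FS.unmount(MOUNT_POINT);
    FS.rmdir(MOUNT_POINT);
  }
}

// Inside the worker script this could be wired up as (not run here):
// self.onmessage = (event) =>
//   self.postMessage(sketchUploadedFile(module, event.data));
```

WORKERFS only works inside a worker, which fits with running the WASM off the main thread anyway.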
Deployment
This was also mostly new to me. I chose to use Azure to run all of this, and used GitHub Actions to automate the deployment when there is new code.
A difficulty is that the Python backend runs in a (fairly complex) environment which I normally manage using conda, and requires access to files which are multiple GB in size. The ‘python app’ deployment is a non-starter, but running a container (App Service) and mounting separate storage (Blob storage) did the job.
- I was able to package everything in a conda environment within Docker by following this excellent example.
- ssh was not trivial to set up, but is worth doing as it makes exploring errors a lot easier. Launch the sshd in your entrypoint.sh
- Set up a default route in flask that you can check with curl/your browser to check the app is running.
- Ports: expose a port in the dockerfile (say 8000); add this as WEBSITES_PORT in the application settings; bind to all interfaces, not just loopback, in the gunicorn call (-b 0.0.0.0:8000).
- Follow these tips, setting the tmp file to shared memory space, and increasing the number of workers.
- Note also that /dev/shm is currently restricted to 64 MB in Azure docker instances, so don’t get too excited about using it.
- Docker Hub was easy to get going with for testing; I moved to Azure Container Registry later, once everything else was working.
- Tag your images with ’latest’ or the version. If you make one for every commit you’ll soon run out of space.
- Try not to write to the Azure shared storage; copy the files you need to the VM on app start if possible.
- You may wish to combine routes into a single request – this reduces latency/overheads.
- shutil function calls may not work on azure (they raise a signal which should be ignored).
- Set up continuous deployment in Azure: with Docker Hub there is a webhook you can copy across, while with Azure Container Registry it is automatic.
- Scale up your azure instance once you’re done testing.
You may find my final dockerfile (and friends) and github actions setup useful. They can be found in the docker/ and .github/workflows subdirectories.
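On the tip about combining routes into a single request, the client side could look roughly like the sketch below: one POST carries the sketch, and one JSON reply carries everything the front end needs. The `/upload` path and the function name are assumptions for illustration, not the real PopPUNK-web API.

```javascript
// Minimal sketch, not the PopPUNK-web code.
// Hypothetical endpoint path; the real route may differ.
const ENDPOINT = '/upload';

// Send a sketch (a plain object) in one POST and return the parsed JSON
// reply. `fetchImpl` is injectable so the function can be exercised without
// a network; it defaults to the global fetch.
async function submitSketch(baseUrl, sketch, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(new URL(ENDPOINT, baseUrl).toString(), {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(sketch),
  });
  if (!response.ok) {
    // Surface server errors instead of trying to parse an error page.
    throw new Error(`Server returned HTTP ${response.status}`);
  }
  return response.json();
}
```

With a single round trip, the latency cost is paid once per uploaded genome rather than once per result type.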
Summary
Although an initial port of C++ code to WASM was relatively easy, I underestimated the amount of engineering required to optimise it, and wire everything up in a way that could be deployed. The first time doing a cloud-based deployment will be difficult, but then it gets a lot easier once you’ve got your head around it.
Particular thanks go to Daniel Anderson who did most of the heavy lifting with the front-end and testing, and suffered through some tough times with the emscripten file system modules!