I was hacking around on part of a project last week that looked something like this:
Fetch an API Response → Validate response → Munge response → Write response to a DB
If any of the steps fails, then the whole production should stop or else Bad Things happen. Furthermore, I would like to be able to pick up where I left off in the process, should one of the steps fail - the validate step was somewhat CPU intensive, so I'd rather not lose that work if it succeeds. This is a pretty common workflow, so I wanted to apply as much of the Unix Way to it as I could, in hopes that my solution would be simpler to build and more robust. That turned out to be true.
Makefile Abuse
As chance would have it, GNU Make solved this problem for me without a whole lot of effort. Here's what my makefile looked like:
# Each target is a file on disk; a rule only runs when its file is missing
# or older than its prerequisites. Recipe lines must be indented with tabs.

api_response.json:
	curl -o api_response.json http://api.company.com/endpoint

validated_response.json: api_response.json
	validate_response -o validated_response.json api_response.json

munged_response.json: validated_response.json
	munge_response -o munged_response.json validated_response.json

update_database: munged_response.json
	copy_response_to_db munged_response.json

clean:
	rm -f munged_response.json
	rm -f validated_response.json
	rm -f api_response.json

.PHONY: update_database clean
To execute the workflow, I invoke make -f workflow.mk update_database, which will do the following:
- Compute the dependency tree: munged_response.json depends on validated_response.json, which depends on api_response.json.
- If any of these files is not on disk, recursively execute the make targets to build the missing ones.
- Update the database.
- If any of the recursive executions fails (a command returns nonzero), freak the fuck out and print an error message.

The .PHONY line tells make that the clean and update_database targets are always out of date and need to be run every time.
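A dry run makes this behavior easy to see. Nothing here is exotic: -n is GNU Make's dry-run flag, which prints the commands it would run without executing them, and the files are the ones from the makefile above.

# From a clean slate, make plans all four steps:
make -n -f workflow.mk update_database

# If munge_response blew up last time, api_response.json and
# validated_response.json are still on disk, so only the munge and
# database steps are planned:
make -n -f workflow.mk update_database

# Start over from scratch:
make -f workflow.mk clean update_database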
Graceful and Robust
There are a couple of things that I really like about this gadget:
- Fail-fast execution: if any of the steps before update_database fails, the database doesn't get updated.
- Pick up where you left off: if munge_response fails after the fetch and validate steps succeed, the next run won't fetch & validate again unless I make clean.
- Small programs that do small things: instead of one monolithic program, there are 4 independent programs that perform the work: curl, validate_response, munge_response, and copy_response_to_db. This modular system is more debuggable and robust than a single program that does everything.
- Free parallelization where available: since this workflow is a linear dependency chain, make can't parallelize it. However, if there were another step that depended only on munged_response.json, say publish_munged_response, make could run publish_munged_response and update_database in parallel, as neither depends on the other (see the sketch after this list).
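Here's roughly what that hypothetical branch could look like. To be clear, publish_munged_response and the publish_response command are made up for illustration; the -j flag is GNU Make's standard parallel-jobs option.

# A second consumer of munged_response.json, independent of the database copy.
publish_munged_response: munged_response.json
	publish_response munged_response.json

# Ask for both leaves of the dependency graph in one invocation.
all: update_database publish_munged_response

.PHONY: all publish_munged_response

Running make -j 2 -f workflow.mk all would then let make execute publish_response and copy_response_to_db concurrently once munged_response.json exists.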
I could have used a pretty standard pipeline to solve this problem, for example:
curl http://api.company.com/endpoint | validate_response | munge_response | copy_response_to_db

But that would not satisfy the "pick up where you left off" requirement without some Aristocrats joke within each of the processing programs to track state, and pipelines are linear - it's hard to get the free parallelization without doing some shameful things.
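For what it's worth, the closest a plain pipeline gets to checkpointing is probably sprinkling tee into the pipe to save the intermediates - a sketch, assuming the programs read stdin and write stdout as in the one-liner above:

curl http://api.company.com/endpoint \
  | tee api_response.json \
  | validate_response \
  | tee validated_response.json \
  | munge_response \
  | tee munged_response.json \
  | copy_response_to_db

Even then, a re-run replays every stage from the top: nothing decides which saved files are still good and which steps can be skipped, which is exactly the bookkeeping make does for free.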
Feels good, man.