I was hacking around on part of a project last week that looked something like this:
Fetch an API Response → Validate response → Munge response → Write response to a DB
If any of the steps fail, then the whole production should stop or else Bad Things happen. Furthermore, I would like to be able to pick up where I left off in the process, should one of the steps fail - the validate step was somewhat CPU intensive, so I'd rather not lose that work if it succeeds. This is a pretty common workflow, so I wanted to apply as much of the Unix Way to it as I could, in hopes that my solution would be easier and more robust. That turned out to be true.
As chance would have it, GNU Make solved this problem for me without a whole lot of effort. Here's what my makefile looked like:
```make
api_response.json:
	curl -o api_response.json http://api.company.com/endpoint

validated_response.json: api_response.json
	validate_response -o validated_response.json api_response.json

munged_response.json: validated_response.json
	munge_response -o munged_response.json validated_response.json

update_database: munged_response.json
	copy_response_to_db munged_response.json

clean:
	rm -f munged_response.json
	rm -f validated_response.json
	rm -f api_response.json

.PHONY: update_database clean
```
To execute the workflow, I invoke `make -f workflow.mk update_database`, which will do the following:
- Compute the dependency tree: `update_database` depends on `munged_response.json`, which depends on `validated_response.json`, which depends on `api_response.json`.
- If any of these files is not on disk, recursively execute the `make` targets to build the missing ones.
- Update the database.
- If any of the recursive executions fails (command returns nonzero), freak the fuck out and print an error message.
- The `.PHONY` declaration tells `make` that the `clean` and `update_database` targets are always out of date, and need to be run every time.
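A nice side effect of driving the workflow with `make` is that you can preview exactly what it's going to do. These invocations sketch the day-to-day usage (assuming the makefile above is saved as `workflow.mk`):

```shell
# Dry run: print the commands make WOULD execute, without running anything.
# Only the steps whose output files are missing or stale will appear.
make -n -f workflow.mk update_database

# Run the workflow for real; a nonzero exit from any step aborts the chain.
make -f workflow.mk update_database

# Throw away all intermediate files and start from the top.
make -f workflow.mk clean update_database
```

The `-n` (`--dry-run`) flag is handy for sanity-checking which steps `make` thinks are out of date before you spend CPU time on them.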
Graceful and Robust
There are a couple of things that I really like about this gadget:
- Fail-fast execution: if any of the steps before `update_database` fail, the database doesn't get updated.
- Pick up where you left off: if `munge_response` fails after the fetch and validate steps succeed, the next time it executes, it won't fetch & validate again unless I `make clean` first.
- Small programs that do small things: instead of one monolithic program, there are 4 independent programs that perform the work: `curl`, `validate_response`, `munge_response`, and `copy_response_to_db`. This modular system is more debuggable and robust than a single program that does everything.
- Free parallelization where available: since the workflow is a linear dependency chain, `make` can't parallelize it. However, if there were another step that only depended on `api_response.json`, `make` would be able to build it in parallel with the rest of the chain on the way to `update_database`, as they are not linearly dependent on one another.
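To make the parallelization point concrete, here's a sketch of what such a branch might look like, with a hypothetical `summarize_response` program (not part of the workflow above) that needs only the validated response:

```make
# Hypothetical side branch: a summary built straight from the validated
# response. It does not depend on munged_response.json, so with `make -j2`
# this target and munged_response.json can be built concurrently.
summary.txt: validated_response.json
	summarize_response -o summary.txt validated_response.json

update_database: munged_response.json summary.txt
	copy_response_to_db munged_response.json
```

Running `make -j2 -f workflow.mk update_database` would then let `make` schedule the two independent branches at the same time.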
I could have used a pretty standard pipeline to solve this problem, for example:
```shell
curl http://api.company.com/endpoint | validate_response | munge_response | copy_response_to_db
```

But that would not satisfy the "pick up where you left off" requirement without some Aristocrats joke within each of the processing programs to track state, and pipelines are linear - it's hard to get the free parallelization without doing some shameful things.
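For what it's worth, the fail-fast half of the problem is easy to bolt onto a pipeline with bash's `pipefail` option, though it buys you no resumability. A quick demonstration with stand-in commands:

```shell
#!/usr/bin/env bash
# By default a pipeline's exit status is that of its LAST command,
# so a failure in an early stage is silently swallowed:
false | cat
echo "without pipefail: exit=$?"    # prints: without pipefail: exit=0

# pipefail makes the pipeline fail if ANY stage fails:
set -o pipefail
false | cat
echo "with pipefail: exit=$?"       # prints: with pipefail: exit=1
```

Even with `pipefail`, a failed pipeline leaves no intermediate files behind, so there's nothing to pick up from on the next run.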
Feels good, man.