I was hacking around on part of a project last week that looked something like this:
Fetch an API Response → Validate response → Munge response → Write response to a DB
If any of the steps fail, then the whole production should stop or else Bad Things happen. Furthermore, I would like to be able to pick up where I left off in the process, should one of the steps fail - the validate step was somewhat CPU intensive, so I'd rather not lose that work if it succeeds. This is a pretty common workflow, so I wanted to apply as much of the Unix Way to it as I could, in hopes that my solution would be easier and more robust. That turned out to be true.
Makefile Abuse
As chance would have it, GNU Make solved this problem for me without a whole lot of effort. Here's what my makefile looked like:
```make
api_response.json:
	curl -o api_response.json http://api.company.com/endpoint

validated_response.json: api_response.json
	validate_response -o validated_response.json api_response.json

munged_response.json: validated_response.json
	munge_response -o munged_response.json validated_response.json

update_database: munged_response.json
	copy_response_to_db munged_response.json

clean:
	rm -f munged_response.json
	rm -f validated_response.json
	rm -f api_response.json

.PHONY: update_database clean
```
To execute the workflow, I invoke `make -f workflow.mk update_database`, which will do the following:
- Compute the dependency tree: `munged_response.json` depends on `validated_response.json`, which depends on `api_response.json`.
- If any of these files is not on disk, recursively execute the `make` targets to build the missing ones.
- Update the database.
- If any of the recursive executions fails (a command returns nonzero), freak the fuck out and print an error message.
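Concretely, suppose `munge_response` exits nonzero on the first run. Here's a sketch of what the retry looks like, assuming the tools behave as described above:

```sh
$ make -f workflow.mk update_database
# curl and validate_response succeed, munge_response exits nonzero,
# so make stops and update_database never runs

$ make -f workflow.mk update_database
# api_response.json and validated_response.json are already on disk,
# so make skips straight to rebuilding munged_response.json
```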
The `.PHONY` line tells `make` that the `clean` and `update_database` targets are always out of date and need to be run every time.
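Without that line, a stray file with the same name as a target would quietly defeat it. A minimal sketch of the failure mode, with a `touch` to fake the condition:

```sh
$ touch clean               # stray file named after the target
$ make -f workflow.mk clean
# the clean target has no prerequisites, so a file named `clean`
# always looks up to date; without .PHONY, make would skip the rm's
```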
Graceful and Robust
There are a couple of things that I really like about this gadget:
- Fail-fast execution: if any of the steps before `update_database` fails, the database doesn't get updated.
- Pick up where you left off: if `munge_response` fails after the fetch and validate steps succeed, the next time it executes, it won't fetch & validate again unless I `make clean`.
- Small programs that do small things: instead of one monolithic program, there are 4 independent programs that perform the work: `curl`, `validate_response`, `munge_response`, and `copy_response_to_db`. This modular system is more debuggable and robust than a single program that does everything.
- Free parallelization where available: since the workflow is a linear dependency chain, `make` can't parallelize it. However, if there were another step that only depended on `munged_response.json`, say, `publish_munged_response`, `make` would be able to parallelize `publish_munged_response` and `update_database`, as they are not linearly dependent on one another (see the sketch after this list).
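Here's a sketch of how that hypothetical step might slot in; `publish_response` is a made-up command standing in for whatever the publish step actually runs, and the `all` target is just a convenience so one invocation covers both branches:

```make
publish_munged_response: munged_response.json
	publish_response munged_response.json

all: update_database publish_munged_response

.PHONY: all update_database publish_munged_response clean
```

With that in place, `make -f workflow.mk -j2 all` lets `make` run the publish and database-copy steps concurrently once `munged_response.json` exists, since neither waits on the other.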
I could have used a pretty standard pipeline to solve this problem, for example:
```sh
curl http://api.company.com/endpoint | validate_response | munge_response | copy_response_to_db
```

But that would not satisfy the "pick up where you left off" requirement without some Aristocrats joke within each of the processing programs to track state, and pipelines are linear - it's hard to get the free parallelization without doing some shameful things.
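For the curious, the "shameful things" would look something like bash process substitution: the sketch below fans the munged output out to the hypothetical publish step from earlier, but you get none of make's resumability or per-step failure isolation.

```sh
curl http://api.company.com/endpoint \
  | validate_response \
  | munge_response \
  | tee >(publish_response) \
  | copy_response_to_db
```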
Feels good, man.