I was hacking around on part of a project last week that looked something like this:

Fetch an API Response → Validate response → Munge response → Write response to a DB

If any of the steps fail, then the whole production should stop or else Bad Things happen. Furthermore, I would like to be able to pick up where I left off in the process, should one of the steps fail - the validate step was somewhat CPU intensive, so I'd rather not lose that work if it succeeds. This is a pretty common workflow, so I wanted to apply as much of the Unix Way to it as I could, in hopes that my solution would be easier and more robust. That turned out to be true.

Makefile Abuse

As chance would have it, GNU Make solved this problem for me without a whole lot of effort. Here's what my makefile looked like:

api_response.json:
    curl -o api_response.json http://api.company.com/endpoint

validated_response.json: api_response.json
    validate_response -o validated_response.json api_response.json

munged_response.json: validated_response.json
    munge_response -o munged_response.json validated_response.json

update_database: munged_response.json
    copy_response_to_db munged_response.json

clean:
    rm -f munged_response.json
    rm -f validated_response.json
    rm -f api_response.json

.PHONY: update_database clean
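
Two tweaks worth considering, though neither is in the makefile above: curl's --fail flag, so an HTTP error status actually comes back as a nonzero exit instead of a saved error page, and GNU Make's .DELETE_ON_ERROR special target, so a half-written output file doesn't get treated as up to date on the next run. A sketch of both:

# Not in the original makefile: .DELETE_ON_ERROR makes GNU Make delete a
# target whose recipe failed, so a partially written file isn't mistaken
# for a finished one on the next run.
.DELETE_ON_ERROR:

# --fail makes curl exit nonzero on HTTP 4xx/5xx instead of quietly
# writing the error body to api_response.json.
api_response.json:
    curl --fail -o api_response.json http://api.company.com/endpoint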

To execute the workflow, I invoke make -f workflow.mk update_database, which will do the following:

  1. Compute the dependency tree: munged_response.json depends on validated_response.json which depends on api_response.json.
  2. If any of these files is not on disk (or is older than the file it's built from), make works back through the chain and runs the recipes needed to build it.
  3. Update the database
  4. If any of those commands returns nonzero along the way, freak the fuck out, print an error message, and stop.

The .PHONY line tells make that the clean and update_database targets don't correspond to real files, so they're always considered out of date and their recipes run every time.
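
For example, running make -n -f workflow.mk update_database from a clean directory (the -n flag prints the commands without executing them) should show the whole chain in dependency order, roughly:

curl -o api_response.json http://api.company.com/endpoint
validate_response -o validated_response.json api_response.json
munge_response -o munged_response.json validated_response.json
copy_response_to_db munged_response.json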

Graceful and Robust

There are a couple of things that I really like about this gadget:

  • Fail-fast execution: if any of the steps before update_database fail, the database doesn't get updated.
  • Pick up where you left off: if munge_response fails after the fetch and validate steps succeed, the next run won't fetch & validate again unless I make clean first.
  • Small programs that do small things: instead of one monolithic program, there are 4 independent programs that perform the work: curl, validate_response, munge_response, and copy_response_to_db. This modular system is more debuggable and robust than a single program that does everything.
  • Free parallelization where available: since the workflow is a linear dependency chain, make can't parallelize it. However, if there were another step that only depended on munged_response.json, say, publish_munged_response, make could run publish_munged_response and update_database in parallel, since neither depends on the other (see the sketch after this list).
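
Here's roughly what that extra target would look like; publish_response is an invented command standing in for whatever the publish step actually runs:

# Hypothetical extra target; publish_response is made up for illustration.
# It depends only on munged_response.json, so under make -j its recipe can
# run at the same time as update_database's.
publish_munged_response: munged_response.json
    publish_response munged_response.json

Asking for both goals with make -j2 -f workflow.mk update_database publish_munged_response builds the shared chain once, then runs the two final recipes side by side (publish_munged_response would also want to be added to the .PHONY line).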

I could have used a pretty standard pipeline to solve this problem, for example:

curl http://api.company.com/endpoint | validate_response | munge_response | copy_response_to_db

But that would not satisfy the "pick up where you left off" requirement without some Aristocrats joke of state tracking inside each of the processing programs, and pipelines are linear - it's hard to get the free parallelization without doing some shameful things.
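
For what it's worth, the closest pipeline analogue of the fan-out is bash process substitution, which is about the level of shamefulness in question. This sketch assumes the same made-up publish_response command and that each program reads stdin and writes stdout:

# Shameful: tee copies the munged stream to both the hypothetical
# publish_response and copy_response_to_db. Requires bash.
curl --fail http://api.company.com/endpoint \
  | validate_response \
  | munge_response \
  | tee >(publish_response) \
  | copy_response_to_db

It runs, but the pipeline's exit status only reflects the last command unless you set -o pipefail, downstream stages happily chew on truncated input, and there's still no picking up where you left off.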

Feels good, man.