Programming in a startup is much different than programming at a big company. At a startup, not only are you the developer, but you are also the systems administrator for the most part. I've been startupping for three years now, and have had my ass kicked enough times to step back and think that maybe I should learn how to do things the right way rather than try to bludgeon my way through with raw intellect.

These are the things I wish I had known in the beginning, or at least I wish I hadn't been too stubborn to learn.

How To Avoid Over Complicating

Like most software people, I have a natural tendency to over-engineer things. To fight this urge, I've come up with two simple rules that help me avoid most of it:

If you are writing a program that touches more than two persistent data stores, it is too complicated.

And the hard disk counts as a persistent data store. This is kind of a reinterpretation of the Unix way of having small tools with a single input and a single output. Tracking state across many different persistent stores and handling the failure cases is just too much for one program to do.

If Linux can do it, you shouldn't.

Don't use Hadoop MapReduce until you have a solid reason why xargs won't solve your problem. Don't implement your own lockservice when Linux's advisory file locking works just fine. Don't do image processing work with PIL unless you have proven that command-line ImageMagick won't do the job. Modern Linux distributions are capable of a lot, and most hard problems are already solved for you. You just need to know where to look.
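
Take the lock service case: a minimal sketch of leaning on the kernel's advisory locking from Python, instead of standing up a lock service, looks like this (the lock path is made up for the example):

import fcntl
import sys

# Whoever grabs the flock on this file first does the work; a second copy
# started at the same time fails to get the lock and just exits.
LOCKFILE = "/var/lock/my-job.lock"

lock = open(LOCKFILE, "w")
try:
    fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
except IOError:
    sys.exit("another instance already holds the lock")

print("lock acquired, doing the work")
# ... the actual job goes here ...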

Parallelize When You Have To, Not When You Want To

I know it seems obvious, but sometimes I need to tell myself explicitly: if the physical machine is not the bottleneck, do not split the work across multiple physical machines. It is usually pretty apparent when you have to parallelize a CPU-bound job, but for I/O-bound stuff, you have to do some more in-depth measurement.

For example, if you are doing web crawling and you have not saturated the pipe to the internet, then it is not worth your time to use more servers. This guy got me thinking about it; he's doing "Large-scale HTTP fetching" in Clojure. He talks about parallelizing with some queueing silliness, but never mentions how much data is moving down the pipe on any one machine. If you have a 100 megabit connection to the internet and your fetcher is using 700 kilobits, then figure out why your fetcher sucks. (As a side note, I was talking about that post with Milo's prolific systems administrator, and we could not figure out whether the author was an incredibly elaborate troll or just a run-of-the-mill idiot.)
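
Before reaching for more machines, measure. Here's a back-of-the-envelope sketch that samples the kernel's receive counter to see how much of the pipe a fetcher is actually using (eth0 is an assumption; substitute whatever interface you actually have):

import time

IFACE = "eth0"
COUNTER = "/sys/class/net/%s/statistics/rx_bytes" % IFACE

def rx_bytes():
    # total bytes received on the interface since boot
    with open(COUNTER) as f:
        return int(f.read())

INTERVAL = 10
before = rx_bytes()
time.sleep(INTERVAL)
after = rx_bytes()
megabits = (after - before) * 8.0 / INTERVAL / 1e6
print("averaging %.1f megabits/sec inbound" % megabits)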

This tidbit also goes for data storage. I know Cassandra is all neat and whiz-bang, but I can pretty much guarantee that you don't need it. Multi-terabyte drives are cheap, and PostgreSQL is a known quantity. It's just not worth the risk.

How to Babysit a Process

This is one I only recently learned. If you have a process running and you want it to be restarted automatically if it crashes, use Upstart. Upstart is a replacement for the init daemon that can do a lot of cool things, one of which is restarting a process when it dies. An example Upstart config to do this looks like this:

# Respawn this process if it dies
respawn

# If you have to respawn 10 times in 10 minutes, give up
respawn limit 10 600

exec python /path/to/my/program.py
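
Drop that into something like /etc/init/myprogram.conf (the name is just an example) and Upstart picks it up; "start myprogram" kicks it off, and from then on crashes are handled for you.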

NoSQL is NotWorthIt

I've talked a lot of shit on NoSQL in the past, but I recently decided to see if I had been living a lie. I tried to use Redis for some not-as-mission-critical systems at Milo - applications where data loss is not that big of a deal. Redis, even though it's an in-memory database, has a virtual memory feature, where you can cap the amount of RAM it uses and have it spill the data over to disk. So, I threw 75GB of data at it, giving it a healthy amount of physical memory to keep hot keys in.
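
For reference, the knobs involved are the virtual memory directives in redis.conf. This is just a sketch of the Redis 2.x-era VM feature (it has since been removed), with an illustrative swap path and size rather than the exact values I used:

# Spill values that don't fit in RAM out to a swap file on disk
vm-enabled yes
vm-swap-file /var/lib/redis/redis.swap
# Amount of RAM to keep hot keys in before swapping, in bytes (16 GB here)
vm-max-memory 17179869184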

For the most part things went smoothly, until it hit the wall. The Redis server would hang, accepting new sockets but not servicing any requests. Replication stopped working. When restarting the server, it would take forty-five minutes (!) from invocation before it was ready to serve requests.

To nobody's surprise, I was right. Redis was an unknown quantity, both in how much data it could store reliably and how performance degraded. Yes, maybe things could have been different if I used Cassandra or MongoDB, but the point is the same: newfangled stuff is not worth the risk, especially if something like PostgreSQL can do the same job.

Event Loops are Just Okay

One of the most positively retarded things I've ever read comes out of the node.js home page, describing why nonblocking I/O is so great:

Almost no function in Node directly performs I/O, so the process never blocks. Because nothing blocks, less-than-expert programmers are able to develop fast systems.

Statements like this give me The Fear. Nothing ever blocks, huh? What about the callback that Node runs for new requests? If that does any CPU work, it sure as hell blocks. If each callback does 100 milliseconds of CPU work, then the Node server will only be able to handle 10 requests/sec as a theoretical maximum, because the event loop doesn't pick up new requests until the callback is complete. Scalability indeed.
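
You can watch this happen without Node at all. Here's a sketch of the same trap in Python's asyncio: each handler burns 100 milliseconds of CPU with no awaits, so twenty "requests" take about two seconds no matter how nonblocking the I/O layer is.

import asyncio
import time

async def handle_request():
    # 100 ms of pure CPU work with no awaits; the event loop is stuck here
    # and cannot service anything else until this returns
    deadline = time.monotonic() + 0.1
    while time.monotonic() < deadline:
        pass

async def main():
    start = time.monotonic()
    # fire 20 "requests" at the loop; they run one after another,
    # so throughput tops out at about 10/sec
    await asyncio.gather(*(handle_request() for _ in range(20)))
    print("20 requests in %.1f seconds" % (time.monotonic() - start))

asyncio.run(main())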

Nothing is more dangerous than a programmer who doesn't know what he doesn't know. Event loops work well if your server is heavily I/O bound, whereas if the server needs to do some nontrivial CPU work, you may be better off with threads. Hell, you can even use both, like Nginx does (well, worker processes, at least), to hold lots of sockets open but still do CPU work asynchronously.

The point is, evented I/O is not magic scalability pixie dust, and like anything, there is a tradeoff.

Hardware Matters

Cloud computing was built for suckers by hustlers. The physical machine your programs run on can make all the difference in the world when it comes to performance and reliability. I am not talking about using an Extra Large EC2 instance for your database because it's "beefier"; I am talking about understanding, down to the metal, what the performance characteristics of a system are.

For example, in a heavy write throughput application, you want fsync() to return as quickly as possible. (For those of you using MongoDB, fsync() is the system call a program makes to synchronize writes to disk). To the software, fsync is a black box that you can't muddle with, that is, unless you have your shit together in the hardware. If you spend the extra money on a battery-backed RAID controller, fsync can return almost immediately, because the controller can hold writes in battery-backed memory and guarantee that they will be flushed to disk, even in the event of a power failure. On a database machine, you will see a significant write performance increase if you use the correct hardware.
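
If you want to see the difference for yourself, a crude micro-benchmark is enough; the file path here is just an example:

import os
import time

# Write a small block, force it to stable storage, repeat. A battery-backed
# RAID controller acks the flush from its cache, so this number goes way up;
# on a bare disk you are waiting on the platters every time.
PATH = "/var/tmp/fsync-test"
SYNCS = 200

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
start = time.time()
for _ in range(SYNCS):
    os.write(fd, b"x" * 4096)
    os.fsync(fd)
elapsed = time.time() - start
os.close(fd)
os.unlink(PATH)
print("%.0f fsyncs/sec" % (SYNCS / elapsed))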

Suppose again that you are serving data from disk in a heavy read throughput application. Data is accessed randomly, so to service those requests, the disk's read heads are scurrying about the platters constantly. With EC2, when Amazon says "I/O performance: High", what does that even mean? Is that suitable for a heavy random read scenario? Again, knowing your shit when it comes to hardware is valuable here. Solid-state disks, while expensive, have unbelievable random read performance. (Their sequential read performance isn't amazingly better than spinning disks', though.) Spending the extra money on SSDs is almost always a win if you have a disk-bound problem.
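
The same kind of crude probe works here. Point it at a file much larger than RAM (the path is made up), otherwise you are benchmarking the page cache instead of the disk:

import os
import random
import time

# Seek to random 4 KB offsets in a big file and time the reads.
PATH = "/data/bigfile"
READS = 1000
BLOCK = 4096

size = os.path.getsize(PATH)
fd = os.open(PATH, os.O_RDONLY)
start = time.time()
for _ in range(READS):
    os.lseek(fd, random.randrange(0, size - BLOCK), os.SEEK_SET)
    os.read(fd, BLOCK)
elapsed = time.time() - start
os.close(fd)
print("%.0f random reads/sec" % (READS / elapsed))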

Hardware is one of the best places to put capital to work. It is far more efficient to buy your way out of a performance problem than it is to rewrite software. When running your app on commodity hardware, don't expect anything better than commodity performance.