dogs_love_md5.jpgThere have been a handful of places within the Persai pipeline where I have needed unique identifiers of varying length.  64 bits here, 32 bits there.  I'm not the only one to ever have to solve this problem, but I could never find a concise toolbox of information on it.

Automatic Increment or Not


MySQL has the AUTO_INCREMENT modifier for integral record keys.  That's great, if you're using MySQL.  In general, prefer a non-automatically increasing record identifier, unless you have a specific reason.  Here's why:

  • You may actually have to think about thread synchronization at some point when creating records.
  • If these identifiers become publicly visible, they can leak information about how many records are in your database.
  • If you make identifiers out of other pieces of data (say URLs), then you can't get the identifier value of a given datum without a table lookup.  And even then, you'll need another index on that field.
There are a few cases where automatic increment identifiers are good, though:

  • You are using a MySQL database and are setting up a simple structure of tables. (i.e. MySQL handles synchronization for you and it's actually harder to not use automatic increment)
  • The creation order of records is really important to you, but not important enough to store a timestamp field.

Making an Identifier Out Of Arbitrary Data

Easy, right?  Just hash whatever data you've got.  It's not reversible and spread uniformly over the identifier space.  However, many times the output of a standard hashing algorithm is too big.  SHA-1, for example, is 160 bits wide.  Way too long for most purposes.

In this case, I truncate the output.  Yes, this is mathematically valid, because any good hashing algorithm's output will be uniform over the range of the function.  And by uniform, I mean really uniform.  For example, if you take the first 64 bits of a 160-bit SHA-1 hash and call that your unique identifier, the probability of a collision is going to be uniform over the space of all 64-bit numbers.  If it wasn't (i.e. the first 64-bits of a SHA-1 hash were distributed, say normally), then the hash function would be cryptographically insecure.

Don't try to swing your dick around and come up with your own hash function.  You'll screw it up.  I know I have.

GUIDs

I got an e-mail from a reader about using GUIDs for unique identifiers.  This fits with the hashing scheme, but for the most part, I think GUIDs are far too large, especially if you are storing a lot of records.  GUIDs are 128 bits wide, so if you have a hundred million records, that's about 1.5GB worth of identifiers.  Use a 64-bit identifier, and your space is halved, without a significant increase in collision probability.

Making An Identifier Easier On The Eyes

If you need to put a unique identifier in a URL, it can't look too nerdy.  For example, this URL looks like shit:

http://www.website.com/document?id=1b25a53bf21d0206
Too many numbers.  So, to make it look better, Base-64 encode it.  It will lengthen the code a little, but it's much easier to look at:

http://www.website.com/document?id=ZnJvc3RlZCBidXR0cw==
Eh, well it looks better to me.  Personal taste, I guess.

You'll need to make sure that your Base-64 alphabet doesn't include the + and / characters: they aren't URL safe.

Sort Orderings

Don't worry about sort ordering unless you have to worry about sort ordering.  Duh.  The vast majority of Persai's data is stored simply as files, and for most purposes we don't have to care about the processing order.  We're fortunate in that regard (well maybe not fortunate, I mean that's like saying you're fortunate that you're not fat because you exercise and eat sensibly).

Anyway, there are a couple of places in Persai where sort order matters.  The ordering of recommendations, for example.  There, though, we're just ordering by time, and we need to display the exact time, not just the relative times of the recommendations, so we store a date field and order data by it in the store.

This drives one of my earlier points home: if you need ordering by time, don't count on an automatic increment unique identifier to do it.  It's much more robust to store a timestamp.

In fact this point goes deeper.  Very rarely do you actually need records sorted by record identifier.  What you need is the records sorted by some other value that happens to be reflected in the record identifier by virtue of automatic increment and the insertion order.  It's always more robust to store the actual value you need to sort by.

I'm Not Going To Tell You How To Write Code

Because I don't really care.  This is how I do it, though.