This Is America, Take Your Unicode Somewhere Else


There's a question that comes up on Stack Overflow every couple of months: "How do I strip diacritic marks from Unicode characters?" Popular variants include "How do I remove special characters?" and "How do I convert Unicode to ASCII?", but the underlying motivation is the same: characters that don't have their own key on an American keyboard have no place in modern web software.

Before you go all apeshit on me and call me a bigot and whatnot, read my story. When I was in college, Google hired me for a summer internship. One of my projects that summer was to write Google's employee directory search. Google, as I'm sure you can imagine, is a very multicultural employer. Googlers in general are very accepting of different cultures, customs, and languages. (Well, sort of. Googlers are accepting of multicultural differences like sushi, Diwali parties, and the word "namaste". They're not accepting of cultural differences like Old English 800, 22-inch rims, and the word "juicy". The general rule I figured out as a Googler is that you should welcome diversity so long as it doesn't make you feel guilty for making ten times as much money.)

Anyhow, as a result of pulling in a lot of foreign talent, my employee directory search had to handle UTF-8 properly. A lot of people's names had umlauts, tildes, and other such little nuggets that love to appear as diamonds with question marks in them. I figured, just make the database UTF-8, page encoding UTF-8, and everything should work fine, right? Well, it did, in theory. But when the first super-tolerant Googler typed his colleague's name into my search engine, it didn't come up. There was an o with an umlaut in the name, but our hero of race relations simply typed "o".

And that came through to me as a bug report. "Strip funny characters." So I did, and how the searches flowed. See if you can guess how many people would input diacritic marks into the search box.
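
For the curious, here is roughly what "strip funny characters" can look like for a search box. This is a minimal sketch, not the code from the actual directory search: the `fold` and `search` functions and the names in `DIRECTORY` are all made up for illustration. The idea is to fold both the stored name and the typed query to the same accent-free, case-insensitive form before comparing, so the stored data keeps its umlauts and only the comparison ignores them.

```python
import unicodedata


def fold(text):
    """Return a lowercase, accent-free version of text for comparisons."""
    # NFKD splits a character like "ö" into "o" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def search(directory, query):
    """Return every name whose folded form contains the folded query."""
    needle = fold(query)
    return [name for name in directory if needle in fold(name)]


# Hypothetical directory entries, invented for this example.
DIRECTORY = ["Jörg Müller", "Renée Peña", "Bob Smith"]
```

With that in place, `search(DIRECTORY, "jorg")` finds Jörg whether or not anyone bothers with the umlaut, and typing "Müller" with the umlaut still works too.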

Googlers are some of the most understanding people out there, and if they can't be bothered to type Alt-148 for an o with an umlaut, then what hope does the rest of the software industry have? None. That's why I want to systematically dismantle Unicode, and have a good answer to the question "How do I strip diacritic marks?" Not because handling multibyte character sets is too hard (although that asspain is what prompted me to think about this in the first place), but rather because only a small minority of people actually care about it, and an even smaller minority will whine when their umlauts disappear.

(To satisfy the pedants, clearly if you're writing software whose job it is to handle and store UTF-8, this advice isn't for you. I'm talking about web services with user input here.)

Now, you can feel free to take an idealist approach to this problem. Yes, Americans should be more accepting of other cultures and not passively destroy intricate details of pronunciation. Well, feel free to enjoy your floating-point market share. Nobody cares but you.

End Note
I found two decent implementations of Unicode transliteration, one in Python and one in Perl. If you know of good implementations in other languages, e-mail me and I'll add them to this list, with SEO-friendly anchor text goodness.

Strip diacritic marks in Python
Strip diacritic marks in Perl
Strip diacritic marks in Java (thanks to Simon Lieschke)
Strip diacritic marks in Lua (for all 8 of you who use it. Thanks to Petite Abeille)
Strip diacritic marks in PHP (thanks to Tommy Montgomery)
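
If you just want the general shape of the technique, here is a short Python sketch using only the standard library. It is not taken from any of the implementations listed above; `to_ascii` and the `FALLBACK` table are hypothetical names, and the table is nowhere near complete. Decomposition handles accented letters, but a handful of letters (ø, ß, ł and friends) have no decomposition at all, so they need a hand-written mapping. Anything still outside ASCII after that gets dropped, which is an assumption you may not want for non-Latin scripts.

```python
import unicodedata

# Letters that Unicode normalization leaves alone, mapped by hand.
# Illustrative only; a real transliteration table is much longer.
FALLBACK = {
    "ø": "o", "Ø": "O",
    "ß": "ss",
    "æ": "ae", "Æ": "AE",
    "đ": "d", "Đ": "D",
    "ł": "l", "Ł": "L",
}


def to_ascii(text):
    """Return a best-effort ASCII version of text."""
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            continue  # skip the accent itself, keep the base letter
        ch = FALLBACK.get(ch, ch)
        if ch.isascii():
            out.append(ch)
        # Characters that are still non-ASCII here are silently dropped.
    return "".join(out)
```

So `to_ascii("Motörhead")` comes back as "Motorhead" and `to_ascii("Straße")` as "Strasse". The libraries linked above cover far more characters than this table does.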

Ted Dziuba
