I don't like to talk about my job.  Don't get me wrong, I like what I do, I just don't like having to explain things to people who are feigning interest.  It's a waste of everyone's time.

When I do have to go into more detail than "I write software", I sex it up by saying "I write artificial intelligence software for recommendation systems".  Sounds pretty awesome when you say it like that, huh? 

Truthfully, that's like describing a summer job at Burger King as "caloric energy distribution engineer".

Yes, one of the things I do is implement machine learning methods for a news recommendation system.  The amount of prerequisite pain-in-the-ass, why-did-I-go-to-college-for-this work, though, dominates the cool stuff.

Vector Space Model AI... Sounds Hot

The idea here is that you turn your data into N-dimensional vectors and let loose a bunch of linear algebra on that shit.  In return, you get stuff like classification and clustering.  If you want to sound like you know what you're talking about here, you can mention stuff like separating hyperplanes, sigmoid kernel functions, or k-means++.
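To make "vectors in, classification out" concrete, here is a minimal sketch in Python.  It assumes the vectors already exist and uses a nearest-centroid classifier with cosine similarity, which is just about the simplest thing that qualifies, not an SVM.  The labels and the 3-dimensional training vectors are made up for illustration.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def classify(vector, centroids):
    """Return the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))

# Hypothetical training data: two labels, two vectors each.
training = {
    "sports": [[3.0, 0.0, 1.0], [2.0, 1.0, 0.0]],
    "finance": [[0.0, 3.0, 1.0], [1.0, 2.0, 2.0]],
}
centroids = {label: centroid(vecs) for label, vecs in training.items()}

print(classify([4.0, 1.0, 0.0], centroids))
```

Everything interesting here is a dot product, which is the point: once the data is in vector form, the rest is arithmetic.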

I deal mostly in the vector space model.  As awesome as all of this sounds, most of the work is a real pain in the balls.  Writing a sequential minimal optimization routine for a support vector machine is a good exercise, but it's not useful in practice.  Somebody else has already written it for me, and besides, that's not the problem I need to be concentrating on.

Most of the methods that deal with VSM machine learning are well defined and fairly easy to implement.  What remains a mystery, though, is the generation of the input.  How you translate your data into vectors is the most important problem to solve.  It's also the most boring.  After that, you can worry about shaving 3 nanoseconds off your dot product routine.
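For a sense of what "translating data into vectors" looks like at its most basic, here is a minimal bag-of-words sketch in Python.  The tokenizer, the vocabulary scheme, and the two sample sentences are all assumptions for illustration; a real pipeline would have to decide on stemming, stop words, weighting, and a pile of other boring things.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(documents):
    """Map every distinct term to a fixed vector index (alphabetical order)."""
    terms = sorted({term for doc in documents for term in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}

def vectorize(text, vocabulary):
    """Turn text into a term-count vector over the given vocabulary."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for term, count in counts.items():
        index = vocabulary.get(term)
        if index is not None:  # terms outside the vocabulary are dropped
            vector[index] = count
    return vector

def dot(a, b):
    """The routine you get to micro-optimize once everything else works."""
    return sum(x * y for x, y in zip(a, b))

docs = ["The cat sat on the mat", "The dog sat on the log"]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(vectors[0])                   # [1, 0, 0, 1, 1, 1, 2]
print(dot(vectors[0], vectors[1]))  # 6
```

Every choice in `tokenize` and `build_vocabulary` changes the vectors, and therefore everything downstream, which is why this step matters more than the dot product.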

Then You Need To Deal With The Academics

Publish or perish.  Yeah, that's cute, but in the real world it's profit or perish, and that means getting useful results.  Academics love machine learning because it affords them the opportunity to make slight variations to undergraduate-level mathematical procedures, quantify the result, and write it up in LaTeX with graphs and shit.

It's easy to write a paper showing the effects on precision and recall for a perceptron classifier on the Reuters corpus using normalized vs. non-normalized vectors.  It's not easy to generate data as clean as the Reuters corpus from a web crawl.  Not only is this task hard, it's about as much fun as chemotherapy.  As a result, there are no useful papers coming out of academia about how to parse HTML.  Unfortunately, problems like these are the ones that need solving.
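In case the terms are fuzzy, here is a minimal Python sketch of the two things such a paper would measure and vary: precision and recall on one side, L2 vector normalization on the other.  The function names and the toy predictions are assumptions for illustration, not anything taken from a real experiment.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned as-is."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)

def precision_recall(predicted, actual, positive=True):
    """Precision and recall for one class, from paired label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical classifier output versus ground truth.
predicted = [True, True, False, True, False]
actual = [True, False, False, True, True]

print(precision_recall(predicted, actual))
print(l2_normalize([3.0, 4.0]))
```

Precision is the fraction of positive predictions that were right, and recall is the fraction of actual positives that were found.  Both are a few lines of counting, which is the easy part the paragraph above is complaining about.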

When I talked earlier about all of the prerequisite bullshit, this is what I meant.  You get the most testicular pain when dealing in text content, and the real deep-rooted ball ache comes from web content.  We put a ton of effort into our HTML parsing routines, and it has paid off.

For reference, altering a method that helped with removing boilerplate content from a web page (boring) had a greater benefit to the accuracy of our classifier than did dimensionality reduction and normalization combined (sexy).  If you're not picking up what I'm putting down here, I'm saying that the really hard and less science-y improvements made the machine learning better than any of the shit you would read about in an ACM journal.
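To give a flavor of what boilerplate removal means, here is a deliberately naive Python sketch using the standard library's `html.parser`.  The list of tags to skip is an assumption, and this is not the method described above; it only shows the kind of unglamorous rule you end up writing, and it assumes well-formed markup where every skipped tag is actually closed.

```python
from html.parser import HTMLParser

# Assumption: text inside these elements is navigation, chrome, or code,
# never article content.  Real pages break this rule constantly.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}

class ContentExtractor(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped elements we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self.skip_depth == 0:
            self.chunks.append(text)

def extract_text(html):
    """Return the page text with the skipped elements stripped out."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<p>Real   story text.</p>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_text(page))  # Real story text.
```

On a real crawl this falls over quickly: unclosed tags, content inside layout tables, and "related stories" blocks that look exactly like article text.  Handling those cases is where the effort goes.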

This Is Going Somewhere I Promise

This really is just a buttsore blog post, but I'm on a roll now.

When I am working with the machine learning part of my job, I am rarely working in my development environment.  Most of the real stuff is done in Excel.  Well, at least it used to be, until I figured out that GNU R is so awesome it makes me want to fuck myself up with a chainsaw.

When I make a change to the inputs of a machine learning method (support vector machine in this case), I need to verify that the change I just made was actually positive.  And since that can't be done with a JUnit test, I have to get all scientific-method on that shit.  Remember in college when you snoozed through advanced statistics because it sucked?  Yeah, me too.  Good thing I kept the book.
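As one example of getting scientific-method about it, here is a minimal Python sketch of an exact paired permutation (sign-flip) test.  It assumes you score the classifier on the same folds before and after the change; the fold accuracies below are invented, and this particular test is just one reasonable choice from the statistics book, not necessarily the right one for your data.

```python
from itertools import product

def paired_permutation_test(before, after):
    """Two-sided exact sign-flip test on paired scores.

    Under the null hypothesis that the change did nothing, each paired
    difference is equally likely to have either sign.  The p-value is the
    fraction of all 2^n sign assignments whose summed difference is at
    least as extreme as the one observed.
    """
    diffs = [a - b for a, b in zip(after, before)]
    observed = abs(sum(diffs))
    total = 0
    extreme = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return extreme / total

# Hypothetical accuracy on the same 8 folds, before and after a change.
before = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80]
after = [0.83, 0.82, 0.84, 0.83, 0.85, 0.80, 0.86, 0.81]

p_value = paired_permutation_test(before, after)
print(p_value)  # 0.0078125
```

Every fold improved, so only the all-positive and all-negative sign assignments are as extreme as what was observed: 2 out of 256.  The enumeration is exponential in the number of pairs, so this only works for a handful of folds; beyond that you would sample sign assignments at random instead.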