In Practice, It's A Lot Harder Than What You Did In College
Up front, let me say that I implemented the vast majority of the machine learning technology behind Persai. In the beginning, I thought it was going to be a breeze. Take some documents, turn them into features, train a classifier, and off you go. The harsh reality is that less than one percent of the time I spent on this system went into the "sexy part" of machine learning, and most of that was done by the guy who wrote the SVM library we use!
The lion's share of time, and the source of most of the hair-pulling, was spent dealing with the data. Data coming in off of the open internet is dirty. Conflicting character set declarations, boilerplate removal, duplicate detection: these things will drive you to insanity and back.
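To make one of those chores concrete, here is a minimal sketch of near-duplicate detection using word shingles and Jaccard similarity. It is an illustration, not the code Persai runs: the function names, the shingle size of 3, and the 0.8 threshold are all assumptions chosen for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text.

    Text is lowercased and split on anything that is not a letter or digit,
    so differences in case, spacing, and punctuation are ignored.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Too short to shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, threshold=0.8):
    # threshold is an illustrative value, not a tuned one.
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```

Comparing every pair of documents this way is quadratic, so a real crawl would need something cheaper on top (hashing or sampling the shingles, for instance), but the sketch shows the shape of the problem.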
I was fooled by the simplicity when I first learned this stuff. This is what they teach you about classifiers based on the vector space model:
- Turn your data into vectors.
- Specify the positive and negative samples.
- Train your classifier.
- Tune your vectorization scheme and classifier parameters until the classifier is good.
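The textbook recipe above really is only a few lines of code, which is part of why it is so seductive. Here is a minimal sketch under loud assumptions: it uses a plain perceptron as a stand-in for an SVM so that it runs with nothing but the standard library, and every name in it (`tokenize`, `build_vocab`, `vectorize`, `train_perceptron`, `predict`) is hypothetical rather than taken from any real library.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(docs):
    """Map each distinct token to a feature index."""
    vocab = {}
    for doc in docs:
        for tok in tokenize(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Step 1: sparse term-frequency vector, {feature index: count}.

    Tokens that are not in the vocabulary are dropped.
    """
    counts = Counter(tokenize(text))
    return {vocab[t]: c for t, c in counts.items() if t in vocab}


def score(x, w, b):
    return b + sum(w[i] * v for i, v in x.items())


def train_perceptron(vectors, labels, n_features, epochs=20):
    """Step 3: train a linear classifier (a perceptron, NOT an SVM).

    labels are +1 for positive samples and -1 for negative ones.
    """
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            if y * score(x, w, b) <= 0:  # misclassified: nudge the weights
                for i, v in x.items():
                    w[i] += y * v
                b += y
    return w, b


def predict(x, w, b):
    return 1 if score(x, w, b) > 0 else -1


# Step 2: specify the positive and negative samples (toy data).
positive = ["python machine learning classifier",
            "training a classifier on text features"]
negative = ["cheap watches buy now", "buy cheap pills now"]

docs = positive + negative
labels = [1] * len(positive) + [-1] * len(negative)
vocab = build_vocab(docs)
w, b = train_perceptron([vectorize(d, vocab) for d in docs], labels, len(vocab))
```

Step 4, the tuning, would mean changing things like the tokenizer, the weighting scheme, or `epochs` and measuring again on held-out data. Notice that everything here assumes the input is already clean text.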
Publish Or Perish
If turning data into vectors is such a hard problem, why aren't the academics churning out papers about it? Because it's not sexy. There are no numerical nuances to deciding how to handle a document whose declared character set is ISO-8859-1 but is actually encoded in UTF-8. There's no Turing Award coming your way for finding a way to make reasonable text out of horrifically malformed HTML, the kind that makes you curse Firefox and Internet Explorer for agreeing to render it at all.
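For the character set case, one common heuristic is to distrust the declaration and try strict UTF-8 first, since ISO-8859-1 text with non-ASCII characters is rarely valid UTF-8 by accident. The sketch below assumes that heuristic; `decode_bytes` is a hypothetical helper, and the only library call it relies on is Python's built-in `bytes.decode`.

```python
def decode_bytes(raw, declared="iso-8859-1"):
    """Decode raw bytes while distrusting the declared charset.

    Returns a (text, charset_used) tuple. This is a heuristic, not a
    complete charset detector.
    """
    try:
        # Strict UTF-8 fails loudly on most non-UTF-8 input.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    try:
        return raw.decode(declared), declared
    except (UnicodeDecodeError, LookupError):
        # The declared charset is unknown or also wrong: decode lossily
        # and substitute U+FFFD for the bytes that cannot be decoded.
        return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

A page that declares ISO-8859-1 but ships UTF-8 bytes comes out as UTF-8, and a page that really is ISO-8859-1 falls through to the declared charset. Pure ASCII input decodes the same either way, so the order of the attempts costs nothing there.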
When I started Persai, I admitted that somebody else had already done the mathematical programming better than I ever could. I didn't spend years of my life studying numerical analysis, so chances are that if I attempted to write my own SVM library, I would fail. So, in the interest of success and avoiding Not-Invented-Here syndrome, I used somebody else's library.
People have busted my chops for this, too, as if I am somehow less of an engineer if I use a third-party library. However, one thing has become painfully obvious: the quality of a classifier depends much, much more on your ability to sanitize data than on the algorithm you use.