I'm working with a startup now on a text summarization project.  The requirements are fairly loose: "take all this text and make it smaller", solving the tl;dr problem (too long; didn't read).  There are a couple of critical details, namely identifying the sentiment of the text, and a few others that are excruciatingly domain-specific.

At first glance, this seems approachable with some natural language processing libraries.  Oh no.  There be dragons.  At Pressflip, I threw myself into a few NLP libraries, and the only takeaway I got from all that experience was "Don't use NLP.  Ever."

Why?  NLP is yucky.  It's complicated, the field is rife with academic shitheaddery, there are some major-asspain licensing issues with a couple of software packages, and best of all, it's balls slow.  Plus, if you venture down the road of natural language processing, the law of diminishing returns will pull you into a dark alley, pummel you with a tire iron, take your wallet, and then just to be a prick, steal your shoes so you have to walk home barefoot.

My point is, for 99 practical projects out of 100, you can cheat your way out of NLP.  Cook up some fancy shit with word frequencies and logarithms.  Reach back into your information retrieval notes for inspiration.  TF*IDF can take you a long way if you know how to use it.
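
To make that concrete, here's roughly what the cheat looks like for summarization: treat every sentence as its own little document, score it by the TF*IDF weight of its words, and keep the winners.  This is a back-of-the-napkin sketch with my own assumptions baked in (the regex sentence splitter, the length normalization, the whole scoring scheme), not what the actual project does.

```python
import math
import re
from collections import Counter

def summarize(text, num_sentences=3):
    """Naive TF*IDF extractive summary: keep the highest-scoring sentences."""
    # Crude sentence split -- fine for a sketch, not for production.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]

    # Document frequency: how many sentences each word shows up in.
    df = Counter()
    for words in tokenized:
        df.update(set(words))

    n = len(sentences)
    scores = []
    for i, words in enumerate(tokenized):
        tf = Counter(words)
        # Sum TF*IDF over the sentence, normalized by length so long
        # sentences don't win just by being long.
        score = sum(tf[w] * math.log(n / df[w]) for w in tf) / (len(words) or 1)
        scores.append((score, i))

    # Take the top sentences, but spit them back out in their original order.
    keep = sorted(i for _, i in sorted(scores, reverse=True)[:num_sentences])
    return ' '.join(sentences[i] for i in keep)
```

The point isn't this particular formula; it's that a Counter, a logarithm, and twenty-odd lines of Python get you something readable without ever touching an NLP library.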

When I was brainstorming the project I'm working on, my first thought was some hand-waving business about a part-of-speech tagger and a Markov Chain to figure out probabilities of part-of-speech transitions and all that fancy shit.  Factor in a little bit of sentiment detection from God-knows-where and that was my sketch.  Then practicality set in: how much time do you want to spend on this?  If you are considering NLP as the answer to a real problem, it's virtually certain that you're overthinking it.
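
For the record, here's about what that hand-waving boils down to, assuming you pulled in NLTK for the tagging (my assumption for illustration; nothing here is from the actual project): tag the words, count tag-to-tag transitions, normalize into probabilities.  Even the toy version drags in model downloads and a tagger that is not fast, which is exactly the point.

```python
from collections import Counter, defaultdict

import nltk

# One-time model downloads -- already part of the "how much time do you
# want to spend on this" tax.
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

def pos_transition_probs(text):
    """Estimate P(next tag | current tag) from part-of-speech bigrams."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]

    # Count how often each tag is followed by each other tag.
    counts = defaultdict(Counter)
    for current, nxt in zip(tags, tags[1:]):
        counts[current][nxt] += 1

    # Normalize the counts into transition probabilities.
    probs = {}
    for tag, nxts in counts.items():
        total = sum(nxts.values())
        probs[tag] = {nxt: c / total for nxt, c in nxts.items()}
    return probs
```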

That being said, NLP does have its place: making the best fucking Wikipedia search engine there ever was with technology licensed from Xerox and then selling yourself to Microsoft.