I have no formal training in natural language processing. As such, I figure out a lot of this shit on my own.
One of the simplest concepts in NLP/text mining is stemming. If you're not in the know, to stem a word is to remove all the unnecessary shit after its root.
For example, "computer", "computing" and "compute" all stem to "comput". Same root, virtually the same meaning.
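If you want to see the idea in code, here's a crude sketch. To be clear, this is NOT the Porter algorithm. It's a dumb suffix stripper I'm using for illustration, and the suffix list, the `toy_stem` name and the `min_stem_len` cutoff are all made up for this example.

```python
# Toy suffix stripper. NOT Porter, just the general idea of stemming.
# The suffix list is hypothetical and only covers the examples in this post.
SUFFIXES = ("ing", "ers", "er", "es", "s", "e")  # longer suffixes are tried first


def toy_stem(word, min_stem_len=3):
    """Lowercase the word and chop off the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        # Only strip if enough of the word survives to still mean something.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word


if __name__ == "__main__":
    for w in ("computer", "computing", "compute"):
        print(w, "->", toy_stem(w))
```

All three words come out as "comput". A real stemmer has a lot more rules than this, but the shape is the same: one word in, one root out, no context.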
Something like this is clearly useful in a search engine like Pressflip, because if somebody searches for "iphone" (and a lot of you people are), the engine should also pull up documents that contain the plural, "iphones".
The canonical algorithm for doing this sort of thing is called the Porter Stemming Algorithm, which considers each word on its own. Porter works great 99% of the time, but when it fails, it fucks you hard.
Why You Keep Tryin To Say That Word?
A good example of this comes from the Pressflip query logs. A user searched for "marketing". Perfectly reasonable. Porter stemmed that to "market", which returned a bunch of search results about the Dow Jones and Nasdaq. Ouch. Right in the butt.
What went wrong? In smart-talk, the bare infinitive that corresponds to the gerund has a different meaning than the gerund. Again, I know dick-shit about NLP, so maybe you guys have a serious-business name for this sort of thing.
So yeah, gerunds make Porter suck sometimes.
There are some other failure cases I've discovered. Proper nouns will give it to you Clydesdale-style, too. More specifically, proper nouns that don't stem to themselves. Example: "Mariners" and "Marin" share the same stem, "marin". So potentially, someone searching for the baseball team from Seattle will come up with news about the hoity-toity county across the Golden Gate Bridge from San Francisco.
What's the answer to this? If you're a company with millions in VC lottery winnings, you can pay Basistech $100,000 for a 3-year license for their context-sensitive stemmer. If you're me, though, you make exclusion lists. Big ones.
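Here's roughly what an exclusion list looks like in code. This is a minimal sketch and not the actual Pressflip code: `EXCLUSIONS`, `stem_with_exclusions` and `fake_stemmer` are names I invented for the example, and `fake_stemmer` is a stand-in you'd swap for a real Porter implementation.

```python
# Hypothetical exclusion list: words that must never be stemmed.
# A real one would be much bigger.
EXCLUSIONS = {"marketing", "mariners"}


def fake_stemmer(word):
    """Stand-in for a real Porter stemmer; strips a few suffixes only."""
    for suffix in ("ing", "ers", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_with_exclusions(word, stemmer, exclusions=EXCLUSIONS):
    """Return excluded words untouched (lowercased); stem everything else."""
    lowered = word.lower()
    if lowered in exclusions:
        return lowered
    return stemmer(lowered)


if __name__ == "__main__":
    for w in ("marketing", "markets", "Mariners", "Marin"):
        print(w, "->", stem_with_exclusions(w, fake_stemmer))
```

With the list in place, "marketing" and "markets" no longer land on the same term, and neither do "Mariners" and "Marin". The catch is that the same function has to run at index time and at query time, otherwise the excluded words won't match anything.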
That being said, after a large re-processing this weekend, Pressflip search quality is going to improve.