Sitting on docks of Bayes

There is an article on Slashdot entitled Bayesian Filtering for Dummies that points to a BBC article about spam and discusses Bayesian filtering as a method to combat it. What interests me is the commentary under the /. article. Someone idly speculated about implementing a Bayesian troll filter for Slashdot comments, which brought out a chorus of naysayers pointing out how it won’t work. I think they are wrong. From my experiments with my CRM114-based Usenet filter, I think it would work.

Some of the sages point out that spam gets filtered “because it has different characteristics since they are trying to sell you things.” But in the Usenet forsale groups, everyone is trying to sell things, and yet my filter was able to separate things I cared about from things I didn’t with >98% accuracy on a corpus of under 30 messages. That’s the beauty of Bayesian mechanisms: you don’t have to know or even understand what features the filter learns on, only that it does learn the features of the texts. CRM114 goes one better by learning words in all combinations of 1–5 word chunks (it does one pass learning all single words, another learning every two adjacent words, and so on).

Still, I think the troll filter would in fact work, and probably faster and more effectively than intuition would suggest. I don’t know how it would pick trolls out as noise, but I’ll bet a good filter could. There is already a huge corpus with a score attached to every comment. Run the filter over the database and automatically learn “this is a +3 comment, this is a −2,” and so on.
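The kind of learner described above can be sketched as a naive Bayes classifier over word n-grams. This is a minimal illustration, not CRM114 itself: the class, the `keep`/`junk` labels, and the add-one smoothing are my own assumptions, and the 1–5 word chunking is only a loose approximation of CRM114’s feature extraction.

```python
from collections import defaultdict
import math

def ngrams(text, max_n=5):
    # Loose approximation of CRM114-style features: every run of
    # 1..max_n adjacent words becomes a feature.
    words = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            feats.append(" ".join(words[i:i + n]))
    return feats

class NaiveBayes:
    def __init__(self, labels=("keep", "junk")):
        self.counts = {lbl: defaultdict(int) for lbl in labels}
        self.totals = {lbl: 0 for lbl in labels}
        self.docs = {lbl: 0 for lbl in labels}
        self.vocab = set()

    def learn(self, text, label):
        # Count every n-gram feature toward the given label.
        for f in ngrams(text):
            self.counts[label][f] += 1
            self.totals[label] += 1
            self.vocab.add(f)
        self.docs[label] += 1

    def classify(self, text):
        ndocs = sum(self.docs.values())
        best, best_score = None, float("-inf")
        for label in self.counts:
            # Log prior plus log likelihood with add-one smoothing,
            # so unseen features don't zero out the whole score.
            score = math.log(self.docs[label] / ndocs)
            for f in ngrams(text):
                num = self.counts[label][f] + 1
                den = self.totals[label] + len(self.vocab)
                score += math.log(num / den)
            if score > best_score:
                best, best_score = label, score
        return best
```

The same structure handles the Slashdot idea directly: instead of two labels, pass the moderation scores (“+3”, “−2”, …) as labels and learn from the already-scored comment corpus.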

One Reply to “Sitting on docks of Bayes”

  1. I’ve been thinking of using CRM114 for other, non-email-related filtering tasks myself. It seems amazing, especially now that it has multiple filtering techniques 🙂

    But there’s one problem with the Slashdot processing idea, as stated above – it can’t tell the relevance of the comments to the topic subject or thread conversation. Every individual Slashdot article would need its own filtering, for the topic … in general. I could see it applied, perhaps, in a general way, to categories of articles’ comments (like flagging trollish Mac language in Linux articles). Still, nothing replaces the humans, because the ratings themselves are subjective and only gain value when applied by individuals to individual posts within the context of the article or comment thread. It’s more complex than just pattern matching can handle, I expect …

Comments are closed.