November 16, 2004 10:29 AM

Naive Bayes Classification

I've put together a quick summary of I've been playing with for the past few weeks. I have the set of all URLs (and their tags) bookmarked at delicious as of 22 January 2004. Before FOO camp, I downloaded the text from these urls (not following any links and not downloading images or flash), removed the HTML tags and cleaned the resulting text a bit (removing non-"typewriter" characters). This resulted in about 17000 different texts, varying in length from 100 bytes (my arbitrary lower limit) to 1 megabyte (a significant outlier).

I then took a median-length subset of these texts (about 2000) and cleaned their tags. This involved some semiautomatic stemming ("blogs" became "blog", etc) as well as some spelling corrections. I set aside 500 of these texts as "test" data and trained a Naive Bayes classifier on the rest (which took about 5 minutes).

The resulting classifier could assign a correct tag to the previously unseen testing texts about 80% of the time. For example, a given URL might have the three tags: "blog linux technology". The NB classifier would, 80% of the time, tag this URL "blog", "linux" or "technology." (Becuase of the data sparsity, as well as the numerical behavior of the probability calculations, the NB classifier isn't really capable of giving multiple, accurate tags in all cases.)

When I returned from FOO camp, I decided to reimplment the NB classifier, and re-run it on the entire corups of 17000 web pages. Unfortunately, the much larger size of this corups meant that I couldn't clean the tags and texts as thouroughly as before, and the results have been slightly less encouraging: about 70% accuracy. Interestingly, about 80% of the misclassifcations were pages being mis-classified as "blogs," which probably means that "blog" is an overly broad tag.

The next step for me is probably to either tweak the Naive Bayes classifier, to see if I can coax better performance out of it, or perhaps to try a different algorithm. I implemented a support vector machine classifier a while ago; I may dust it off and give that technique a shot.

There are some aspects of this data set that make it both interesting and difficult to deal with.

a) It is multilingual (and unlabelled). There are quite a few pages in Spanish, Italian and German that I have grabbed, and these add pure noise. Not only are many of the words in these documents unique (compared to the english portion of the corups), but we lose the efficiency boost of stopwords. A friend of mine (another FOO camper, Maciej Ceglowski) has written a language iddentification Perl module. If I can use that to "weed out" non english pages, that would be a help.

b) The tags are unstructured. It is common to see tags like "linux/hardware" or "design+awful", which implies a heirarchy or structure that other users do not use.

c) The taggers are heterogenous and untrained. Most text classification depends on categories that are stable, orthogonal and uniformly applied. However, delicious users tag urls inconsistently and subjectively. For example, what one delicious user might tag as "funny" is not necessarily what someone else would. Or, in a better example, one user may tag a URL as "nl," short for "neurolingusitics" while someone else may use that tag to mean "the Netherlands."

However, some of these bugs are features. For example, a given URL might be bookmarked (and tagged) by multiple users, and we should be able to leverage this information. For example, if "" has ten tags total and nine of them are the tag "blog" and one of them is "silly", then boingboing is clearly "blog-ier" than it is "silly." Currently, I only consider unique tags for classification.

Posted by Jeff