November 16, 2004 10:29 AM

Naive Bayes Classification

I've put together a quick summary of I've been playing with for the past few weeks. I have the set of all URLs (and their tags) bookmarked at delicious as of 22 January 2004. Before FOO camp, I downloaded the text from these urls (not following any links and not downloading images or flash), removed the HTML tags and cleaned the resulting text a bit (removing non-"typewriter" characters). This resulted in about 17000 different texts, varying in length from 100 bytes (my arbitrary lower limit) to 1 megabyte (a significant outlier).

I then took a median-length subset of these texts (about 2000) and cleaned their tags. This involved some semiautomatic stemming ("blogs" became "blog", etc) as well as some spelling corrections. I set aside 500 of these texts as "test" data and trained a Naive Bayes classifier on the rest (which took about 5 minutes).

The resulting classifier could assign a correct tag to the previously unseen testing texts about 80% of the time. For example, a given URL might have the three tags: "blog linux technology". The NB classifier would, 80% of the time, tag this URL "blog", "linux" or "technology." (Becuase of the data sparsity, as well as the numerical behavior of the probability calculations, the NB classifier isn't really capable of giving multiple, accurate tags in all cases.)

When I returned from FOO camp, I decided to reimplment the NB classifier, and re-run it on the entire corups of 17000 web pages. Unfortunately, the much larger size of this corups meant that I couldn't clean the tags and texts as thouroughly as before, and the results have been slightly less encouraging: about 70% accuracy. Interestingly, about 80% of the misclassifcations were pages being mis-classified as "blogs," which probably means that "blog" is an overly broad tag.

The next step for me is probably to either tweak the Naive Bayes classifier, to see if I can coax better performance out of it, or perhaps to try a different algorithm. I implemented a support vector machine classifier a while ago; I may dust it off and give that technique a shot.

There are some aspects of this data set that make it both interesting and difficult to deal with.

a) It is multilingual (and unlabelled). There are quite a few pages in Spanish, Italian and German that I have grabbed, and these add pure noise. Not only are many of the words in these documents unique (compared to the english portion of the corups), but we lose the efficiency boost of stopwords. A friend of mine (another FOO camper, Maciej Ceglowski) has written a language iddentification Perl module. If I can use that to "weed out" non english pages, that would be a help.

b) The tags are unstructured. It is common to see tags like "linux/hardware" or "design+awful", which implies a heirarchy or structure that other users do not use.

c) The taggers are heterogenous and untrained. Most text classification depends on categories that are stable, orthogonal and uniformly applied. However, delicious users tag urls inconsistently and subjectively. For example, what one delicious user might tag as "funny" is not necessarily what someone else would. Or, in a better example, one user may tag a URL as "nl," short for "neurolingusitics" while someone else may use that tag to mean "the Netherlands."

However, some of these bugs are features. For example, a given URL might be bookmarked (and tagged) by multiple users, and we should be able to leverage this information. For example, if "" has ten tags total and nine of them are the tag "blog" and one of them is "silly", then boingboing is clearly "blog-ier" than it is "silly." Currently, I only consider unique tags for classification.

Posted by Jeff | Categories: Project updates | Permalink

November 13, 2004 1:19 PM

Delicious Tags

A few months ago at FOO camp, I gave a small talk about machine learning, both a technical overview of some algorithms (Bayesian learning, SVM, neural nets and so forth) as well as some practical applications on text-based data. After this talk, I became interested in the idea of analyzing the data with these same machine learning tools. Specifically, I wanted to see if I could implement a good "URL classifier" or "URL recommendation" engine.

A "URL classifier" is some process or algorithm that, given a new URL, generates a reasonable set of tags describing it. Similarly, a "URL recommender" recommends a url that is similar to one you've already tagged.

Although the two tasks are, in some sense, sides of the same coin I decided to tacke the second task first because it seemed easier. "If we tag page A with 'apples' and 'alchemy'," the naive reasoning goes, "couldn't we just offer up another page that someone else has tagged with 'apples' and 'alchemy' as well?" This works for many simple cases, but runs into two problems:

1. Ambiguity of tags. The first example of ambiguity I ran into was during my own searches for papers on natural language parsing. Natural language is frequently abbreviated, and thus tagged, as "nl." Unfortunately, so are many pages about the Netherlands. So, the lesson here is that the meaning of a specific tag depends on the meaning of the page it is tagging (obviously), and the meaning of the tags around it.

2. Different taste in tags. In short, it is possible (even likely) that two different users will tag the same page with two completely different sets of tags.

To be incredibly simplistic, the first problem will result in false positives -- a wrong page being presented as a correct one -- and the second problem will result in false negatives -- no page being returned when there might in fact be excellent matches out there.

Thus, after some study and experiments I determined that tags, while a valuable classifier of data, are "noisy" enough that they cannot be solely depended upon for useful recommendation.

Next: The URL classifier problem, and how it encounters the same difficulties.

Posted by Jeff | Permalink | Categories: New project

September 21, 2004 3:01 PM

Tangled up in FOO

I went to FOO camp two weekends ago, and ended up having a pretty good time. I was worried, at first, that my interests and skills (computer graphics, numerical optimization, and to a less accomplished degree, robotic/new media art) would be irrelevant in the face of the various blog people and "future of SOAP" technical discussions. However, the other attendees were by and large interesting people with wide-ranging minds, and we found a lot of common ground.

Although the people there were sharp and fun to talk to, the conference sessions themselves ended up being a bit weak. They suffered from a certain lack of organization, not only in timing and place, but also in focus. About a third of the talks I went to started out interesting, but kind of wandered off into irrelevent territory. The best example of this being the "Future of Email" talk, which started out with some interesting input on "email as a message" vs. "email as instigating a validated channel of communcation", and degenerated into a room of people listlessly watching some guy in a cloak doing nslookups on his laptop. However, I think disorganization is the price you pay for an ad hoc conference and, as disorganization goes, FOO wasn't bad at all.

Not to end on a negative note, I had some very good times chatting with Tim Anderson, who co-founded Z Corporation, a 3D printer company, MIT media lab art freaks Cameron Marlow and Jonah Peretti, as well as my friends Joshua Schachter and Maciej Ceglowski.

Posted by Jeff Smith | Permalink | Categories: News

August 06, 2004 7:42 PM

Another Old Project

After a long hiatus, I've added another new project to my projects section. Traces, which I worked on during late 1998 and 1999, was an art project that explored the ideas of presence in virtual spaces. In addition to helping to develop the ideas, I did the graphics programing for the CAVE 3D display. Feel free to download the code, but please let me know if you use it.

Posted by Jeff Smith | Permalink | Categories: New project

March 06, 2004 1:36 PM

Two old projects added

I've blocked out the style of "project" entries, and added two of my more recent peices of research to the projects page. For now, the source code download links are dead -- I haven't uploaded the code to smokingrobot from my research machine yet. Also, I'm undecided whether I should give away my research code. I may want to resturn to some of these projects some day, and giving away the code kind of gives away any "advantage" I have over other researchers. Also, I don't want to get into the "I downloaded your code and it doesn't work. Help me!" hellpit that offering code for download opens.

Posted by Jeff Smith | Permalink | Categories: New project