A few months ago at FOO camp, I gave a small talk about machine learning, covering both a technical overview of some algorithms (Bayesian learning, SVM, neural nets and so forth) and some practical applications to text-based data. After this talk, I became interested in the idea of analyzing the del.icio.us data with these same machine learning tools. Specifically, I wanted to see if I could implement a good "URL classifier" or "URL recommendation" engine.
A "URL classifier" is some process or algorithm that, given a new URL, generates a reasonable set of tags describing it. Similarly, a "URL recommender" recommends a URL that is similar to one you've already tagged.
Although the two tasks are, in some sense, sides of the same coin, I decided to tackle the second task first because it seemed easier. "If we tag page A with 'apples' and 'alchemy'," the naive reasoning goes, "couldn't we just offer up another page that someone else has tagged with 'apples' and 'alchemy' as well?" This works for many simple cases, but runs into two problems:
1. Ambiguity of tags. The first example of ambiguity I ran into was during my own searches for papers on natural language parsing. Natural language is frequently abbreviated, and thus tagged, as "nl." Unfortunately, so are many pages about the Netherlands. So, the lesson here is that the meaning of a specific tag depends on the meaning of the page it is tagging (obviously), and the meaning of the tags around it.
2. Different taste in tags. In short, it is possible (even likely) that two different users will tag the same page with two completely different sets of tags.
To be incredibly simplistic, the first problem will result in false positives -- a wrong page being presented as a correct one -- and the second problem will result in false negatives -- no page being returned when there might in fact be excellent matches out there.
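To make the two failure modes concrete, here is a minimal sketch of the naive tag-matching recommender described above. It is only an illustration, not the code I actually ran: the `recommend` function, the example.com URLs, and the tag sets are all made up for this example, and "similar" is assumed to mean "carries every tag of the page you started from."

```python
def recommend(tagged_pages, query_tags, exclude_url=None):
    """Naive recommender: return every URL whose tags include all query tags.

    tagged_pages maps a URL to the set of tags users gave it (hypothetical data).
    """
    query = set(query_tags)
    return sorted(
        url
        for url, tags in tagged_pages.items()
        if url != exclude_url and query <= set(tags)  # query is a subset of tags
    )


# Invented pages and tags, for illustration only.
pages = {
    "http://example.com/parsing-paper": {"nl", "parsing"},
    "http://example.com/amsterdam-guide": {"nl", "travel"},
    "http://example.com/parser-tutorial": {"natural-language", "grammar"},
    "http://example.com/apple-alchemy": {"apples", "alchemy"},
    "http://example.com/more-apple-alchemy": {"apples", "alchemy", "history"},
}

# The simple case works: another page tagged 'apples' and 'alchemy' is found.
simple_case = recommend(
    pages, {"apples", "alchemy"}, exclude_url="http://example.com/apple-alchemy"
)

# Problem 1 (ambiguity): searching on 'nl' alone pulls in a Netherlands page.
false_positive = recommend(
    pages, {"nl"}, exclude_url="http://example.com/parsing-paper"
)

# Problem 2 (taste): the parser tutorial is relevant but tagged differently,
# so matching on the paper's full tag set finds nothing at all.
false_negative = recommend(
    pages, {"nl", "parsing"}, exclude_url="http://example.com/parsing-paper"
)

print(simple_case)
print(false_positive)
print(false_negative)
```

Loosening the match (say, requiring only one shared tag) would trade the second problem for more of the first, which is roughly why tags alone are not enough.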
Thus, after some study and experiments, I determined that tags, while a valuable classifier of data, are "noisy" enough that they cannot be relied upon alone for useful recommendations.
Next: The URL classifier problem, and how it encounters the same difficulties.
After a long hiatus, I've added another new project to my projects section. Traces, which I worked on during late 1998 and 1999, was an art project that explored the ideas of presence in virtual spaces. In addition to helping to develop the ideas, I did the graphics programming for the CAVE 3D display. Feel free to download the code, but please let me know if you use it.
I've blocked out the style of "project" entries, and added two of my more recent pieces of research to the projects page. For now, the source code download links are dead -- I haven't uploaded the code to smokingrobot from my research machine yet. Also, I'm undecided whether I should give away my research code. I may want to return to some of these projects some day, and giving away the code kind of gives away any "advantage" I have over other researchers. Also, I don't want to get into the "I downloaded your code and it doesn't work. Help me!" hellpit that offering code for download opens.