A few months ago at FOO camp, I gave a small talk about machine learning that covered both a technical overview of some algorithms (Bayesian learning, SVM, neural nets and so forth) and some practical applications to text-based data. After this talk, I became interested in the idea of analyzing the del.icio.us data with these same machine learning tools. Specifically, I wanted to see if I could implement a good "URL classifier" or "URL recommendation" engine.
A "URL classifier" is some process or algorithm that, given a new URL, generates a reasonable set of tags describing it. Similarly, a "URL recommender" recommends a URL that is similar to one you've already tagged.
Although the two tasks are, in some sense, sides of the same coin, I decided to tackle the second task first because it seemed easier. "If we tag page A with 'apples' and 'alchemy'," the naive reasoning goes, "couldn't we just offer up another page that someone else has tagged with 'apples' and 'alchemy' as well?" This works for many simple cases, but runs into two problems:
1. Ambiguity of tags. The first example of ambiguity I ran into was during my own searches for papers on natural language parsing. Natural language is frequently abbreviated, and thus tagged, as "nl". Unfortunately, so are many pages about the Netherlands. So, the lesson here is that the meaning of a specific tag depends on the meaning of the page it is tagging (obviously), and on the meaning of the tags around it.
2. Different taste in tags. In short, it is possible (even likely) that two different users will tag the same page with two completely different sets of tags.
To put it very simply, the first problem will result in false positives -- a wrong page being presented as a correct one -- and the second problem will result in false negatives -- no page being returned when there might in fact be excellent matches out there.
Thus, after some study and experiments, I determined that tags, while a valuable classifier of data, are "noisy" enough that they cannot be relied upon alone for useful recommendations.
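To make the naive reasoning concrete, here is a minimal Python sketch of it. Everything in it is made up for illustration: the example.org URLs, the tags, and the recommend function are my own assumptions, not real del.icio.us data or any del.icio.us API. It simply offers up every other page whose tags include all of the tags you ask for.

```python
# Hypothetical toy data: each URL maps to the set of tags users gave it.
# These URLs and tags are invented for illustration only.
BOOKMARKS = {
    "example.org/parsing-paper": {"nl", "parsing", "papers"},
    "example.org/amsterdam-guide": {"nl", "travel"},
    "example.org/parser-survey": {"nlp", "syntax", "research"},
}


def recommend(query_tags, bookmarks, exclude=None):
    """Naive recommender: return every URL tagged with all of query_tags.

    `exclude` is the page the user already has, so it is not offered back.
    """
    return sorted(
        url
        for url, tags in bookmarks.items()
        if url != exclude and query_tags <= tags  # subset test on tag sets
    )


already_tagged = "example.org/parsing-paper"

# Problem 1 (ambiguity): "nl" also matches a page about the Netherlands.
false_positive = recommend({"nl"}, BOOKMARKS, exclude=already_tagged)

# Problem 2 (taste): the parser survey is relevant, but its tagger chose
# entirely different words, so an exact tag match finds nothing.
false_negative = recommend(
    {"nl", "parsing", "papers"}, BOOKMARKS, exclude=already_tagged
)

print(false_positive)
print(false_negative)
```

In this toy data the first query returns only the Amsterdam guide, and the second returns an empty list even though the parser survey would probably be a good match. A real system would likely need something softer than exact set matching, but the sketch is only meant to show where the simple approach breaks.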
Next: The URL classifier problem, and how it encounters the same difficulties.