<?xml version="1.0" encoding="iso-8859-1"?>
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:content="http://purl.org/rss/1.0/modules/content/"
 xmlns="http://purl.org/rss/1.0/">

<channel rdf:about="http://www.smokingrobot.com/news/index.rdf">
	<title>Robot Uprising</title>
	<link>http://www.smokingrobot.com/news</link>
	<description>Robots, Canadians, Art</description>
	<dc:language>en-us</dc:language>
	<dc:creator>Jeff Smith</dc:creator>
</channel>

<item rdf:about="http://www.smokingrobot.com/news/archives/permalinks/2007-06-16T10_29_01.html">
	<link>http://www.smokingrobot.com/news/archives/permalinks/2007-06-16T10_29_01.html</link>
	<title>Naive Bayes Classification</title>
	<dc:creator>Jeff</dc:creator>
	<description><![CDATA[<p>I've put together a quick summary of what I've been playing with for the
past few weeks.  I have the set of all URLs (and their tags)
bookmarked at delicious as of 22 January 2004. Before FOO camp, I
downloaded the text from these urls (not following any links and not
downloading images or flash), removed the HTML tags and cleaned the
resulting text a bit (removing non-"typewriter" characters).  This
resulted in about 17000 different texts, varying in length from 100
bytes (my arbitrary lower limit) to 1 megabyte (a significant
outlier).
</p><p>
I then took a median-length subset of these texts (about 2000) and
cleaned their tags.  This involved some semiautomatic stemming
("blogs" became "blog", etc) as well as some spelling corrections.  I
set aside 500 of these texts as "test" data and trained a Naive Bayes
classifier on the rest (which took about 5 minutes).
</p><p>
The resulting classifier could assign a correct tag to the previously
unseen testing texts about 80% of the time.  For example, a given URL
might have the three tags: "blog linux technology".  The NB classifier
would, 80% of the time, tag this URL "blog", "linux" or "technology."
(Because of the data sparsity, as well as the numerical behavior of
the probability calculations, the NB classifier isn't really capable
of giving multiple, accurate tags in all cases.)
</p><p>
When I returned from FOO camp, I decided to reimplement the NB
classifier, and re-run it on the entire corpus of 17000 web pages.
Unfortunately, the much larger size of this corpus meant that I
couldn't clean the tags and texts as thoroughly as before, and the
results have been slightly less encouraging: about 70% accuracy.
Interestingly, about 80% of the misclassifications were pages being
mis-classified as "blogs," which probably means that "blog" is an
overly broad tag.
</p><p>
The next step for me is probably to either tweak the Naive Bayes
classifier, to see if I can coax better performance out of it, or
perhaps to try a different algorithm.  I implemented a support vector
machine classifier a while ago; I may dust it off and give that
technique a shot.
</p><p>
There are some aspects of this data set that make it both interesting
and difficult to deal with.
</p><p>
a) It is multilingual (and unlabelled).  There are quite a few pages
   in Spanish, Italian and German that I have grabbed, and these
   add pure noise.  Not only are many of the words in these documents
   unique (compared to the English portion of the corpus), but we lose
   the efficiency boost of stopwords.  A friend of mine (another FOO
   camper, Maciej Ceglowski) has written a language identification
   Perl module. If I can use that to "weed out" non-English pages,
   that would be a help.
</p><p>
b) The tags are unstructured.  It is common to see tags like
   "linux/hardware" or "design+awful", which implies a hierarchy
   or structure that other users do not use.
</p><p>
c) The taggers are heterogeneous and untrained.  Most text classification
   depends on categories that are stable, orthogonal and uniformly
   applied.  However, delicious users tag urls inconsistently and
   subjectively.  For example, what one delicious user might tag as
   "funny" is not necessarily what someone else would.  Or, in a better
   example, one user may tag a URL as "nl," short for "neurolinguistics"
   while someone else may use that tag to mean "the Netherlands."
</p><p>
However, some of these bugs are features. For example, a given URL
might be bookmarked (and tagged) by multiple users, and we should be
able to leverage this information.  For example, if "boingboing.com"
has ten tags total and nine of them are the tag "blog" and one of them
is "silly", then boingboing is clearly "blog-ier" than it is "silly."
Currently, I only consider unique tags for classification.</p>]]></description>
</item>
<item rdf:about="http://www.smokingrobot.com/news/archives/permalinks/2007-02-11T12_24_53.html">
	<link>http://www.smokingrobot.com/news/archives/permalinks/2007-02-11T12_24_53.html</link>
	<title>Nobody cares</title>
	<dc:creator>Jeff</dc:creator>
	<description><![CDATA[<p>Over the past few years, blogs have become the biggest fad on the
internet since porn.  And like its slutty older sister, blogging has
hit the mainstream big-time and has wormed its way into the lives of
people beyond the circle of sweaty, lonely social outcasts who
sustained its humble youth.
</p><p>
At its core, blogging is merely a public diary.  What makes it "au
courant" (French for "fucking stupid") is that it is the latest and
most expensive way yet invented to flap your piehole at millions of
anonymous rubes.  The barrier to entry, as they [<a href="#1">1</a>]
say, has
been lowered to the ground and the great yawping masses have at last
embraced the webpage and all it represents.  The full flower of
individual opinion, stated loudly and ubiquitously, has bloomed in all
its shrill and stupid majesty.
</p><p>
No longer must you eschew hobbies and relationships and stand on a
street corner yelling at passers-by, leaflets flapping in the wind, in
order to get your message across to the masses.  Instead, you can now
sit comfortably in the living room, eating ice-cream from a bucket and
tell the world exactly how Apple's latest pricing structure has
affected your sex-life.
</p><p>
The anal bleaching of hobbies, blogging is both life-wreckingly
narcissistic and breath-takingly painful to watch.  Like their authors,
blogs offer nothing new to the world. No observation is too trite, no
angry denunciation too overwrought and empty to avoid being slathered
across a livejournal page.  Fitting its fatuous lineage of
unicorn-sticker-covered diaries and spiral notebooks covered with
crudely-drawn devil skulls, the average blog contains little more than
a generic set of observations about whatever suburban "scene" the
author haunts, mentions of the inevitable cat (perhaps dressed as a
pirate for Halloween, or impishly photographed with a $700 digital
camera in a hi-larious pose) and a few limp sentences about whatever
band played last weekend.
</p><p>
If I sound like I'm generalizing and glossing over the varied and rich
splendor of blogs, forgive me [<a href="#2">2</a>].
But in reality, blogs
have all the rich diversity and individuality of a crate of Jello
pudding cups.  The same dozen topics and links circle endlessly among
the listless blog authors/audience like flecks of rot in a slowly
draining sink.
</p><p>
You're a fat goth who was bad-touched by your uncle and you cut
yourself?  Congratulations: there are millions of slack-jawed
butterwhales in whiteface just like you, galumphing around our
benighted nation's stripmalls. Delete your xanga account and shut up.
</p><p>
Or perhaps you're a gamer with a sense of humor and a willingness to
tell it like it is?  Cram it, virgin.  Any observation you make about
the Xbox 360 has been well expressed already by the many thousands of
game publications available anywhere porn, cigarettes and Taschen
books are sold.
</p><p>
Worse yet are political blogs: the online equivalent of that
irritating guy at the local bar who won't shut up about Clinton.  Like
those ranting, urine-scented wrecks, political bloggers appear to
believe that shuffling around second- and third-hand opinions is a
fine substitute for expertise.  Excuse me, madam blogger, but you
haven't changed out of your pajamas in three days. You are not a
beltway insider, and don't have a fucking clue about how the real
world operates. (Hint: people in power don't have 60th level Rogues in
World of Warcraft).
</p><p>
If it's not the diversity of opinion that makes blogs "interesting" to
the media, perhaps it's the independent, DIY-aspect of blogging?
Wrong.  Bloggers proudly claim that they are "breakin' the rules" of
so-called "dinosaur media".  As if the pomposity and
college-freshman-level arrogance of this claim weren't awful enough,
the truth of the matter is that they have it all backward.  Far from
breaking any hidebound rules, bloggers have merely adopted all the
worst aspects of journalism and writing and simultaneously eschewed
the traits that make these professions valuable.  Personal anecdotes
passed off as profound insights?  Check. Ads?  Check.  Ads for porn?
Check, check and CHECK.  A degree from a journalism school?  Uh, the
Starbucks manager is kicking me out of the chair near the outlet, I'll
have to get back to you.
</p><p>
Despite their lack of qualifications, original opinions or, indeed,
even basic literacy [<a href="#3">3</a>], many
bloggers like to pontificate about how blogs, and by extension
themselves, pose some sort of threat to traditional media; newspapers,
television studios and so forth.  Bullshit. The only threat that blogs
and blogging pose is the danger that they will actually convince
someone in "old media" of their claims.  This is a threat similar to
the one that AOL posed to Time-Warner (or, more accurately, its
stockholders) in 2000.
</p><p>
Journalists and authors, i.e. people who are trained and paid for
writing, got where they are today not because of some secret handshake
or exclusive club that only people in their 40s can join.  No, wait!
Actually, they DID.  But that club is called "having skills and paying
your dues", neither of which will happen while you sit in cafes behind
your sticker-bedecked powerbook.  Getting stoned and blowing a Java
programmer at Burning Man is not the same as going to journalism
school, and writing 1200 words about the latest "wacky" Japanese
cultural trend is not the same as getting a real job.
</p><p>
Finally, and most recently, we have the resurgence of "social
networking" sites; simulacra of real life for people who can't borrow
the keys to mom's leased Escalade. First Friendster (whose pathetic,
incompetent crashing and burning should be an object lesson to every
Vox, Blurt, Jookster and Gazzag[<a href="#4">4</a>] out
there), then Orkut, and now MySpace. I have nothing against people
using new communication media for making friends and meeting
like-minded souls.  If only it stopped there.  But no, the blank page
of one's MySpace profile just begs to be filled with incoherent prose,
like a clean field of snow beckons to the dog with a full bladder.
</p><p>
The blight of MySpace cannot be overstated, not only aesthetically
(who decided that we missed embedded MIDI?)  but also culturally.
Endless browser-crashing pages of 20-something whores in training,
white "thug" poseurs and horrible, horrible bands is what MySpace and
its ilk offer and, sadly, corporations seem to be buying.  My only
consolation is that all the professed love for Insane Clown Posse, all
the braggadocio about drug use and fumbling, unsatisfying sex, all the
posing, posturing and humiliating self-aggrandizement of adolescence
that we see is being cached and archived forever by the tireless
robots of Google, and will come back to haunt the simpletons who
offered to share it with the world in the first place.  The last laugh
will be a bitter one, but it will be ours.
</p><p>
So what's my point here? My "conclusion", if you can call it that, is
the precise phrase that will be directed at me by the readers of this
frothy rant: "shut the hell up."  Sounds like good advice to me.
</p>]]></description>
</item>
<item rdf:about="http://www.smokingrobot.com/news/archives/permalinks/2004-11-13T13_19_42.html">
	<link>http://www.smokingrobot.com/news/archives/permalinks/2004-11-13T13_19_42.html</link>
	<title>Delicious Tags</title>
	<dc:creator>Jeff</dc:creator>
	<description><![CDATA[<p>A few months ago at FOO camp, I gave a small talk about machine
learning, both a technical overview of some algorithms (Bayesian
learning, SVM, neural nets and so forth) as well as some practical
applications on text-based data.  After this talk, I became interested
in the idea of analyzing the del.icio.us data with these same machine
learning tools. Specifically, I wanted to see if I could implement a
good "URL classifier" or "URL recommendation" engine.
</p><p>
A "URL classifier" is some process or algorithm that, given a
new URL, generates a reasonable set of tags describing
it.  Similarly, a "URL recommender" recommends a url that is similar to
one you've already tagged.
</p><p>
Although the two tasks are, in some sense, sides of the same coin,
I decided to tackle the second task first because it seemed easier.
"If we tag page A with 'apples' and 'alchemy'," the naive reasoning
goes, "couldn't we just offer up another page that someone else has
tagged with 'apples' and 'alchemy' as well?"  This works for many
simple cases, but runs into two problems:
</p><p>
1. Ambiguity of tags.  The first example of ambiguity I ran into
   was during my own searches for papers on natural language 
   parsing.  Natural language is frequently abbreviated, and
   thus tagged, as "nl."  Unfortunately, so are many pages about
   the Netherlands.  So, the lesson here is that the meaning of
   a specific tag depends on the meaning of the page it is tagging
   (obviously), and the meaning of the tags around it.
</p><p>
2. Different taste in tags.  In short, it is possible (even
   likely) that two different users will tag the same page with
   two completely different sets of tags. 
</p><p>
To be incredibly simplistic, the first problem will result in false
positives -- a wrong page being presented as a correct one -- and the
second problem will result in false negatives -- no page being
returned when there might in fact be excellent matches out there.
</p><p>
Thus, after some study and experiments I determined that tags,
while a valuable classifier of data, are "noisy" enough that they
cannot be solely depended upon for useful recommendation. 
</p><p>
Next: The URL classifier problem, and how it encounters the same
 difficulties.
 </p>]]></description>
</item>
<item rdf:about="http://www.smokingrobot.com/news/archives/permalinks/2004-09-21T15_01_37.html">
	<link>http://www.smokingrobot.com/news/archives/permalinks/2004-09-21T15_01_37.html</link>
	<title>Tangled up in FOO</title>
	<dc:creator>Jeff Smith</dc:creator>
	<description><![CDATA[<p>I went to
<a href="http://wiki.oreillynet.com/foocamp04/index.cgi">FOO camp</a>
two weekends ago, and ended up having a pretty good time.  I was
worried, at first, that my interests and skills (computer graphics,
numerical optimization, and to a less accomplished degree, robotic/new
media art) would be irrelevant in the face of the various blog people
and "future of SOAP" technical discussions.  However, the other
attendees were by and large interesting people with wide-ranging
minds, and we found a lot of common ground.
</p><p>
Although the people there were sharp and fun to talk to, the
conference sessions themselves ended up being a bit weak.  
They suffered from a certain lack of organization, not only in timing
and place, but also in focus.  About a third of the talks I went to
started out interesting, but kind of wandered off into irrelevant
territory.  The best example of this was the "Future of Email" talk,
which started out with some interesting input on "email as a message"
vs. "email as instigating a validated channel of communication", and
degenerated into a room of people listlessly watching some guy in a
cloak doing nslookups on his laptop.
However, I think disorganization is the price you pay for an ad hoc
conference and, as disorganization goes, FOO wasn't bad at all.
</p><p>
Not to end on a negative note, I had some very good times
chatting with <a href="http://www.mit.edu/people/robot/">Tim Anderson</a>,
who co-founded
<a href="http://www.zcorp.com/">Z Corporation</a>, a 3D printer company,
MIT media lab art freaks <a href="http://overstated.net/">Cameron Marlow</a> 
and <a href="http://www.eyebeam.org/">Jonah Peretti</a>, as
well as my friends
<a href="http://www.burri.to:8080/~joshua/">Joshua Schachter</a>
and
<a href="http://www.idlewords.com/">Maciej Ceglowski</a>.</p>]]></description>
</item>
<item rdf:about="http://www.smokingrobot.com/news/archives/permalinks/2004-08-06T19_42_17.html">
	<link>http://www.smokingrobot.com/news/archives/permalinks/2004-08-06T19_42_17.html</link>
	<title>Another Old Project</title>
	<dc:creator>Jeff Smith</dc:creator>
	<description><![CDATA[<p>After a long hiatus, I've added another new project
to my
<a href="http://www.smokingrobot.com/projects/index.html">projects</a>
section.  Traces, which I worked on during late 1998 and 1999, was an
art project that explored the ideas of presence in virtual spaces.
In addition to helping to develop the ideas, I did the graphics 
programming for the CAVE 3D display.  Feel free to download the code,
but please let me know if you use it.</p>]]></description>
</item>
<item rdf:about="http://www.smokingrobot.com/news/archives/permalinks/2004-03-06T13_36_25.html">
	<link>http://www.smokingrobot.com/news/archives/permalinks/2004-03-06T13_36_25.html</link>
	<title>Two old projects added</title>
	<dc:creator>Jeff Smith</dc:creator>
	<description><![CDATA[<p>I've blocked out the style of "project" entries, and added two of my
more recent pieces of research to the
<a href="http://www.smokingrobot.com/projects/index.html">projects page</a>.
For now, the source code download links are dead -- I haven't uploaded
the code to smokingrobot from my research machine yet. Also, I'm
undecided whether I should give away my research code. I may want to
return to some of these projects some day, and giving away the code
kind of gives away any "advantage" I have over other
researchers. Also, I don't want to get into the "I downloaded your
code and it doesn't work. Help me!"  hellpit that offering code for
download opens.</p>]]></description>
</item>
<item rdf:about="http://www.smokingrobot.com/news/archives/permalinks/2004-03-06T13_33_46.html">
	<link>http://www.smokingrobot.com/news/archives/permalinks/2004-03-06T13_33_46.html</link>
	<title>Testing nanoblogger</title>
	<dc:creator>Jeff Smith</dc:creator>
	<description><![CDATA[<p>I've been having a lot of trouble with nanoblogger's categories, so
I erased all my entries and will try again from scratch.</p>]]></description>
</item>

</rdf:RDF>
