Sunday, January 27, 2008

Danger: deadly hobbies!

I'm not familiar with the American blogosphere (I hope blogging in English will help discovering it), but there is a blog there I often visit, xkcd, full of witty or funny comics... for a somewhat restricted audience (I mean as geek as their author, Randall Munroe, who also created some other pretty nice stuff).

I especially enjoyed one of his latest drawings which uses Google result numbers, as I've already done for spelling, congressmen celebrity, or the birthdate of the web :
This picture created a slashdotted Google Bomb as the number of Google answers for "died in a blogging accident" exploded. Of course lots of bloggers felt very concerned and cited the picture while adding results of their own Google searches on the same principle. That website, and the xkcd forum show numerous attempts to find unusual dangerous activities.

But couldn't we just submit Google a list of all English verbs, and let it tell us which one creates most deadly accidents? Of course, here comes the method I used, then the results.

Step 1, retrieving a list of all English verbs. Quite painful, as you can see in these 404-ridden Google Answers, or those 5 pages of outdated or useless answers in a forum... I decided to trust my favorite search engine, and sent it a list of all verbs that went through my mind. Too bad, it replied with complete dictionaries, so I had to forbid some noun, hat, and eventually, on page 3 for -hat strike give abandon wipe rub search seek hang eat adjust draw conclude reappear reconsolidate create destroy dream cut put drive, I got to a page of the VerbNet project with more than 3500 files named from verbs. If you have better, just give your link in the comments!

Step 2, generating the present participles. Verb+ing ? Yeah, but not exactly, I'm quite proud of the following spreadsheet formula which generates almost always the correct form (to avoid making mistakes of course I split it into many cells, but it's juste so impressive to read it entirely) :

Ok, right, a little explanation. If the last letter is an "e" then:
  • if the letter before is an "i", I transform into "ying" (die -> dying)
  • otherwise, I delete the "e", and add "ing" (love -> loving)
  • if the verb ends with "en", I just add "ing" (sharpen -> sharpening)
  • otherwise, if the next-to-last letter is a "d", "g", "m", "n", "p", "t", I double it if there is a vowel just before, which is not preceded by a vowel (bid -> bidding, put -> putting, but claim -> claiming, feed -> feeding)
  • otherwise I just add "ing" (speak -> speaking)
I've created those rules intuitively, apparently to double the final consonant one has to check whether the last syllabus is stressed or not, my version has a tiny number of exceptions, I just identified verbs ending with "on" (abandon -> abandonning, d'oh, even if con -> conning is correct).

Step 3, put around each participle "died in a on the left (or "died in an if the verb starts with a vowel) and accident" on the right, and send each of those expressions to Google, using my tool (in French, but it's not as if it wasn't super-intuitive) FuryPopularity. I've just updated the program, because Google changed the style of its results, and apparently its spam detection is tougher: after 200 requests separated by 5 second intervals, it just blacklisted me, I could search back only after a captcha. Apparently 10 second intervals are ok. If you know anything about their detection algorithm I'm very interested: is it just about the frequency (if it is, do they have to identify proxys?) ? Do they carefully check the period?

Here is the tagcloud of the words which happened to get more than one result:
If you check the words which do not appear frequently, you unfortunately do not always find contestants for the Darwin Awards. First, some parasite links from reactions about the xkcd picture, or animal deaths, but also some more annoying things: participial adjectives (amusing, embarrassing, interesting...) and verbs which do not express an activity, rather circumstances (exploding, crushing, choking...). For the latter, I have no solution. But it's quite easy to remove the participial adjectives automatically. Of course you can do it with a syntactic parser, or even a dictionary but I prefer to go on with Google result numbers.

I made a few tries before finding a nice criterion. Comparing the frequency of the participle form with the infinitive form (hoping it will be greater for participial adjectives) or computing the occurrence percentages of the participle just after "a", "more", or "most". On the graph on the left, the first 5 verbs give participial adjectives. We can see that the "a ..." strategy fails, because of the inclusion of participles into nouns: "a frying pan" explains why "a frying" is so frequent. Anyway "most ..." seems to help making the distinction:

Once those participial adjectives have been filtered, one can count not only the number of "died of a ... accident", but also "a ... accident", as well as the number of answers for the participle itself to get things like accident rates (blue) and death rates (red) :If your hobby is not in the list, at least you have a basis to compare it. If it is, well, be careful, especially if you plan on jousting next weekend!

This post was originally published in French: Danger : accidents mortels !
As usual, the source files: list of more than 3000 English verbs and their computed present participle, testing Google detection of participial adjectives, results of Google requests.

Thursday, January 17, 2008

Britney-Amy : Celebrity Deathmatch!

Discovered on French television last week, and provide an interesting challenge: guess when the two divas will die, the closest wins an iPod Touch, or a PS3. Huge buzz, of course, thousands of people went there to take a chance and leave a pre-condolences message. Both sites are of course optimized to make money with ads (contrary to this more confidential game, which is just as sweet though: the "TopMort", where you can pick up people you think will die within the year, and who have not been ), and only provide raw data entered by people who signed. No stats at all, what a shame.

I was very lucky that Matthieu Muffato, a friend who happens to be an impressive Python expert, used a few code lines and some execution hours to retrieve the data and mail it to me.

The initial question I had about it was simple: what is the biggest time interval not yet chosen, which would a priori maximize the chance to win? By "a priori", I mean considering any time interval of some fixed length is uniformly dangerous for Amy and Britney, and uniformly chosen by other visitors.

Unfortunately, those ideal conditions are far from being true in the real world, for a very simple reason: the visitor wants his iPod or PS3 right now, not in 30 years! So if you wish to target a month that has not yet been chosen, for Britney, you will have to wait for February 2023. For Amy, there has been less voters yet, so if nothing has changed since data was retrieved, november 2016 is still available, or you can try year 2031 as only october was chosen then. I must add, as Matthieu told me, that those websites contain no date after January 2038, probably because of some date coding problem. Now let's move on to more serious stuff, here is an overview of the number of votes per month (with a simple vertical normalization for Amy who received less votes, sorry for the title in French...) :
I guess you are as flabbergasted as I was when the curves appeared: they are almost identical! Correlation coefficient equals 0.98, we get the same power law! We can check that it is indeed a power law using a log-log dotplot, which also gives us approximately the equation Y=4-3X, in logarithmic coordinates, that is when we go back to linear: y = 10 000 - x^(4/3), which is the equation of the pale blue curve.

In fact power laws are everywhere in real data, (especially in small-world graphs which have a power law degree distribution). What is surprising here is that both laws have approximately the same parameters. If we check the details we can notice however that voters prefered 2008 for Britney and 2009 for Amy.

By checking the curves carefuly, one also notices some kind of periodicity. At least they are not monotone, and I've put on the left a representation of the percentage of votes per month, each year from 2008 to 2013, for Miss Winehouse. Variations are quite strange, as august attracts twice as much voters as november! I don't have any explanation for those smaller choices of Novembre, December and February, it may be a mechanism similar to what Knuth describes in one of the first exercises of Volume 2: ask a friend (or an enemy) a random digit, he will more probably say 7.

Here is the representation of the choices per day, for any year. I've removed January 1 which was artificially big (due to the year coding problem, which gave a lot of 01/01/1970).
We can observe a new surprising periodicity phenomenon: voters prefer the middle of the month. Note also the vicious voters who chose February 14, poor Britney! Even the dot of her birthday, December 2, is quite high compared to its neighbors...

So to forget about those sad things, let's end with emotion and poetry, here are the pre-condolences tag clouds (made with Freecorp TagCloud Builder) for both stars.

This post originally appeared in French: Britney-Amy, duel mortel.

Vote spreadhseet file by day, by month, contact me if you would like to get other source files.

Wednesday, January 16, 2008

What does veronising mean?

Well, to get some idea of what veronising is, maybe you should check Jean Veronis's blog. My definition would be "to design and publish on a blog programs or methods able to help analyzing data". Jean has created a whole bunch of useful tools, which work mainly on texts (he is a researcher in natural language processing) or internet corpuses (search engines results for example). Among the most impressive, the Nébuloscope, which makes tag clouds out of words appearing frequently in the results of a search engine request, or the Chronologue, which used to draw the evolution of a keyword use on the internet (it used the "date" function of a search engin which has now disappeared).

Inspired by his impressive results, I've started to analyze data I find interesting myself, and program some little tools to help me do that. I may translate some of my previous posts, here are some topics I've worked on, I put the links to French posts until they are translated to English.

Phylogenetic trees are used to represent the evolution of species, based on the idea that some species close to each other will appear in a same subtree, and a lot of algorithms exist to build them from biology data. But phylogenetic trees are also an excellent mean of visualizing data, and I've tried building the trees of country votes at the Eurovision song contest, French "députés" (our congressmen) according to their proximity of votes (as well as a DNA chip visualization of those votes), and more recently I've been working on building what I call a "tree cloud" from a text, the same idea than a tag cloud except the order of the words is not alphabetical, but they are displayed as leaves of a tree. Until the program is finished, I still rely on tag clouds (with nice colors and a logarithmic scale, pleaaase, not those ugly and unexpressive ones we often find on the internet !). I've tried using them to analyze one's writing style (with instant messaging logs) or speaking style (with the planned version and the pronounced version of a press conference talk by President Sarkozy).
I like doing some search engine statistics, to help spelling, visualize and date the birth of the web, or send massive requests to compare popularity of people or concepts. Those stats analyzes often make critical use of spreadsheet programs, which also helped me to track the evolution of a petition, which gave me a glance on the time of the day people connect to the internet depending on their job (students, teachers, engineers...). I could also get nice synthesis pictures of French polls before the first round of the presidential election, in 2002 and 2007. I'm very interested in informative and original visualizations, like Voronoi diagrams (for McDonald's restaurants in Paris) or metro map views (building them from a genuine metro map is a GI-complete problem).

I have also analyzed a blog meme last year, the "Z-list", which in France appeared as "la F-list". Even if I did not publish my data on the "Z-list", I still have the files, as well as the "infection tree", on my computer somewhere. This year I've created a little utility, the "CaptuCourbe", to put data from the picture of a curve into a spreadsheet file (some "unscan" programs do this but they are quite complicated to use, or expensive), which helps comparing the evolution of a buzz on many buzz tracking systems (Google Trends, Technorati, site stats systems...). Currently the program is in French only, but Jean motivated me to translate it to English, which will soon be done.

And you will never guess the topic of my most visited blog post, which I'm not the most proud of: I had noticed a bug on some French TV channel website which gave access to the channel live on the internet. It lasted about 3 days, but since then Google sends me all people who want to watch "M6" on the web. I've put links to other French channels which can be viewed free anyway, to avoid frustration.

See you soon for some new computer-powered experimentations!