Sunday, January 27, 2008

Danger: deadly hobbies!

I'm not familiar with the American blogosphere (I hope blogging in English will help discovering it), but there is a blog there I often visit, xkcd, full of witty or funny comics... for a somewhat restricted audience (I mean as geek as their author, Randall Munroe, who also created some other pretty nice stuff).

I especially enjoyed one of his latest drawings which uses Google result numbers, as I've already done for spelling, congressmen celebrity, or the birthdate of the web :
This picture created a slashdotted Google Bomb as the number of Google answers for "died in a blogging accident" exploded. Of course lots of bloggers felt very concerned and cited the picture while adding results of their own Google searches on the same principle. That website, and the xkcd forum show numerous attempts to find unusual dangerous activities.

But couldn't we just submit Google a list of all English verbs, and let it tell us which one creates most deadly accidents? Of course, here comes the method I used, then the results.

Step 1, retrieving a list of all English verbs. Quite painful, as you can see in these 404-ridden Google Answers, or those 5 pages of outdated or useless answers in a forum... I decided to trust my favorite search engine, and sent it a list of all verbs that went through my mind. Too bad, it replied with complete dictionaries, so I had to forbid some noun, hat, and eventually, on page 3 for -hat strike give abandon wipe rub search seek hang eat adjust draw conclude reappear reconsolidate create destroy dream cut put drive, I got to a page of the VerbNet project with more than 3500 files named from verbs. If you have better, just give your link in the comments!

Step 2, generating the present participles. Verb+ing ? Yeah, but not exactly, I'm quite proud of the following spreadsheet formula which generates almost always the correct form (to avoid making mistakes of course I split it into many cells, but it's juste so impressive to read it entirely) :
B1=IF(RIGHT(A1;1)="e";=IF(LEFT(RIGHT(A1;2);1)="i";CONCATENATE(LEFT(A1;LEN(A1)-2);"ying");CONCATENATE(LEFT(A1;LEN(A1)-1);"ing"));=IF(OR(RIGHT(A1;1)="d";RIGHT(A1;1)="g";RIGHT(A1;1)="m";RIGHT(A1;1)="n";RIGHT(A1;1)="p";RIGHT(A1;1)="t");=IF(OR(LEFT(RIGHT(A1;2);1)="a";LEFT(RIGHT(A1;2);1)="e";LEFT(RIGHT(A1;2);1)="i";LEFT(RIGHT(A1;2);1)="o";LEFT(RIGHT(A1;2);1)="u");=IF(OR(LEFT(RIGHT(A1;3);1)="a";LEFT(RIGHT(A1;3);1)="e";LEFT(RIGHT(A1;3);1)="i";LEFT(RIGHT(A1;3);1)="o";LEFT(RIGHT(A1;3);1)="u";AND(LEFT(RIGHT(A1;2);1)="e";RIGHT(A1;1)="n"));CONCATENATE(A1;"ing");CONCATENATE(A1;RIGHT(A1;1);"ing"));CONCATENATE(A1;"ing"));CONCATENATE(A1;"ing")))

Ok, right, a little explanation. If the last letter is an "e" then:
  • if the letter before is an "i", I transform into "ying" (die -> dying)
  • otherwise, I delete the "e", and add "ing" (love -> loving)
otherwise:
  • if the verb ends with "en", I just add "ing" (sharpen -> sharpening)
  • otherwise, if the next-to-last letter is a "d", "g", "m", "n", "p", "t", I double it if there is a vowel just before, which is not preceded by a vowel (bid -> bidding, put -> putting, but claim -> claiming, feed -> feeding)
  • otherwise I just add "ing" (speak -> speaking)
I've created those rules intuitively, apparently to double the final consonant one has to check whether the last syllabus is stressed or not, my version has a tiny number of exceptions, I just identified verbs ending with "on" (abandon -> abandonning, d'oh, even if con -> conning is correct).

Step 3, put around each participle "died in a on the left (or "died in an if the verb starts with a vowel) and accident" on the right, and send each of those expressions to Google, using my tool (in French, but it's not as if it wasn't super-intuitive) FuryPopularity. I've just updated the program, because Google changed the style of its results, and apparently its spam detection is tougher: after 200 requests separated by 5 second intervals, it just blacklisted me, I could search back only after a captcha. Apparently 10 second intervals are ok. If you know anything about their detection algorithm I'm very interested: is it just about the frequency (if it is, do they have to identify proxys?) ? Do they carefully check the period?

Here is the tagcloud of the words which happened to get more than one result:
If you check the words which do not appear frequently, you unfortunately do not always find contestants for the Darwin Awards. First, some parasite links from reactions about the xkcd picture, or animal deaths, but also some more annoying things: participial adjectives (amusing, embarrassing, interesting...) and verbs which do not express an activity, rather circumstances (exploding, crushing, choking...). For the latter, I have no solution. But it's quite easy to remove the participial adjectives automatically. Of course you can do it with a syntactic parser, or even a dictionary but I prefer to go on with Google result numbers.

I made a few tries before finding a nice criterion. Comparing the frequency of the participle form with the infinitive form (hoping it will be greater for participial adjectives) or computing the occurrence percentages of the participle just after "a", "more", or "most". On the graph on the left, the first 5 verbs give participial adjectives. We can see that the "a ..." strategy fails, because of the inclusion of participles into nouns: "a frying pan" explains why "a frying" is so frequent. Anyway "most ..." seems to help making the distinction:

Once those participial adjectives have been filtered, one can count not only the number of "died of a ... accident", but also "a ... accident", as well as the number of answers for the participle itself to get things like accident rates (blue) and death rates (red) :If your hobby is not in the list, at least you have a basis to compare it. If it is, well, be careful, especially if you plan on jousting next weekend!


This post was originally published in French: Danger : accidents mortels !
As usual, the source files: list of more than 3000 English verbs and their computed present participle, testing Google detection of participial adjectives, results of Google requests.

3 comments:

MrCopilot said...

Excellent Work.

knowsit said...

You are a prick for doing this. The fun of trying to find a phrase that nobody else has used is gone. Your stupid document appears as the only link for thousands of different combinations.

Vous êtes un piquer pour le faire. Le plaisir d'essayer de trouver une phrase que personne d'autre n'a utilisé est révolue. Votre stupide document apparaît comme le seul lien pour des milliers de combinaisons différentes.

Philippe said...

Neil, your lack of imagination is so disappointing, what about verbs that I didn't try (invented recently for example), verbal expressions ("died in an ice skating accident"), adding an object or adjective to the verb ("died in a silly criticizing accident") or just using a noun instead of a verb ("died in a car accident")?

And I did think of those who still want to play (and have enough imagination) by citing this link in my post.