Veronising... Computer-powered experimentations<br />Philippe (http://www.blogger.com/profile/17811557333070553722)<br /><br />Xkcd in French<br />2008-12-13<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://xkcd.com/233/"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://www.lirmm.fr/%7Egambette/xkcd/static/examplefrench.png" alt="" border="0" /></a><a href="http://www.xkcd.com/">xkcd</a> is a great way to illustrate your lectures <span style="font-size:78%;">(or to <a href="http://veronising.blogspot.com/2008/01/danger-deadly-hobbies.html">spend some time on weekends</a>)</span> when you teach computer science. But not all French students understand English well enough. Well, sometimes I don't understand some of the subtle lines myself...<br /><br />So, I've prepared <a href="http://www.lirmm.fr/%7Egambette/xkcd">a web interface to translate xkcd comics into French</a>. The translation is added below the drawings (modifying the drawings themselves would be quite difficult), and the principles are:<br /><ul><li>everybody can submit a new translation, or a better one,</li><li>links can be inserted inside the translation, to explain references that would be obvious to Americans but not to French readers,<br /></li><li>moderators choose the best translation.</li></ul>The website is written in PHP/MySQL, with a structure close to that of my <a href="http://veronising.blogspot.com/2008/07/interactive-book-lisbon-by-pessoa.html">website on Pessoa's tourist guide for Lisbon</a>. 
It can easily be adapted to other languages <span style="font-size:78%;">(<a href="http://www.lirmm.fr/%7Egambette/PersoContact.php">contact me</a> if you're interested)</span>.<br /><br />So now it's your turn, <a href="http://www.lirmm.fr/%7Egambette/xkcd/"><span style="font-weight: bold;">translate the comics!</span></a> And don't forget to translate the title and the alt-text, which is sometimes the trickiest part!<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.lirmm.fr/%7Egambette/xkcd/"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 600px; height: 373px;" src="http://www.lirmm.fr/%7Egambette/xkcd/static/screenshot.png" alt="" border="0" /></a><br /><br />More details on this project <a href="http://gambette.blogspot.com/2008/12/xkcd-en-franais.html">here</a> (in French).<br /><br />Interactive Book: Lisbon by Pessoa<br />2008-07-11<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://lisbon.pessoa.free.fr/"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://lisbon.pessoa.free.fr/Lisbon.png" alt="" border="0" /></a>The link has been in the menu on the left for some time, but the project is now mature enough to be presented. In 1925, <a href="http://en.wikipedia.org/wiki/Fernando_Pessoa">Fernando Pessoa</a>, the famous Portuguese poet, wrote a tourist guide about the city he almost never left: Lisbon. A text with no poetic intention, written directly in English <span style="font-size:85%;">(the complete title is <span style="font-style: italic;">Lisbon, what the tourist should see</span>)</span>, to tell the world about the marvels of his beloved city. 
These marvels were well preserved through the 20th century and, apart from some renaming, most of the monuments cited and their descriptions have not changed since. The guide was therefore translated into many languages after being discovered at the end of the 90s in the author's <a href="http://www.disquiet.com/pessoa.html">manuscript "trunk"</a>.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.amazon.com/gp/product/190570075X?ie=UTF8&tag=lisbbypess-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=190570075X"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://lisbon.pessoa.free.fr/ENGLISH2008Mini.jpg" alt="" border="0" /></a>The text was first published in a bilingual English-Portuguese version <a href="http://www.livroshorizonte.pt/catalogo_detalhe.php?idLivro=895">by Livros Horizonte</a>. Unfortunately, this edition contains the original text without editor's notes or an index, and with only a map of the city in 1929, which is rather difficult to read. Therefore, it can hardly be used to find information while visiting Lisbon. However, the text was <a href="http://www.shearsman.com/pages/books/catalog/2008/pessoa_lisbon.html">republished this year in English by a British publisher, Shearsman</a>, with some more content. 
They updated the names of the places and persons into their modern style, and added some photos of the city from postcards from the 20s.<br /><br />To make this guide even more useful for the tourist, I've created an <a style="font-weight: bold;" href="http://lisbon.pessoa.free.fr/">interactive version of <span style="font-style: italic;">Lisbon, what the tourist should see</span></a><span style="font-weight: bold;">, with a Google map, and some photos</span> taken during a lovely week spent in the city, as well as some found on the <a href="http://pt.wikipedia.org/">Wikipedia</a> or <a href="http://www.flickr.com/">Flickr</a>.<br /><br />I scanned the <a href="http://www.livroshorizonte.pt/catalogo_detalhe.php?idLivro=895">Livros Horizonte version of the book</a> which I had just brought back from Lisbon, performed optical character recognition with <a href="http://www.clubic.com/telecharger-fiche9843-simpleocr.html">SimpleOcr</a> <span style="font-size:78%;">(not very reliable, but free...)</span>, to get the complete text which is now freely available <span style="font-size:78%;">(Pessoa has been dead <a href="http://en.wikipedia.org/wiki/Copyright#Duration">for more than 70 years</a>)</span> at:<br /><div style="text-align: center;"><a style="font-weight: bold;" href="http://lisbon.pessoa.free.fr/"><blockquote>http://lisbon.pessoa.free.fr<br /></blockquote></a></div><div style="text-align: left;"><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://lisbon.pessoa.free.fr/InteractiveMap.php"><img style="margin: 0pt 10pt 10px 0px; float: left; cursor: pointer;" src="http://lisbon.pessoa.free.fr/LisbonInteractiveMap.jpg" alt="" border="0" /></a>Places and streets cited in the guide were then localized on a map of Lisbon, to get a <span style="font-weight: bold;">geographic visualization of the book</span>, where Pessoa gives in fact three circuits <span style="font-size:85%;">- the first one, in blue, is quite long, the other 
two, green and red, are given for the tourist who "can stay one day more"</span><span style="font-size:85%;">.</span> The book ends with a description of the Portuguese newspapers of the time, then details of some villages in the area. The main (blue) itinerary, which starts from the sea, requires a car. In fact, as it is impossible to make all the visits in one day, it can be split into several parts that can be visited on foot or by public transportation. But be careful: in this case, follow the map instead of the order of the visits in the book, as the path described there is <a href="http://en.wikipedia.org/wiki/Travelling_salesman_problem">absolutely not a solution of the TSP</a>! This choice is not random either, as Pessoa distributed the most important visits (the <a href="http://lisbon.pessoa.free.fr/places?id=130">Baixa district</a>, the <a href="http://lisbon.pessoa.free.fr/places.php?id=129">Alfama</a>, the <span style="text-decoration: underline;">Castle of Saint George</span>, the <a href="http://lisbon.pessoa.free.fr/places.php?id=112">Hieronymites Monastery</a>, the<span style="text-decoration: underline;"> </span><a href="http://lisbon.pessoa.free.fr/places?id=3">Tower of Belém</a>, etc.) uniformly along his text.<br /></div><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://lisbon.pessoa.free.fr/PrinterFriendlyMap.php"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://lisbon.pessoa.free.fr/MapPreview.png" alt="" border="0" /></a>A Google map is quite nice, but not so useful if you travel without internet! 
<span style="font-size:78%;">By the way, if you're looking for an internet connection in Lisbon, try <a href="http://lisbon.pessoa.free.fr/places.php?id=49">Rua da Madalena.</a></span> The map is also available in a <a href="http://lisbon.pessoa.free.fr/PrinterFriendlyMap.php">printer-friendly version, with <span style="font-weight: bold;">a number associated with each place</span></a> <span style="font-size:78%;">(don't worry if the page takes some time to load, usually more than 10 seconds for me ;))</span><span style="font-size:100%;">. To get the labels of those numbers, sorted as they appear in the text, go to the bottom of this <a href="http://lisbon.pessoa.free.fr/PrinterFriendly.php">printer-friendly version of the text</a>.</span><br /><br />If you have access to the interactive version though, you get much more information. For many places, there is a link to the Wikipedia page, or even to the official website (with opening hours, for museums).<br /><br />This supplementary information, which transforms the text into an interactive book, has not been added directly to the original text. Instead, I created a <span style="font-size:78%;">PHP+MySQL+JavaScript</span> system that automatically inserts into the text this information, which is stored in <a href="http://fr.wikipedia.org/wiki/Base_de_donn%C3%A9es">databases</a>. The picture below tries to explain the principle. Besides the text, there are 3 databases: the <span style="font-weight: bold;">blue</span> one stores the <span style="font-weight: bold;">locations in the text of occurrences of the places</span>, the <span style="font-weight: bold;">orange</span> one stores the <span style="font-weight: bold;">places</span>, and the <span style="font-weight: bold;">purple</span> one stores coordinates. Now let's explain the arrows below. 
For a given set of coordinates on the map, stored in the purple database, there may be one or many interesting things to see (for example on Praça do Comércio there is also an equestrian statue of King José I). Each of these things to see has a file in the orange database, which gives its name and description, sometimes a photo as well. Note that if you want the website to give you information on places in another language, you just have to translate this database and not the whole site! Finally, to know where all these interesting things appear in the text by Pessoa, the position of the characters where they appear is stored in the blue database. It is then possible that one of them appears at different places in the text, like Praça do Comércio below. If the original text is modified (translated, for example), then this blue database has to be changed too.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://lisbon.pessoa.free.fr/LisbonByPessoaTablesENG.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://lisbon.pessoa.free.fr/LisbonByPessoaTablesENG.png" alt="" border="0" /></a><br />To finish the project I still have to complete the orange database (I've currently done <a href="http://lisbon.pessoa.free.fr/PrinterFriendly.php">more than one third</a>). However you can already access everything added so far, especially the text illustrated with photos <a href="http://lisbon.pessoa.free.fr/Pessoa_Lisbon.htm">here</a>. 
And of course the <a href="http://lisbon.pessoa.free.fr/PrinterFriendlyMap.php">Google map</a>, which is the basic element of this <a href="http://en.wikipedia.org/wiki/Mashup_%28web_application_hybrid%29">mashup</a> <span style="font-size:78%;">(<a href="http://philippe.gambette.free.fr/indexENG.htm">contact me</a> to get the sources if you have a similar interactive book project)</span> on a theme by Pessoa.<br /><br />So you can start planning your one-week (or longer) trip to Lisbon in good company: the company, at least, of some elements, printed or downloaded, from the <a href="http://lisbon.pessoa.free.fr/">site</a>...<br /><br /><object height="110" width="300"><param name="movie" value="http://media.imeem.com/m/yPOUT7X_9L/aus=false/"><param name="wmode" value="transparent"><embed src="http://media.imeem.com/m/yPOUT7X_9L/aus=false/" type="application/x-shockwave-flash" wmode="transparent" height="110" width="300"></embed></object><br /><br /><span style="font-size:85%;"><br />This post was originally published in French: <a href="http://gambette.blogspot.com/2008/07/livre-interactif-lisbonne-par-fernando.html"><span style="font-style: italic;">Livre interactif : Lisbonne par Pessoa.</span></a><br /></span><br /><br />Cooking for nerds: ingredient polyhedron and convex hull<br />2008-04-19<br /><br /><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/200802Recette/ChouxAuComte.gif" alt="" border="0" />Even though I'm not an expert in <a href="http://en.wikipedia.org/wiki/Molecular_gastronomyv">molecular gastronomy</a>, I'm often very impressed by the transformations of form, color or texture that happen when I cook. 
It's so nice to watch those <i><a href="http://en.wikipedia.org/wiki/Choux_pastry">choux</a> au <a href="http://en.wikipedia.org/wiki/Comt%C3%A9_%28cheese%29">Comté</a></i> transform in the oven, or to see how beaten egg whites can turn into crisp meringues on my radiator. Don't worry, I won't talk about chemistry or how those reactions work, only about <b>to what extent they can work</b>.<br /><br /><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/200802Recette/MeringuesRadiateur.jpg" alt="" border="0" />Recipes are so precise: you get a list of ingredients with exact quantities, and how to use them, but no guarantee about what happens if you don't respect the quantities exactly. That's why, in this post, I will define a tool to represent <span style="font-weight: bold;">ingredient quantity robustness</span> in a recipe: the <span style="font-weight: bold;">ingredient <a href="http://en.wikipedia.org/wiki/Polyhedron">polyhedron</a></span>. I will also give a method to compute it from many recipes for the dish you want to cook, found on the web for example. My example will be <span style="font-weight: bold;">crêpes</span>, our French flat pancakes.<br /><br />This dessert is made with roughly <span style="font-weight: bold;">3 ingredients</span> (and of course butter for the pan, but we will only consider the ingredients of the batter), so we will get a very nice 3D picture. 
So: eggs, flour, milk, those ones appear in all of the 19 recipes I've gathered in this <a href="http://philippe.gambette.free.fr/Blog/200802Recette/Recipes.ods">OpenOffice spreadsheet file</a> thanks to the following websites: <a href="http://www.lejus.com/">lejus.com</a>, <a href="http://recettes.1001delices.net/">1001delices.net</a>, <a href="http://www.recette-crepe.net/">recette-crepe.net</a>, <a href="http://www.goosto.fr/">goosto.fr</a>, <a href="http://www.supertoinette.com/">supertoinette.com</a>, <a href="http://www.recettes.qc.ca/">recettes.qc.ca</a> and the French reference <a href="http://www.marmiton.org/">Marmiton</a> <span style="font-size:78%;">(sorry for my <a href="http://allrecipes.com/Recipe/Vegan-Crepes/Detail.aspx">vegan friends</a>)</span>. But maybe I'll just start with 2 ingredients to show how the whole thing works. Say we have already decided the number of eggs to use, one for example. We then compute according to all recipes, with a <a href="http://en.wikipedia.org/wiki/Rule_of_three_%28mathematics%29">rule of three</a>, <span style="font-weight: bold;">the quantity </span><i style="font-weight: bold;">x</i><span style="font-weight: bold;"> of milk and </span><i style="font-weight: bold;">y</i><span style="font-weight: bold;"> of flour</span> that have to be added (I translated everything to grams for simplicity). Those coordinates can then be plotted on a graph, to get the following dots:<br /><br /><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/200802Recette/CrepesPolygonIngredients.gif" alt="" border="0" />From the lower left to the upper right, the number of eggs in the recipes decreases (as there is more and more flour and milk). On the upper left corner we have lots of flour, and on the lower right corner, more milk. And what is this kind of <span style="font-weight: bold;">elastic band</span> which sticks around the dots? 
It's some kind of safety area: <span style="font-weight: bold;">any point within this area should correspond to ingredient quantities that work for the recipe</span>. Well, at least that's what I hope: <b>any point of the segment between any two points that work should work too</b>; send me your counterexamples if you do not agree. Anyway, this area is called <a style="font-weight: bold;" href="http://en.wikipedia.org/wiki/Convex_hull">the convex hull</a> of the point set, and there are <a href="http://www.chrisharrison.net/projects/convexHull/index.html">many algorithms</a> to compute it automatically. So of course, to avoid taking risks, you may want to target the middle of the convex hull. Notice that 3 recipes with the same main ingredient quantities correspond to a quite central dot (half a liter of milk and 250 grams of flour for 3 eggs).<br /><br />The convex hull also shows the <span style="font-weight: bold;">robustness of the recipe with respect to each parameter</span>, that is, how accurate you have to be when measuring each ingredient. Note how narrow the convex hull is (it would be even narrower if I had chosen the same vertical and horizontal scale). This means that, depending on the recipe, the quantity of eggs may vary a lot, but not the milk/flour proportion. We can plot for each recipe the difference between the ratio of two ingredients and the average ratio for those two ingredients:<br /><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/200802Recette/IngredientProportions.png" alt="" border="0" /><br />If you average the <a href="http://en.wikipedia.org/wiki/Absolute_value">absolute values</a> of those deviations, you get: 16% for the milk/flour ratio, 28% for flour/eggs, 31% for milk/eggs. 
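<br /><br />The two computations above (the elastic band and the average deviation of a ratio) can be sketched in a few lines of Python. This is only a minimal sketch: the sample points are made up for illustration and are not the 19 recipes of the spreadsheet, and the hull algorithm used here (Andrew's monotone chain) is just one of the many available, not necessarily the one behind the pictures.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (grams of milk, grams of flour) for one egg


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def inside_hull(hull: List[Point], p: Point) -> bool:
    """True if p is inside or on the border of a counter-clockwise convex polygon."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def mean_abs_deviation(ratios: List[float]) -> float:
    """Average of |ratio - mean ratio|, relative to the mean ratio."""
    mean = sum(ratios) / len(ratios)
    return sum(abs(r - mean) for r in ratios) / (len(ratios) * mean)


# Hypothetical per-egg quantities, for illustration only.
recipes = [(167.0, 83.0), (125.0, 62.0), (250.0, 125.0),
           (200.0, 80.0), (150.0, 90.0), (180.0, 85.0)]
hull = convex_hull(recipes)
print("hull vertices:", hull)
print("is (170 g milk, 85 g flour) safe?", inside_hull(hull, (170.0, 85.0)))
print("milk/flour deviation:", mean_abs_deviation([m / f for m, f in recipes]))
```

A point that falls outside the polygon, such as a lot of milk with very little flour, makes `inside_hull` return `False`.<br /><br />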
<span style="font-weight: bold;">The milk/flour ratio varies much less </span>than the other ratios among the recipes, so you have to be more careful with this proportion than when choosing the number of eggs. So we have just illustrated and proven this nice theorem: <span style="font-weight: bold;">the crêpe recipe is pretty robust to variations in the number of eggs</span>.<br /><br />You can also have fun by showing several desserts which share the same main ingredients on the same graph:<br /><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/200802Recette/IngredientPolygonCrepesWafflesFlan.png" alt="" border="0" />Well, just wait before pouring your "pâte à crêpes" onto a waffle iron: you may want to add some <a href="http://lonestar.texas.net/%7Efitch/recipies/waffles.html">baking powder and vegetable oil</a>...<br /><br />To conclude, let's take a look at the 3D ingredient polyhedron <a href="http://www.cse.unsw.edu.au/%7Elambert/java/3d/hull.html">thanks to this very nice applet by Tim Lambert</a> <span style="font-size:78%;">(he even shares the <a href="http://www.cse.unsw.edu.au/%7Elambert/java/3d/implementation.html">source</a>, which I was able to modify to include my crêpe points)</span>; you can use the mouse to control and move it:<br /><applet codebase="http://philippe.gambette.free.fr/Blog/200802Recette/ConvexHullApplet/" code="AppletHull.class" archive="3d.zip" height="450" width="550"><br /><param name="bgcolor" value="ffffff"><br />Sorry, but you need Java to see the animation.<br /></applet><br /><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/200802Recette/Crepe.jpg" alt="" border="0" />Here again what we see is a <b>convex hull</b>, in 3 dimensions, on dots (<i>x</i>,<i>y</i>,<i>z</i>) where <i>x</i> is the number of eggs, <i>y</i> the quantity of milk and <i>z</i> the quantity of flour. 
I placed the dots by choosing a minimum and a maximum limit on the number of eggs to get this <a href="http://en.wikipedia.org/wiki/Frustum">frustum</a>, such that any cut perpendicular to the <i>x</i> axis (for a constant number of eggs) gives the same convex hull polygon as above, up to scaling. To really use it we should let the user enter the quantities of ingredients they just used: if the dot gets inside the polyhedron, no problem, otherwise... you may try the restaurant tonight!<br /><br />Reverse engineering Google Trends (2): margin of error<br />2008-03-24<br /><br />As I wrote in <a href="http://veronising.blogspot.com/2008/03/reverse-engineering-google-trends-1.html">my last post</a>, today I'll be quite technical about the <span style="font-weight: bold;">margin of error</span> of my computation, with some considerations on how to minimize this error at the end of the post. Last time on <i>Veronising</i>: I chose <b>a hierarchy of terms which have higher and higher Google Trends curves, to evaluate, by a sequence of rules of three, the frequency of Google searches for the highest term compared to the least looked for</b>.<br /><br />For the computation I intuitively chose the words such that for each pair of consecutive ones, the first one had a peak approximately twice as high as the second one. The margin of error when reading the value of the curves is approximately 1 pixel, but this absolute error does not translate into the same relative error on the higher and the lower curve. The higher one always peaks at 113 pixels: 1 pixel of error is less than 1% here. However, if the lower one peaks at 50 pixels, it will be a 2% error. If the curve never rises above 3 pixels, then the error is more than 30%! So do we have to choose a hierarchy of curves very close to each other? 
Not necessarily, because in this case we may <b>indeed reduce the error at each step of the computation, but we increase the number of steps (thus, the number of errors) between the least and the most searched</b>.<br /><br />I couldn't help mathematically modeling this delicate balance that I've just expressed in a sentence. I call <i>a</i> the ratio between the max of the higher and the lower curve within a pair of consecutive ones (thus <i>a</i>>1). To simplify the problem, I consider that this ratio is <b>constant</b> over my whole scale of words. Then, ideally, I would like to find a word 1 searched <i>x</i> times a day on Google, a word 2 searched <i>ax</i> times, a word 3 searched <i>a</i><sup>2</sup><i>x</i> times... a word <i>n</i>+1 searched <i>a<sup>n</sup>x</i> times.<br /><br />Now, let's compute the error: instead of reading a height of <i>k</i> pixels for a word, and <i>ak</i>=113 for the next one, say I make an error of 1 pixel, each time too high <span style="font-size:78%;">(this is a pessimistic assumption; actually the error probably alternates, sometimes too high, sometimes too low...)</span>. In my computation, without error, the rule of 3 should give as the number of searches for the higher term:<br /><div style="text-align: center;"><i>x</i>.113/<i>k</i> = <i>x</i>.<i>ak</i>/<i>k</i> = <i>xa</i><br /></div><br />The problem is my 1 pixel error, so when I apply the rule of 3 I get in fact:<br /><div style="text-align: center;"><i>x</i>.113/(<i>k</i>+1) = <i>x</i>.113/(113/<i>a</i>+1) = <i>x</i>.113<i>a</i>/(113+<i>a</i>)<br /></div><br />Thus at each step I multiply by 113<i>a</i>/(113+<i>a</i>) instead of multiplying by <i>a</i>, so for the most searched word, I find <i>x</i>(113<i>a</i>/(113+<i>a</i>))<sup><i>n</i></sup> instead of <i>xa<sup>n</sup></i>. 
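<br /><br />To check this closed form, here is a minimal Python sketch that applies the rule of three step by step with the pessimistic 1 pixel error. The function name and the sample values (<i>a</i>=2, <i>n</i>=14) are chosen for illustration only.

```python
PEAK = 113.0  # height in pixels of the higher curve in every comparison


def chained_estimate(x: float, a: float, n: int, pixel_error: float = 1.0) -> float:
    """Estimate the searches of word n+1 from word 1 (x searches) by n rules of three.

    At each step the lower curve truly peaks at PEAK / a pixels, but it is
    read pixel_error pixels too high.
    """
    estimate = x
    for _ in range(n):
        read_height = PEAK / a + pixel_error   # k + 1 in the formulas above
        estimate = estimate * PEAK / read_height
    return estimate


x, a, n = 1.0, 2.0, 14
true_value = x * a ** n
with_error = chained_estimate(x, a, n)
closed_form = x * (PEAK * a / (PEAK + a)) ** n

print("true value:      ", true_value)
print("step by step:    ", with_error)
print("closed form:     ", closed_form)
print("underestimation: ", 1 - with_error / true_value)
```

With `pixel_error=0.0` the loop gives back the true value <i>xa<sup>n</sup></i>, which is a quick sanity check of the chain itself.<br /><br />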
I underestimate the real value, so to minimize the error I must find the <i>a</i>>1 that maximizes <i>x</i>(113<i>a</i>/(113+<i>a</i>))<sup><i>n</i></sup>.<br /><br />Second part of the computation: the number of steps, that is <i>n</i> (for <i>n</i>+1 words), of course... but this <i>n</i> depends on <i>a</i>. Indeed we consider that the least searched word (<i>x</i> times) and the most searched one (<i>x</i>'=<i>xa<sup>n</sup></i> times) are fixed. Then <i>x</i>'=<i>xe</i><sup><i>n</i> ln <i>a</i></sup>, so ln(<i>x</i>'/<i>x</i>)=<i>n</i> ln <i>a</i> and finally <i>n</i>=ln(<i>x</i>'/<i>x</i>)/ln <i>a</i>.<br /><br />We put this into the formula above: all the words of the hierarchy are underestimated, and the highest one is evaluated to:<br /><div style="text-align: center;"><i style="font-weight: bold;">x</i><span style="font-weight: bold;">(113</span><i style="font-weight: bold;">a</i><span style="font-weight: bold;">/(113+</span><i style="font-weight: bold;">a</i><span style="font-weight: bold;">))</span><sup style="font-weight: bold;">ln(<i>x</i>'/<i>x</i>)/ln <i>a</i></sup><br /></div><br /><span style="font-weight: bold;">which we now have to maximize with respect to <i style="font-weight: bold;">a</i></span>. A quick analysis of this function at its limits shows that it tends to 0 as <i>a</i> tends to 1<sup>+</sup>, and to <i>x</i> (that is, 1 if we take <i>x</i>=1) as <i>a</i> tends to +∞: in both cases far below the true value <i>x</i>'. Very well, it expresses the dilemma I was mentioning in the 2nd paragraph. However it doesn't tell us where the max is reached, and neither <a href="http://ahmed.youssef.free.fr/">Ahmed the Physicist</a> nor <a href="http://www.iecn.u-nancy.fr/Le-Laboratoire-Et-La-Recherche/Le-Personnel/">Julian the Mathematician</a>, helped respectively by Mathematica and Maple, could give me a nice formula: their results still contain some ugly <a href="http://en.wikipedia.org/wiki/Root_%28mathematics%29">RootOf</a>(...) terms.<br /><br />No problem, we'll just find an <span style="font-weight: bold;">approximation using an OpenOffice spreadsheet</span>. 
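<br /><br />The same kind of numerical approximation can also be sketched in Python instead of a spreadsheet. This is a minimal sketch under a few assumptions of mine: a simple grid search over <i>a</i> between 1.05 and 10 with a step of 0.01, a non-integer <i>n</i> allowed as in the formula, and the estimate divided by the true value <i>x</i>' (a constant, so the maximum is reached at the same <i>a</i>).

```python
import math

PEAK = 113.0  # height in pixels of the higher curve in every comparison


def relative_estimate(a: float, total_ratio: float) -> float:
    """Estimate divided by true value for the most searched word.

    x (113a/(113+a))^n / (x a^n) = (113/(113+a))^n, with n = ln(total_ratio)/ln(a).
    """
    n = math.log(total_ratio) / math.log(a)
    return (PEAK / (PEAK + a)) ** n


def best_ratio(total_ratio: float, lo: float = 1.05, hi: float = 10.0,
               step: float = 0.01) -> float:
    """Grid search for the ratio a that maximizes the relative estimate."""
    count = int(round((hi - lo) / step)) + 1
    candidates = [lo + i * step for i in range(count)]
    return max(candidates, key=lambda a: relative_estimate(a, total_ratio))


total_ratio = 20000.0  # most searched / least searched word
a_best = best_ratio(total_ratio)
error = 1 - relative_estimate(a_best, total_ratio)
print("best a:", round(a_best, 2))
print("lower curve peak (px):", round(PEAK / a_best))
print("worst-case underestimation:", round(error, 3))
```

A finer grid or a proper one-dimensional optimizer would refine the value of <i>a</i>, but the error curve is quite flat around its optimum, so the grid step matters little here.<br /><br />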
The spreadsheet file is <a href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/GoogleTrendsError.ods">there</a>, and here is the curve obtained for a ratio of 20,000 between the most searched and the least searched word <span style="font-size:78%;">(the figure approximately corresponds to what I found for my hierarchy)</span>:<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/GoogleTrendsError.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/GoogleTrendsError.png" alt="" border="0" /></a>So the minimal error is reached for <span style="font-weight: bold;"><i>a</i> approximately equal to 2.75</span> <span style="font-size:85%;">(i.e. a maximal height of 41 pixels for the lower curve)</span>. The error is then less than 25%. Of course it seems a lot, but remember the remark on how pessimistic this scenario was, with errors accumulating through successive underestimations. I still have this interesting theoretical question: <span style="font-weight: bold;">is it possible to compute the expectation of the error on the computed value of the most searched word, if at each step the error randomly varies between -1 and +1 pixel?</span><br /><br />One can also notice that the curve varies a bit faster on the left of the optimum than on the right. As shown in green on the graph, it seems that <span style="font-weight: bold;">we'd better choose a hierarchy such that consecutive reference words have a number of searches ratio of 4 rather than a ratio of 2</span>.<br /><br />Now, here are some other hints to improve the accuracy of the computation. First, measurement accuracy: instead of just measuring the maximum, where we know there is an inevitable error, we can try to compute it from measurements with smaller errors. 
I come back to my example from <a href="http://veronising.blogspot.com/2008/03/reverse-engineering-google-trends-1.html">the previous post</a> with cat, dog, and phone:<br /><i>Comparison cat </i><i>~</i><i> dog</i> (curve 1) : 65 px <i>~</i> 113 px<br /><i>Comparison dog ~ phone</i> (curve 2) : 69 px <i>~</i> 113 px<br /><br />Except that instead of measuring the maximum of dog on curve 2, we can evaluate it the following way: compute the average of the values of curve 1 for dog, and the average of the values of curve 2 for dog. Then deduce a very accurate scale change (the ratio of the two averages). Finally multiply the maximum of dog on curve 1 (that is exactly 113 pixels, no error here) by this scale change!<br /><br />Another problem now: how to obtain the average of the values of a Google Trends curve? With the <a href="http://freecorp.free.fr/FRA/programmesdivers.htm#CaptuCourbe">CaptuCourbe</a>, of course! Be careful here: some values may not be retrieved by the CaptuCourbe <span style="font-size:78%;">(color problem, for example when the curve is cut by a vertical black line hanging from a Google News label bubble)</span>. So you have to compute the average of the curves over the values you really managed to retrieve!<br /><br />One more thing: the CaptuCourbe <span style="font-weight: bold;">is not very accurate</span> because it <span style="font-weight: bold;">gets the values of all pixels of some color from a curve, and computes the average, for each column of retrieved values</span>. I've developed a new version, online soon, which makes it possible to get the <span style="font-weight: bold;">maximum of the heights of pixels of some color</span>. I'm using this feature in my method to compute the max, but I still use the averaging option to compute the average of the curves. 
This is not a small detail, as you can see on the <a href="http://www.google.fr/trends?q=britney+spears">Britney Spears Google Trends curve</a>, that I extracted in both ways:<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/ErrorCaptuCourbe.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/ErrorCaptuCourbe.png" alt="" border="0" /></a>A 20% error in the measure of many peaks using the pixels of the same color is really something!<br /><br />So, to close this series of posts on the vertical scale of Google Trends, I still have some questions left. First, get a "value of the foo" in the number of daily searches. Then I could try to program the whole chain of curve retrieval, measures, and computations, as described in my first post, to provide a utility which would add the vertical scale to a Google Trends curve. 
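<br /><br />As a small step toward such a utility, the averaged scale change described earlier in this post could be sketched as follows in Python. This is only a sketch under my own assumptions: each curve is a list with one height per column, `None` marks a column that could not be retrieved, and only the columns retrieved in both curves are averaged. The sample heights are made up.

```python
from typing import List, Optional, Sequence, Tuple

PEAK = 113.0  # exact peak, in pixels, of the higher curve of a comparison


def common_averages(curve_1: Sequence[Optional[float]],
                    curve_2: Sequence[Optional[float]]) -> Tuple[float, float]:
    """Average both curves over the columns retrieved in both of them."""
    pairs: List[Tuple[float, float]] = [
        (h1, h2) for h1, h2 in zip(curve_1, curve_2)
        if h1 is not None and h2 is not None
    ]
    if not pairs:
        raise ValueError("no column was retrieved in both curves")
    avg_1 = sum(h1 for h1, _ in pairs) / len(pairs)
    avg_2 = sum(h2 for _, h2 in pairs) / len(pairs)
    return avg_1, avg_2


def rescaled_peak(word_in_curve_1: Sequence[Optional[float]],
                  word_in_curve_2: Sequence[Optional[float]]) -> float:
    """Peak of the word on the scale of curve 2, deduced from the averages.

    The word is the higher one in curve 1, so its peak there is exactly PEAK.
    """
    avg_1, avg_2 = common_averages(word_in_curve_1, word_in_curve_2)
    scale_change = avg_2 / avg_1
    return PEAK * scale_change


# Hypothetical heights of "dog" in the cat ~ dog and dog ~ phone comparisons.
dog_curve_1 = [113.0, 100.0, None, 90.0, 50.0]
dog_curve_2 = [67.8, 60.0, 57.0, None, 30.0]
print("peak of dog on curve 2:", rescaled_peak(dog_curve_1, dog_curve_2))
```

The result replaces the peak read directly on the second curve (69 pixels in the cat, dog, phone example), which carries the 1 pixel reading error.<br /><br />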
Anyway, don't expect too much: I'd better wait and see whether the API Google is preparing will provide this data.<br /><br />Estimating the number of searches for some keyword is still a nice challenge. I've discovered <a href="http://www.benjaminmaker.com/gtrends-made-easy/">GTrends Made Easy</a>, a freeware program which gives some estimates computed with a method similar to mine <span style="font-size:78%;">(in fact its author does only 1 rule of three, comparing the requested word with a reference word whose number of Google searches he knows, approx. 500; this covers words searched between 5 and 50,000 times a day, that is, less than 100 foo)</span>, which was described in this <a href="http://www.youtube.com/watch?v=jcN2WrLIaXY&feature=related">YouTube video</a>.<br /><br /><br /><span style="font-style: italic;">This post was originally published in French: </span><a style="font-style: italic;" href="http://gambette.blogspot.com/2008/03/rtroingnirie-de-google-trends-2-marge.html">Rétroingéniérie de Google Trends (2) : marge d'erreur</a><span style="font-style: italic;">.</span><br /><br />Reverse engineering Google Trends (1)<br />2008-03-10<br /><br />Last December I started to create a simple program to retrieve the values of a curve from a picture, the <a href="http://freecorp.free.fr/programmes/CaptuCourbe.exe">CaptuCourbe</a>, which is not yet translated into English, but has <a href="http://freecorp.free.fr/captucourbe/TutorielCaptuCourbeENG.pdf">an English tutorial</a>. One of the possible uses of this free software is <b>retrieving and comparing <a href="http://www.google.com/trends">Google Trends curves</a></b>. Except Google Trends curves have a major problem: <b>the vertical scale is not shown</b>! 
On top of that there is no zooming tool, so we can't directly compare curves of drastically different heights. The maximum height of a curve is always 113 pixels, so you won't be able to know if a word has been searched 1000 or 10000 times more than another.<br /><br />Here is a <span style="font-weight: bold;">hierarchy of English words, in decreasing order of their number of Google searches according to Google Trends</span>: of, free, sex, car, dog, gun, muscle, knife, torn, filming, separating, fooling.<br /><br />They can be used to create <span style="font-weight: bold;">a scale for Google Trends</span>. It may not be very accurate, but would still be useful to get quantitative values. To compute it, I <i>google-trended</i> pairs of successive words in the hierarchy above. This gives me the <span style="font-weight: bold;">scale change</span> for each pair, by measuring the height (in pixels) of the maximum of each curve. Here is a picture to explain what I mean:<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/GoogleTrendsScale.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/GoogleTrendsScale.png" alt="" border="0" /></a><br />As I do that for successive words, I get values like this:<br /><i>Comparison cat </i><i>~</i><i> dog</i> : 65 px <i>~</i> 113 px<br /><i>Comparison dog ~ phone</i> : 69 px <i>~</i> 113 px<br />thus I can deduce by a subtle use of the <a href="http://www.blogger.com/en.wikipedia.org/wiki/Rule_of_three">rule of three</a>:<br /><i>cat </i><i>~</i><i> dog </i><i>~</i><i> phone</i> : 65 <i>~</i> 113 <i>~</i> 113*113/69=185,06<br />considering the scale of the first line or:<br /><i>cat </i><i>~</i><i> dog </i><i>~</i><i> phone</i> : 69*65/113=39,69 <i>~</i> 69 <i>~</i> 113<br />with the scale of the second one.<br /><br />I did this computation for 
all 11 words to get the following maximum values, where I defined the reference as the maximum of <span style="font-style: italic;">fooling</span>. Of course, I call this new unit the <a href="http://www.blogger.com/en.wikipedia.org/wiki/Foobar">foo</a>:<br /><ul><li><a href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/BetterScale/01FoolingSeparating.png">fooling</a> : 1 foo</li><br /><li><a href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/BetterScale/01FoolingSeparating.png">separating</a> : 2,5 foo</li><br /><li><a href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/BetterScale/02SeparatingFilming.png">filming</a> : 6,3 foo</li><br /><li><a href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/BetterScale/03FilmingTorn.png">torn</a> : 18 foo</li><br /><li><a href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/BetterScale/04TornKnife.png">knife </a>: 58 foo</li><br /><li><a href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/BetterScale/05KnifeMuscle.png">muscle </a>: 120 foo</li><br /><li><a href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/BetterScale/06MuscleGun.png">gun </a>: 240 foo</li><br /><li><a href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/BetterScale/07GunDog.png">dog </a>: 640 foo</li><br /><li><a href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/BetterScale/08DogCar.png">car </a>: 1500 foo</li><br /><li><a href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/BetterScale/09CarSex.png">sex </a>: 3200 foo</li><br /><li><a href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/BetterScale/10SexFree.png">free </a>: 6600 foo</li><br /><li><a href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/BetterScale/11FreeOf.png">of</a> : 16500 foo</li></ul>Be careful, what you have to remember is not only <b>those different values</b>, but also <b>the position of the maximum which reaches those values</b>, that's why each word above links to 
a picture of the curve to locate its maximum value. Indeed, if you want to determine the value of a peak for a new word, either you understood this rule of three principle and then you can have fun computing it directly, or you just use the CaptuCourbe, with the reference curve whose max is just above the peak you want to compute:<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/ManaudouCarDog.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/ManaudouCarDog.png" alt="" border="0" /></a>For example here about <span style="font-weight: bold;">800 foo for <a href="http://www.google.fr/trends?q=manaudou&ctab=0&geo=all&date=all&sort=0">Manaudou</a></span> in December 2007, to compare with the <span style="font-weight: bold;">240 foo of </span><a style="font-weight: bold;" href="http://www.google.fr/trends?q=bruni%2Cgun&ctab=0&geo=all&date=all&sort=0">the Bruni peak</a>, the <span style="font-weight: bold;">470 foo reached by </span><a style="font-weight: bold;" href="http://www.google.fr/trends?q=obama%2Cdog&ctab=0&geo=all&date=all&sort=0">Obama</a>, the <span style="font-weight: bold;">1000 foo of </span><a style="font-weight: bold;" href="http://www.google.fr/trends?q=britney+spears%2Ccar&ctab=0&geo=all&date=all&sort=0">Britney</a>, the <span style="font-weight: bold;">3200 foo of </span><a style="font-weight: bold;" href="http://www.google.fr/trends?q=tsunami%2Csex&ctab=0&geo=all&date=all&sort=0">the 2004 tsunami</a> and the <span style="font-weight: bold;">5700 foo of... 
<a href="http://www.google.fr/trends?q=jackson%2Cfree&ctab=0&geo=all&date=all&sort=0">Janet Jackson after Superbowl 2004</a></span>!<br /><br />Now, let's get to the bad news:<br />- the <span style="font-weight: bold;">error</span> propagated by applying 10 times the rule of three will be <span style="font-weight: bold;">the topic of my next post</span>, quite technical <span style="font-size:78%;">(there will even be a pretty nice equation that neither Maple nor Mathematica can simplify)</span>... just consider that the numbers above must be accurate +/- 10%.<br />- <span style="font-weight: bold;">the Google Trends curves vary a lot</span> <span style="font-size:78%;">(maybe it's just a discretization problem, but in this case it's quite strange that the Google News discretization below is the same)</span>, as you can see on this animated gif <span style="font-size:78%;">(created with the <a href="http://www.01net.com/telecharger/windows/Multimedia/creation_graphique/fiches/34462.html">great and simple UnFreez</a>)</span> :<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/GoogleTrendsChanges.gif"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/GoogleTrendsChanges.gif" alt="" border="0" /></a>So be careful if you use one of those reference words: you have to remember the value of the peak, its position, but you may also want to superimpose the reference curve that I linked to the word, to check that the reference curve in the picture you're using has its max at the same place, and has the same scale. 
Try to correct it if it's not the case.<br />- the scale remains relative, to get an absolute one the question would be: <span style="font-weight: bold;">how many Google requests is 1 foo?</span> After my post in French, I got some pretty good comments on this idea, I sum them up here. First we have to be careful that the curves don't show the number of searches, but just <span style="font-weight: bold;">the proportion of searches for a word among all searches in some period of time</span>. This would explain why the Janet Jackson buzz was so high, it's difficult to compare the number of searches corresponding to 5700 foo in 2004 to 800 foo today. Anyway it's still possible to get an idea of the proportion from the number of searches, by trying to find data on the evolution of the number of Google searches in the past years, this must exist on the web <span style="font-size:78%;">(Alexa, at least...)</span>. Let's be more accurate about these values: on the 2004-2008 pictures, as I said I have no idea how the discretization is made, however <span style="font-weight: bold;">on the yearly or monthly pictures, it's quite clear that we find, respectively, the weekly and the daily numbers</span>. So what I'm looking for right now is, for some word, the number of searches it corresponds to. Elandrael <a href="http://gambette.blogspot.com/2008/03/rtroingnirie-de-google-trends.html#c1687629109184571917">had the brilliant idea</a> of using <span style="font-weight: bold;">Google Adwords stats</span>, to get at least a lower bound on this number. For the moment I only got one Google Adword to apply this idea, which would show that a one foo peak corresponds to more than 20000 searches in a month, that is more than 4000 searches if we look at the weekly value in a yearly curve. 
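<br /><br />As an aside, the relative scale used throughout this post boils down to chaining rule-of-three ratios between successive word pairs. Here is a minimal Python sketch of that chaining. The function name <code>chain_scale</code> and the input format are purely illustrative (they are not part of the CaptuCourbe), and the pixel heights are the ones from the cat ~ dog ~ phone example earlier in the post.

```python
def chain_scale(pairs, reference=1.0):
    """Chain rule-of-three ratios between successive word pairs.

    pairs: list of (low_word, low_px, high_word, high_px), ordered so that
    each pair's low_word is the previous pair's high_word.
    Returns a dict mapping each word to its value, the very first word
    being worth `reference` units.
    """
    values = {}
    for low_word, low_px, high_word, high_px in pairs:
        if not values:
            values[low_word] = reference
        # Rule of three: on one picture, value ratio == pixel height ratio.
        values[high_word] = values[low_word] * high_px / low_px
    return values


# Pixel heights from the example: cat ~ dog is 65 px ~ 113 px,
# dog ~ phone is 69 px ~ 113 px. The scale of the first picture is kept.
pairs = [("cat", 65, "dog", 113), ("dog", 69, "phone", 113)]
scale = chain_scale(pairs, reference=65)
print(scale)
```

Setting <code>reference=1.0</code> on the smallest word instead would express every other word in units of that word, which is how the foo is defined with <i>fooling</i>.<br /><br />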
So of course <span style="font-weight: bold;">I would love to get some other statistics like this</span> to confront the data, just <a href="http://www.lirmm.fr/%7Egambette/PersoContactENG.php">contact me</a> privately if you don't want to write on this blog the Adword you're paying for and its stats. On the same principle, you can also contact me if you have the stats for a common word in which your website appears as the first Google Answer.<br /><br /><br /><span style="font-style: italic;">This post was originally published in French: <a href="http://gambette.blogspot.com/2008/03/rtroingnirie-de-google-trends.html">Rétroingéniérie de Google Trends</a></span>.<br /><br /><span style="font-size:78%;">Source files: the Google Trends curves of each word are linked above, here is the <a href="http://philippe.gambette.free.fr/Blog/GoogleTrendsScale/GoogleTrendsAccurate.ods">spreadsheet file that I used to compute the values in foo</a> (it's quite a mess though, more details to understand it in my next post).</span>Philippehttp://www.blogger.com/profile/17811557333070553722noreply@blogger.com1tag:blogger.com,1999:blog-5371783716552436400.post-3384214684217129272008-03-02T10:56:00.001+01:002008-03-02T11:00:29.690+01:00The birth of a buzz, liveI've been dreaming for some time to follow on the web the birth of a buzz, and evaluate the reaction of the tools dedicated to their analysis and detection. I would have prefered better circumstances, but I was able to do it for the tragedy of the Northern Illinois University shooting two weeks ago.<br /><br />The name of the gunman <a href="http://answers.yahoo.com/question/index?qid=20080214205048AAHy7LK">was not published on the evening of the drama</a>. But 10 hours later, the Chicago Tribune <a href="http://www.chicagotribune.com/news/local/chi-shooterfeb15,0,2581284.story">provided<br />enough elements to guess it on their website</a>. 
Of course they wrote quite hypocritically: <blockquote>The Tribune is not naming the gunman because police have not officially completed the identification of his body.</blockquote>A simple search for articles co-signed by <a href="http://scholar.google.com/scholar?q=authornbsp%3Aj-thomas+%22self+injury%22+prison">Jim Thomas with the keywords "self-injury" and "prison"</a> would identify the suspect: Steve Kazmierczak. At 8:10 GMT, a Wikipedia visitor <a href="http://en.wikipedia.org/w/index.php?title=Northern_Illinois_University_shooting&oldid=191605629">modifies the article about the shooting</a> to add that name. 30 minutes later, <a href="http://dclies.blogspot.com/2008/02/steve-kazmierczak.html">a first blog post cites it</a>, and its author updates it many times to add other info found on the net. The name then appears in a <a href="http://onsan.livejournal.com/207435.html">live journal</a> and on a forum, and at 10:33 is cited by the <a href="http://www.dailymail.co.uk/pages/live/articles/news/worldnews.html?in_article_id=514549&in_page_id=1770&ICO=NEWS&ICL=TOPART">Daily Mail</a> (<span style="font-size:78%;">the article has been updated since</span>). Then people start to google it a lot, and it reaches <span style="font-weight: bold;">the top of the </span><a style="font-weight: bold;" href="http://www.google.fr/trends">"hot trends" list</a>. It's then cited by <span style="font-weight: bold;">some <a href="http://fr.wikipedia.org/wiki/Splog">splogs</a>, which apparently make money by citing those trends</span>, sometimes with extracts of web pages about it, retrieved automatically. At 14:42, the Associated Press announces that the police gave the name: Steven Kazmierczak. 
I stopped following the buzz there, as articles or webpages about it then used "Steve", "Steven" or "Stephen".<br /><br /><map name="MAP1"><area shape="RECT" coords="337,213,396,257" href="http://hosted.ap.org/dynamic/stories/A/APNEWSALERT?SITE=IADES&SECTION=HOME&TEMPLATE=DEFAULT" target="_self"><area shape="RECT" coords="323,181,492,213" href="http://www.telegraph.co.uk/news/main.jhtml?xml=/news/2008/02/15/wshoot615.xml" target="_self"><area shape="RECT" coords="148,41,282,114" href="http://bigdekalb.com/modules/newbbex/viewtopic.php?topic_id=13&forum=5" target="_self"><area shape="RECT" coords="282,32,364,70" href="http://onsan.livejournal.com/207435.html" target="_self"><area shape="RECT" coords="282,70,399,114" href="http://www.dailymail.co.uk/pages/live/articles/news/worldnews.html?in_article_id=514549&in_page_id=1770&ICO=NEWS&ICL=TOPART" target="_self"><area shape="RECT" coords="483,87,684,105" href="http://www.google.fr/trends/hottrends?q=stephen+kazmierczak&date=2008-2-15&sa=X" target="_self"><area shape="RECT" coords="483,67,681,87" href="http://www.google.fr/trends/hottrends?q=steven+kazmierczak&date=2008-2-15&sa=X" target="_self"><area shape="RECT" coords="483,47,676,67" href="http://www.google.fr/trends/hottrends?q=steve+kazmierczak&date=2008-2-15&sa=X" target="_self"><area shape="RECT" coords="148,140,244,178" href="http://en.wikipedia.org/w/index.php?title=Northern_Illinois_University_shooting&oldid=191605629" target="_self"><area shape="RECT" coords="63,178,224,222" href="http://www.chicagotribune.com/news/local/chi-shooterfeb15,0,2581284.story" target="_self"><area shape="RECT" coords="171,105,256,140" href="http://dclies.blogspot.com/2008/02/steve-kazmierczak.html" target="_self"></map><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/SteveKazmierczak/SteveRecapEngMini.png" alt="" usemap="#MAP1" border="0" />Anyway, following the first hours gave me the opportunity to 
see how fast the web reacted. As I mentioned, the <span style="font-weight: bold;">Wikipedia</span> first gave the name. Once more we can wonder about the ethics of the project, and note that it has become THE place to find the latest scoops. See how reactive it is to the death of a celebrity? You can even use the <a href="http://www.wikirage.com/">Wikirage</a> tool, which put on top of its hot article list on February 13: <a href="http://fr.wikipedia.org/wiki/Henri_Salvador">Henri Salvador</a>, <a href="http://fr.wikirage.com/wiki/Imad_Mougniyah/">Imad Mougniyah</a>, and <a href="http://fr.wikipedia.org/wiki/Badri_Patarkatsishvili">Badri Patarkatsishvili</a>.<br /><br />About the blogosphere tools, one can notice that <a style="font-weight: bold;" href="http://www.blogpulse.com/">BlogPulse</a><span style="font-weight: bold;"> isn't very responsive</span>. Of course <a href="http://blogsearch.google.com/">Google Blogsearch</a> quite quickly detects the first blog post on the topic, in a Blogspot blog... However Blogsearch and <a href="http://www.technorati.com">Technorati</a> seem to have a similar efficiency: the Technorati curve is a bit higher after 2PM because of some splogs, which Google Blogsearch didn't display (on purpose, i.e. better splog detection?).<br /><br />The reaction of <span style="font-weight: bold;">search engines</span> on the "Steve Kazmierczak" request is also quite interesting. <span style="font-weight: bold;">They don't detect the buzz in the first hours, except Google</span>. Even if it's not very clear on the graph, the number of relevant results increases from 61 at 10:30 to 68 at 4PM (and those new pages deal indeed with the gunman). But this contrasts with the big rise of the total number of results, which reinforces <a href="http://aixtal.blogspot.com/2005/01/web-googles-counts-faked.html">the mystery on Google numbers</a>. 
Did the number of pages for this request really double in 5 hours, or is it just a suspicious approximation?<br /><br />But the most important may be in the Google Trends curves. Before the press dared to give the gunman's name, before Wikipedia learned of it, before the tabloids found out, <span style="font-weight: bold;">Google knew</span>, with the first searches on the name less than 3 hours after the shooting. Their leadership over other search engines also gives them a direct access to information, and their tools are ready to treat this as much as they can. With geolocation in particular, to determine the origin of requests, and maybe identify <a href="http://www.google.fr/trends?q=steven,steve&ctab=0&geo=US&geor=usa.il&date=2008-2&sort=0">a local buzz</a>. So when will Google launch a press agency or a tabloid, to uncover scoops and rumors hours before the Daily Mail? And who can access those Google Trends data live today? On the website, the curves are currently updated after 48 hours, not available for words not searched enough, the horizontal scale is not explained (I'm guessing that the 4AM dot represents the number of searches from 3AM to 4AM but I may be wrong), without even mentioning the lack of a vertical scale! 
A <a href="http://www.news.com/8301-10784_3-9828916-7.html?part=rss&subj=news&tag=2547-1_3-0-5">Google Trends API</a> may give the possibility to access this data, and give back to the <a href="http://aixtal.blogspot.com/2005/01/web-googles-counts-faked.html">internauts</a> the knowledge learned from them.<br /><br /><br /><span style="font-style: italic;">This post was originally published in French: <a href="http://gambette.blogspot.com/2008/03/suivi-en-direct-de-la-naissance-dun.html">Suivi en direct de la naissance d'un buzz</a></span>.<br /><span style="font-size:78%;"><a href="http://philippe.gambette.free.fr/Blog/SteveKazmierczak/STEVE%20KAZMIERCZAK.ods">The data I gathered and used for the graphs (OpenOffice Spreadsheet file)</a></span>Philippehttp://www.blogger.com/profile/17811557333070553722noreply@blogger.com0tag:blogger.com,1999:blog-5371783716552436400.post-14989200320677352802008-01-27T16:53:00.000+01:002008-02-04T15:15:18.042+01:00Danger: deadly hobbies!I'm not familiar with the American blogosphere (I hope blogging in English will help discovering it), but there is a blog there I often visit, <a href="http://en.wikipedia.org/wiki/Xkcd">xkcd</a>, full of witty or funny comics... 
for a somewhat restricted audience (I mean as geek as their author, <a href="http://en.wikipedia.org/wiki/Randall_Munroe">Randall Munroe</a>, who also created some other <a href="http://xkcd.com/kite/kite_trick.jpg">pretty</a> <a href="http://thefunniest.info/">nice</a> <a href="http://blag.xkcd.com/2007/12/31/ghost/">stuff</a>).<br /><br />I especially enjoyed one of his latest drawings which uses <span style="font-weight: bold;">Google result numbers</span>, as I've already done for <a href="http://gambette.blogspot.com/2006/06/googlefight-pour-lorthographe-le.html">spelling</a>, <a href="http://gambette.blogspot.com/2007/02/stats-de-popularit-artisanales.html">congressmen celebrity</a>, or the <a href="http://gambette.blogspot.com/2006/11/la-naissance-du-web-daprs-les-moteurs.html">birthdate of the web</a> :<br /><div style="text-align: center;"><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://xkcd.com/369/"><img style="margin: 0px auto 10px; display: block; cursor: pointer; text-align: center;" alt="" src="http://philippe.gambette.free.fr/Blog/Xkcd/Xkcd369.png" border="0" /></a></div>This picture created a <a href="http://slashdot.org/article.pl?sid=08/01/12/1312258&threshold=-1">slashdotted</a> <a href="http://mrcopilot.blogspot.com/2008/01/died-in-blogging-accident.html">Google Bomb</a> as the number of Google answers for "died in a blogging accident" exploded. Of course lots of bloggers felt very concerned and cited the picture while adding results of their own Google searches on the same principle. <a href="http://www.aidansean.com/died_in_a.html">That website</a>, and the <a href="http://forums.xkcd.com/viewtopic.php?f=7&t=17053&st=0&sk=t&sd=a">xkcd forum</a> show numerous attempts to find unusual dangerous activities.<br /><br />But couldn't we just submit Google a list of all English verbs, and let it tell us which one creates most deadly accidents? 
Of course, here comes the method I used, then the results.<br /><br />Step 1, <b>retrieving a list of all English verbs</b>. Quite painful, as you can see in these <a href="http://answers.google.com/answers/threadview?id=369191">404-ridden Google Answers</a>, or those <a href="http://www.englishforums.com/English/IsThereAnEnglishVerbsList/nclx/Post.htm">5 pages of outdated or useless answers in a forum</a>... I decided to trust my favorite search engine, and sent it a list of all verbs that went through my mind. Too bad, it replied with complete dictionaries, so I had to forbid some noun, <em>hat</em>, and eventually, on page 3 for <a href="http://www.google.fr/search?q=-hat+strike+give+abandon+wipe+rub+search+seek+hang+eat+adjust+draw+conclude+reappear+reconsolidate+create+destroy+dream+cut+put+drive&hl=fr&safe=off">-hat strike give abandon wipe rub search seek hang eat adjust draw conclude reappear reconsolidate create destroy dream cut put drive</a>, I got to a page of the VerbNet project with <a href="http://verbs.colorado.edu/old-framesets-10242006/">more than 3500 files named from verbs</a>. If you have better, just give your link in the comments!<br /><br />Step 2, <b>generating the present participles</b>. Verb+ing ? 
Yeah, but not exactly. I'm quite proud of the following spreadsheet formula, which almost always generates the correct form <span style="font-size:78%;">(to avoid making mistakes of course I split it into many cells, but it's just so impressive to read it entirely)</span> :<br /><span style="font-size:85%;"><span style="color: rgb(0, 0, 153);font-family:courier new;" >B1=IF(RIGHT(A1;1)="e";IF(LEFT(RIGHT(A1;2);1)="i";CONCATENATE(LEFT(A1;LEN(A1)-2);"ying");CONCATENATE(LEFT(A1;LEN(A1)-1);"ing"));IF(OR(RIGHT(A1;1)="d";RIGHT(A1;1)="g";RIGHT(A1;1)="m";RIGHT(A1;1)="n";RIGHT(A1;1)="p";RIGHT(A1;1)="t");IF(OR(LEFT(RIGHT(A1;2);1)="a";LEFT(RIGHT(A1;2);1)="e";LEFT(RIGHT(A1;2);1)="i";LEFT(RIGHT(A1;2);1)="o";LEFT(RIGHT(A1;2);1)="u");IF(OR(LEFT(RIGHT(A1;3);1)="a";LEFT(RIGHT(A1;3);1)="e";LEFT(RIGHT(A1;3);1)="i";LEFT(RIGHT(A1;3);1)="o";LEFT(RIGHT(A1;3);1)="u";AND(LEFT(RIGHT(A1;2);1)="e";RIGHT(A1;1)="n"));CONCATENATE(A1;"ing");CONCATENATE(A1;RIGHT(A1;1);"ing"));CONCATENATE(A1;"ing"));CONCATENATE(A1;"ing")))</span></span><br /><br />Ok, right, a little explanation. 
If the last letter is an "e" then: <ul><li>if the letter before is an "i", I transform into "ying" (<span style="font-size:85%;">die -> dying</span>)</li><li>otherwise, I delete the "e", and add "ing" (<span style="font-size:85%;">love -> loving</span>)</li></ul>otherwise: <ul><li>if the verb ends with "en", I just add "ing" (<span style="font-size:85%;">sharpen -> sharpening</span>)</li><li>otherwise, if the last letter is a "d", "g", "m", "n", "p" or "t", I double it if there is a vowel just before, which is not itself preceded by a vowel (<span style="font-size:85%;">bid -> bidding, put -> putting, but claim -> claiming, feed -> feeding</span>)</li><li>otherwise I just add "ing" (<span style="font-size:85%;">speak -> speaking</span>)</li></ul>I've created those rules intuitively. Apparently, to double the final consonant properly one has to <a href="http://www.englishclub.com/writing/spelling_add-ing.htm">check whether the last syllable is stressed or not</a>, so my version has a small number of exceptions; so far I have only identified verbs ending with "on" (<span style="font-size:85%;">abandon -> abandonning, d'oh, even if con -> conning is correct</span>).<br /><a href="http://philippe.gambette.free.fr/Blog/Xkcd/GoogleBlacklist.png"><img style="margin: 0px 0px 10px 10px; float: right; width: 200px;" alt="" src="http://philippe.gambette.free.fr/Blog/Xkcd/GoogleBlacklistMini.png" border="0" /></a> <p>Step 3, <b>put around each participle</b> <em>"died in a</em> on the left <em>(or "died in an</em> if the verb starts with a vowel) and <em>accident"</em> on the right, and <b>send each of those expressions to Google</b>, using my tool <span style="font-size:78%;">(in French, but it's not as if it wasn't super-intuitive)</span> <b><a href="http://freecorp.free.fr/programmes/FuryPopularity.exe">FuryPopularity</a></b>. 
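<br /><br />For readers who do not enjoy nested spreadsheet formulas, steps 2 and 3 can be sketched in a few lines of Python. This is only a transcription of the rules listed above, not the code of FuryPopularity; the names <code>present_participle</code> and <code>query_phrase</code> are made up for the illustration, and the sketch deliberately keeps the known flaw on verbs ending with "on".

```python
VOWELS = "aeiou"
DOUBLED = "dgmnpt"  # final consonants that the rules may double


def present_participle(verb):
    """Build the -ing form with the heuristic rules listed above."""
    if verb.endswith("ie"):
        return verb[:-2] + "ying"       # die -> dying
    if verb.endswith("e"):
        return verb[:-1] + "ing"        # love -> loving
    if verb.endswith("en"):
        return verb + "ing"             # sharpen -> sharpening
    if (len(verb) >= 3
            and verb[-1] in DOUBLED
            and verb[-2] in VOWELS
            and verb[-3] not in VOWELS):
        return verb + verb[-1] + "ing"  # bid -> bidding
    return verb + "ing"                 # speak -> speaking


def query_phrase(participle):
    """Wrap a participle into the exact-phrase query of step 3."""
    article = "an" if participle[0] in VOWELS else "a"
    return f'"died in {article} {participle} accident"'


for verb in ["die", "love", "sharpen", "bid", "claim", "abandon"]:
    print(verb, "->", query_phrase(present_participle(verb)))
```

Doing it properly would require stress information for each verb, which a plain list of verbs does not provide.<br /><br />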
I've just updated the program, because Google changed the style of its results, and apparently its spam detection is tougher: after 200 requests separated by 5 second intervals, it just blacklisted me, I could search back only after a <a href="http://en.wikipedia.org/wiki/Captcha">captcha</a>. Apparently 10 second intervals are ok. If you know anything about their detection algorithm I'm very interested: is it just about the frequency (if it is, do they have to identify proxys?) ? Do they carefully check the period?<br /><br />Here is the tagcloud of the words which happened to get more than one result:<br /><a href="http://philippe.gambette.free.fr/Blog/Xkcd/GoogleResultsTagCloud.png"><img style="margin: 0px auto 10px; display: block; text-align: center;" alt="" src="http://philippe.gambette.free.fr/Blog/Xkcd/GoogleResultsTagCloud.png" border="0" /></a>If you check the words which do not appear frequently, you unfortunately do not always find contestants for the <a href="http://en.wikipedia.org/wiki/Darwin_Awards"><strong>Darwin Awards</strong></a>. First, some parasite links from reactions about the xkcd picture, or animal deaths, but also some more annoying things: participial adjectives (<span style="font-size:85%;">amusing, embarrassing, interesting...</span>) and verbs which do not express an activity, rather circumstances (<span style="font-size:85%;">exploding, crushing, choking...</span>). For the latter, I have no solution. But it's quite easy to <span style="font-weight: bold;">remove the participial adjectives automatically</span>. 
Of course you can do it with a <a href="http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/">syntactic parser</a>, or even a dictionary but I prefer to go on with Google result numbers.<br /></p><p><a href="http://philippe.gambette.free.fr/Blog/Xkcd/DetectionAdjectifsVerbauxGoogle.png"><img style="margin: 0px 10px 10px 0px; float: left; width: 150px;" alt="" src="http://philippe.gambette.free.fr/Blog/Xkcd/DetectionAdjectifsVerbauxGoogle.png" border="0" /></a>I made a few tries before finding a nice criterion. Comparing the frequency of the participle form with the infinitive form (hoping it will be greater for participial adjectives) or computing the occurrence percentages of the participle just after "a", "more", or "most". On the graph on the left, <strong>the first 5 verbs give participial adjectives</strong>. We can see that the "a ..." strategy fails, because of the inclusion of participles into nouns: "a frying pan" explains why "a frying" is so frequent. Anyway <strong>"most ..." seems to help making the distinction</strong>:</p><p><a href="http://philippe.gambette.free.fr/Blog/Xkcd/DetectionAdjectifsDanger.png"><img style="margin: 0px auto 10px; display: block; text-align: center;" alt="" src="http://philippe.gambette.free.fr/Blog/Xkcd/DetectionAdjectifsDangerMini.png" border="0" /></a>Once those participial adjectives have been filtered, one can count not only the number of "died of a ... accident", but also "a ... accident", as well as the number of answers for the participle itself to get things like <strong>accident rates (blue) and death rates (red)</strong> :<a href="http://philippe.gambette.free.fr/Blog/Xkcd/AccidentDeathRateGoogle.png"><img style="margin: 0px auto 10px; display: block; text-align: center;" alt="" src="http://philippe.gambette.free.fr/Blog/Xkcd/AccidentDeathRateGoogleMini.png" border="0" /></a>If your hobby is not in the list, at least you have a basis to compare it. 
If it is, well, be careful, especially if you plan on jousting next weekend!<br /></p><br /><p><em>This post was originally published in French: </em><a href="http://gambette.blogspot.com/2008/01/danger-accidents-mortels.html"><em>Danger : accidents mortels !</em></a><br /><span style="font-size:78%;">As usual, the source files: </span><a href="http://philippe.gambette.free.fr/Blog/Xkcd/Xkcd369.ods"><span style="font-size:78%;">list of more than 3000 English verbs and their computed present participle</span></a><span style="font-size:78%;">, </span><a href="http://philippe.gambette.free.fr/Blog/Xkcd/AdjectifsVerbaux.ods"><span style="font-size:78%;">testing Google detection of participial adjectives</span></a><span style="font-size:78%;">, </span><a href="http://philippe.gambette.free.fr/Blog/Xkcd/GoogleResultsTotal.ods"><span style="font-size:78%;">results of Google requests</span></a><span style="font-size:78%;">.</span></p>Philippehttp://www.blogger.com/profile/17811557333070553722noreply@blogger.com3tag:blogger.com,1999:blog-5371783716552436400.post-5517400437899253982008-01-17T01:06:00.001+01:002008-01-17T01:52:54.786+01:00Britney-Amy : Celebrity Deathmatch!<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://philippe.gambette.free.fr/Blog/BritneyAmy/BritneyAmy.jpg"><img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/BritneyAmy/BritneyAmy.jpg" alt="" border="0" /></a>Discovered on French television last week, <a href="http://www.whenwillamywinehousedie.com/">WhenWillAmyWinehouseDie.com</a> and <a href="http://www.whenisbritneygoingtodie.com/">WhenIsBritneyGoingToDie.com</a> provide an interesting challenge: <span style="font-weight: bold;">guess when the two divas will die, the closest wins an iPod Touch, or a PS3</span>. Huge buzz, of course, thousands of people went there to take a chance and leave a pre-condolences message. 
Both sites are of course optimized to make money with ads (contrary to this more confidential game, which is just as sweet though: the "<a href="http://www.asccm.com/asccm/topmort/votes.html">TopMort</a>", where you can pick people you think will die within the year, and who have not been ), and only provide raw data entered by people who signed. No stats at all, what a shame.<br /><br />I was very lucky that <a href="http://www.biologie.ens.fr/dyogen/spip.php?article88">Matthieu Muffato</a>, a friend who happens to be an impressive <a href="http://en.wikipedia.org/wiki/Python">Python</a> expert, used a few lines of code and some hours of execution to retrieve the data and mail it to me.<br /><br />The initial question I had about it was simple: <b>what is the biggest time interval not yet chosen</b>, which would a priori <b>maximize the chance to win</b>? By "a priori", I mean assuming that all time intervals of a given length are equally dangerous for Amy and Britney, and equally likely to be chosen by other visitors.<br /><br />Unfortunately, those ideal conditions are far from being true in the real world, for a very simple reason: the visitor <b>wants his iPod or PS3 right now, not in 30 years!</b> So if you wish to target a month that has not yet been chosen, for Britney, you will have to wait for February 2023. For Amy, there have been fewer voters so far, so if nothing has changed since the data was retrieved, November 2016 is still available, or you can try the year 2031, as only October was chosen then. I must add, as Matthieu told me, that <b>those websites contain no date after January 2038</b>, probably because of some <b>date coding problem</b>. 
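<br /><br />To make both points concrete, here is a small Python sketch. It assumes that the sites store dates as signed 32-bit Unix timestamps, which is only a guess that happens to fit the January 2038 cut-off, and it looks for the longest run of days nobody picked. The function name <code>largest_free_interval</code> and the sample dates are invented for the illustration; they are not the data Matthieu retrieved.

```python
from datetime import date, datetime, timedelta, timezone

# Last day representable with a signed 32-bit Unix timestamp
# (assumption: this is the "date coding problem" behind the 2038 limit).
LIMIT = datetime.fromtimestamp(2**31 - 1, tz=timezone.utc).date()


def largest_free_interval(chosen, start, end=LIMIT):
    """Return (first_free_day, last_free_day) for the longest run of days
    in [start, end] that nobody picked, or None if every day is taken."""
    taken = sorted(d for d in set(chosen) if start <= d <= end)
    best = None
    cursor = start  # earliest day not yet examined
    # The day after `end` acts as a sentinel closing the last gap.
    for d in taken + [end + timedelta(days=1)]:
        if d > cursor:
            gap = (cursor, d - timedelta(days=1))
            if best is None or gap[1] - gap[0] > best[1] - best[0]:
                best = gap
        cursor = d + timedelta(days=1)
    return best


# Hypothetical guesses, only to show the call.
guesses = [date(2008, 1, 20), date(2008, 1, 25), date(2008, 1, 26)]
print(LIMIT)
print(largest_free_interval(guesses, date(2008, 1, 17), date(2008, 1, 31)))
```

The same 32-bit assumption would also explain a pile of votes on 01/01/1970, which is timestamp zero.<br /><br />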
Now let's move on to more serious stuff: here is an overview of the number of votes per month <span style="font-size:78%;">(with a simple vertical normalization for Amy, who received fewer votes; sorry for the title in French...)</span>:<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://philippe.gambette.free.fr/Blog/BritneyAmy/BritneyAmyMois.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/BritneyAmy/BritneyAmyMoisMini.png" alt="" border="0" /></a><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://philippe.gambette.free.fr/Blog/BritneyAmy/BritneyLogLog.png"><img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/BritneyAmy/BritneyLogLogMini.png" alt="" border="0" /></a>I guess you are as flabbergasted as I was when the curves appeared: <span style="font-weight: bold;">they are almost identical!</span> The correlation coefficient equals 0.98, and we get the same <a href="http://en.wikipedia.org/wiki/Power_law">power law</a>! We can check that it is indeed a power law using a log-log scatter plot, where a power law shows up as a straight line. It gives us approximately the equation <i>Y</i> = 4 - (4/3)<i>X</i> in logarithmic (base 10) coordinates, that is, back in linear coordinates, <i>y</i> = 10 000 × <i>x</i>^(-4/3), which is the equation of the pale blue curve.<br /><br />In fact, power laws are everywhere in real data (especially in <a href="http://en.wikipedia.org/wiki/Scale-free_network">scale-free networks</a>, which have a <a href="http://www.nd.edu/%7Enetworks/Publication%20Categories/01%20Review%20Articles/ScaleFree_Scientific%20Ameri%20288,%2060-69%20%282003%29.pdf">power law degree distribution</a>). What is surprising here is that both laws have approximately the same parameters. 
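<br /><br />The log-log check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic and noise-free, generated from a made-up power law (the constants 500 and 1.5 are hypothetical, they are not the vote data), and <code>fit_log_log</code> is a helper name of my own. The idea is that if <i>y</i> = <i>C</i> × <i>x</i>^(-<i>k</i>), then log10(<i>y</i>) = log10(<i>C</i>) - <i>k</i> × log10(<i>x</i>), so an ordinary least-squares line fitted in logarithmic coordinates recovers both parameters:<br />

```python
import math


def fit_log_log(xs, ys):
    """Least-squares line Y = intercept + slope * X, with X = log10(x), Y = log10(y)."""
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical, noise-free data: y = 500 * x^(-1.5) for months 1 to 60.
months = list(range(1, 61))
votes = [500.0 * m ** -1.5 for m in months]

intercept, slope = fit_log_log(months, votes)
print(f"Y = {intercept:.3f} + ({slope:.3f}) X")
print(f"back in linear coordinates: y = {10 ** intercept:.1f} * x^({slope:.3f})")
```

On real vote counts the points would scatter around the line, and months with zero votes would have to be dropped before taking logarithms.<br /><br />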
If we check the details, however, we can notice that voters preferred 2008 for Britney and 2009 for Amy.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://philippe.gambette.free.fr/Blog/BritneyAmy/AmyProportionsMensuelles.png"><img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/BritneyAmy/AmyProportionsMensuellesMini.png" alt="" border="0" /></a>By checking the curves carefully, one also notices some kind of periodicity. At least they are not monotone, and I've put on the left a representation of the percentage of votes per month, each year from 2008 to 2013, for Miss Winehouse. Variations are quite strange, as August attracts twice as many voters as November! I don't have any explanation for the lower scores of November, December and February. It may be a mechanism similar to what Knuth describes in one of the first exercises of <a href="http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming">Volume 2</a>: ask a friend (or an enemy) for a random digit, and he will most likely say 7.<br /><br />Here is the representation of<span style="font-weight: bold;"> the choices per day, for any year</span>. I've removed January 1, which was artificially high (due to the date coding problem, which gave a lot of 01/01/1970).<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://philippe.gambette.free.fr/Blog/BritneyAmy/BritneyAmyJour.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/BritneyAmy/BritneyAmyJour.png" alt="" border="0" /></a>We can observe a new surprising periodicity phenomenon: voters prefer <span style="font-weight: bold;">the middle of the month</span>. Note also the vicious voters who chose <span style="font-weight: bold;">February 14</span>, poor Britney! 
Even the dot of her birthday, December 2, is quite high compared to its neighbors...<br /><br />So to forget about those sad things, let's end with <a href="http://www.dailymotion.com/video/x16ddv_britney-spears-everytime_family">emotion and poetry</a>: here are the <span style="font-weight: bold;">pre-condolences tag clouds</span> (made with <a href="http://freecorp.free.fr/FRA/programmesdivers.htm">Freecorp TagCloud Builder</a>) for both stars.<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://philippe.gambette.free.fr/Blog/BritneyAmy/BritneyAmyTagClouds.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://philippe.gambette.free.fr/Blog/BritneyAmy/BritneyAmyTagClouds.png" alt="" border="0" /></a><span style="font-size:100%;"><br />This post originally appeared in French: <a href="http://gambette.blogspot.com/2008/01/britney-amy-duel-mortel.html">Britney-Amy, duel mortel</a>.</span><br /><span style="font-size:78%;">Vote spreadsheet files: <a href="http://philippe.gambette.free.fr/Blog/BritneyAmy/BritneyAmyDays.ods">by day</a>, <a href="http://philippe.gambette.free.fr/Blog/BritneyAmy/BritneyAmyMonths.ods">by month</a>; <a href="http://www.lirmm.fr/%7Egambette/PersoContact.php">contact me</a> if you would like to get other source files.<br /></span>Philippehttp://www.blogger.com/profile/17811557333070553722noreply@blogger.com0tag:blogger.com,1999:blog-5371783716552436400.post-29867415672257599472008-01-16T20:07:00.001+01:002008-03-25T23:54:18.579+01:00What does veronising mean?Well, to get some idea of what <b style="font-style: italic;">veronising</b> is, maybe you should check <a href="http://feeds.feedburner.com/aixtal-en">Jean Veronis's blog</a>. My definition would be "<span style="font-weight: bold; font-style: italic;">to design and publish on a blog programs or methods that help analyze data</span>". 
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://aixtal.blogspot.com/2007/01/sarko-sur-un-nuage.html"><img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 200px;" src="http://sites.univ-provence.fr/veronis/blog-images/nuage-sarko-2007-01-14.png" alt="" border="0" /></a>Jean has created a whole bunch of useful tools, which work mainly on texts (he is a researcher in natural language processing) or internet corpora (search engine results, for example). Among the most impressive are the <a href="http://aixtal.blogspot.com/2006/01/outil-le-nbuloscope.html"><span style="font-style: italic;">Nébuloscope</span></a>, which makes tag clouds out of words appearing frequently in the results of a search engine request, and the <span style="font-style: italic;">Chronologue</span>, which used to plot the evolution of a keyword's use on the internet (it used the "date" function of a search engine which has now disappeared).<br /><br />Inspired by his impressive results, I've started to analyze data I find interesting myself, and to program some little tools to help me do that. I may translate some of my previous posts; in the meantime, here are some topics I've worked on, with links to the French posts until they are translated into English.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://gambette.blogspot.com/2008/01/tag-cloud-tag-tree-nuage-arbor-2-les.html"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 250px;" src="http://philippe.gambette.free.fr/Blog/Voeux/Voeux2008TreeCloud.png" alt="" border="0" /></a><i style="font-weight: bold;">Phylogenetic trees</i> are used to represent the evolution of species, based on the idea that species close to each other will appear in the same subtree, and a lot of algorithms exist to build them from biological data. 
But phylogenetic trees are also an excellent means of visualizing data, and I've tried building the <span style="font-weight: bold;">trees of country votes at the </span><a style="font-weight: bold;" href="http://gambette.blogspot.com/2006/05/eurovision-et-gopolitique.html">Eurovision song contest</a>, <a href="http://gambette.blogspot.com/2007/01/arbre-phylogntique-des-dputs.html"><span style="font-weight: bold;">French "députés"</span> (our congressmen) according to how similarly they vote</a> (as well as a <a href="http://gambette.blogspot.com/2007/02/la-puce-adn-des-dputs.html"><span style="font-weight: bold;">DNA chip visualization</span> of those votes</a>), and more recently I've been working on building what I call a <a style="font-weight: bold;" href="http://gambette.blogspot.com/2008/01/tag-cloud-tag-tree-nuage-arbor-2-les.html">"tree cloud" from a text</a>: the same idea as a tag cloud, except that the words are not in alphabetical order but displayed as the leaves of a tree. Until the program is finished, I still rely on <a href="http://gambette.blogspot.com/2006/10/nuages-de-mots-artisanaux.html"><span style="font-weight: bold;">tag clouds</span> (with nice colors and a logarithmic scale</a>, pleaaase, not those ugly and inexpressive ones we often find on the internet!). 
I've tried using them to analyze someone's writing style (with <a style="font-weight: bold;" href="http://gambette.blogspot.com/2006/08/connais-toi-toi-mme.html">instant messaging logs</a>) or speaking style (with <a style="font-weight: bold;" href="http://gambette.blogspot.com/2008/01/sarkozy-lorateur-2-dcryptage-de-limpro.html">the prepared version and the delivered version of a press conference speech by President Sarkozy</a>).<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://gambette.blogspot.com/2007/02/la-puce-adn-des-dputs.html"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px;" src="http://philippe.gambette.free.fr/Blog/VotesDeputesMicroarray.bmp" alt="" border="0" /></a><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://gambette.blogspot.com/2006/11/la-naissance-du-web-daprs-les-moteurs.html"><img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 200px;" src="http://philippe.gambette.free.fr/Blog/AnneesMoteurs.jpg" alt="" border="0" /></a>I like doing some <span style="font-weight: bold;">search engine statistics</span>, to <a href="http://gambette.blogspot.com/2006/09/taguer-tagguer-ou-tagger.html">help with spelling</a>, <a href="http://gambette.blogspot.com/2006/11/la-naissance-du-web-daprs-les-moteurs.html">visualize and <span style="font-weight: bold;">date the birth of the web</span></a>, or <a href="http://gambette.blogspot.com/2007/02/stats-de-popularit-artisanales.html">send massive requests to <span style="font-weight: bold;">compare the popularity of people or concepts</span></a>. 
Those statistical analyses often rely heavily on spreadsheet programs, which also helped me to <a style="font-weight: bold;" href="http://gambette.blogspot.com/2007/10/dissection-dune-ptition-1.html">track the evolution of a petition</a>, which gave me a glimpse of <a href="http://gambette.blogspot.com/2007/10/dissection-dune-ptition-2-quelle-heure.html">the time of day people <span style="font-weight: bold;">connect to the internet depending on their job</span> (students, teachers, engineers...)</a>. <a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://gambette.blogspot.com/2006/10/mcdonalds-macdo-mac-donalds-et-vorono.html"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 150px;" src="http://philippe.gambette.free.fr/Blog/VoronoiMcDo.jpg" alt="" border="0" /></a>I could also get <a href="http://gambette.blogspot.com/2007/05/bilan-des-sondages-de-2002.html">nice <span style="font-weight: bold;">summary pictures of French polls</span> before the first round of the presidential election, in 2002</a> and <a href="http://gambette.blogspot.com/2007/04/bilan-des-sondages-du-premier-tour.html">2007</a>. 
I'm very interested in informative and original visualizations, like <span style="font-weight: bold;">Voronoi diagrams</span> (for <a href="http://gambette.blogspot.com/2006/10/mcdonalds-macdo-mac-donalds-et-vorono.html">McDonald's restaurants in Paris</a>) or <span style="font-weight: bold;">metro map views</span> (<a href="http://gambette.blogspot.com/2007/02/visualisationmtro-est-gi-complet.html">building them from a genuine metro map is a GI-complete problem</a>).<br /><br />I also analyzed a <span style="font-weight: bold;">blog meme</span> last year, the "<a style="font-weight: bold;" href="http://moblogsmoproblems.blogspot.com/2006/12/revenge-of-z-lister.html">Z-list</a>", which in France appeared as <a href="http://gambette.blogspot.com/2007/03/analyse-du-buzz-f-list-de-la-blogosphre.html">"la F-list"</a>. <a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://gambette.blogspot.com/2008/01/comparer-les-courbes-de-buzz-avec-le.html"><img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 200px;" src="http://philippe.gambette.free.fr/Blog/CaptuCourbe/CaptucourbesBruniManaudou.png" alt="" border="0" /></a>Even though I did not publish my data on the "Z-list", I still have the files, as well as <a href="http://philippe.gambette.free.fr/AUTOGRAPH/Gambette%20-%20Z%20list,%20F%20list,%20diffusion%20d%27un%20meme%20sur%20la%20blogosphere.pdf">the "infection tree"</a>, on my computer somewhere. 
This year I've created a little utility, the "<a style="font-weight: bold;" href="http://gambette.blogspot.com/2008/01/comparer-les-courbes-de-buzz-avec-le.html">CaptuCourbe</a>", to <span style="font-weight: bold;">put data from the picture of a curve into a spreadsheet file</span> (some "unscan" programs do this, but they are quite complicated to use, or expensive), which helps <span style="font-weight: bold;">compare the evolution of a buzz across several buzz tracking systems</span> (Google Trends, Technorati, site stats systems...). Currently the program is in French only, but <a href="http://aixtal.blogspot.com/2008/01/tool-analyzing-buzz-with-captucourbes.html">Jean motivated me to translate it into English</a>, which will soon be done.<br /><br />And you will never guess the topic of my most visited blog post, which is not the one I'm most proud of: I had noticed a bug on a French TV channel's website which gave access to the channel live on the internet. It lasted about 3 days, but since then Google has been sending me everyone who wants to <a href="http://gambette.blogspot.com/2007/03/regarder-m6-en-direct-sur-internet.html">watch "M6" on the web</a>. I've put links to other French channels which can be viewed for free anyway, to avoid frustration.<br /><br />See you soon for some new computer-powered experimentations!Philippehttp://www.blogger.com/profile/17811557333070553722noreply@blogger.com0