Veronising...: freeware

Showing posts with label freeware. Show all posts

Monday, March 24, 2008

Reverse engineering Google Trends (2): margin of error

As I wrote in my last post, today I'll be quite technical with the margin of error of my computation. And some considerations to try to mimimize this error in the end of my post. Last time on Veronising: I chose a hierarchy of terms which have higher and higher Google Trends curves, to evaluate, by a sequence of rules of three, the frequence of Google searches for the highest term compared to the least looked for.

For the computation I intuitively chose the words such that for each pair of consecutive ones, the first one had a peak approximately twice as high as the second one. The margin of error when reading the value of the curves is approximately 1 pixel, but this absolute error is not the same relative error on the higher, and the lower curve. The higher one always peaks at 113 pixels: 1 pixel of error is less than 1% here. However if the lower one peaks at 50 pixels, it will be a 2% error. If the curve is never over 3 pixels, then the error is more than 30%! So do we have to choose a hierarchy of curves very close to each other? Not necessarily, because in this case we may indeed reduce the error at each step of the computation, but we increase the number of steps (thus, the number of errors) between the least and the most seached.

I couldn't help but mathematically modeling this delicate balance that I've just expressed in a sentence. I call a the ration between the max of the highest and the lower curbe within a pair of consecutive ones (thus a>1). To simplify the problem, I consider that this ratio is constant in my whole scale of words. Then, ideally, I would like to find a word 1 searched x times a day on Google, a word 2 searched ax times, a word 3 searched a²x times... a word n+1 searched aⁿx times.

Now, let's compute the error: instead of reading a height of k pixels for a word, and ak=113 for the next one, say I make an error of 1 pixel, each time too high (this is a pessimistic assumption, actually the error probably alternates, once one reads too high, once too low...). In my computation, without error with the rule of 3 I should find as the number of searches for the highest term:

x.113/k = x.ak/k = xa

The problem is my 1 pixel error, so when I apply the rule of 3 I get in fact:

x.113/(k+1) = x.113/(113/a+1) = x.113a/(113+a)

Thus at each step I multiply by 113a/(113+a) instead of multiplying by a, so for the most searched word, I find x(113a/(113+a))ⁿ instead of xaⁿ. I underestimate the real value, so to minimize the error I must find the a>1 that maximizes x(113a/(113+a))ⁿ.

Second part of the computation: the number of steps, that is n+1 words, of course... but this n depends on a. Indeed we consider that the least searched word (x times) and the most searched one (x'=xaⁿ times) are fixed. Then x'=xe^{n ln a}, so ln(x'/x)=n ln a and finally n=ln(x'/x)/ln a.

We put this into the upper formula, so we underestimated all the words of the hierarchy, and the highest was evaluated to:

x(113a/(113+a))^{ln(x'/x)/ln a}

which we now have to maximize according to a. A quick analysis of this function at its limits shows that it tends to 0 in 1⁺, and to 1 in +∞. Very well, it expresses the dilemma I was mentioning in the 2nd paragraph. However it doesn't give us where the max is reached, and neither Ahmed the Pysicist, nor Julian the Mathématician, helped respectively with Mathematica and Maple, could give me a nice formula, there are still some ugly RootOf(...) in the formula.

No problem, we'll just find an approximation using Open Office Spreadsheet. The file is there, and here is the curve obtained for a ration of 20,000 between the most searched and the least searched word (the figure approximately corresponds to what I found for my hierarchy):

So the minimal error is reached for a approximately equal to 2.75 (i.e. a maximal height of 41 pixels for the lower curve). Then it's less than 25%. Of course it seems a lot, but remember the remark on how pessimistic this scenario was, with errors cumulating by successive underestimations. I still have this interesting theoretical question: is it possible to compute the expectancy of the error on the computed value of the most searched word, if at each step the error randomly varies between -1 and +1 pixel? ?

One can also notice the curve increases a bit faster on the left than on the right. As shown in green on the graph, it seems that we'd better choose a hierarchy such that consecutive reference words have a number of searches ratio of 4 rather than a ration of 2.

Now, here are some other hints to improve the accuracy of the computation. First, measure accuracy: instead of just measuring the maximum, where we know there is an inevitable error, we can try to compute it from measures with less errors. I come back to my example from the previous post with cat, dog, and phone:
Comparison cat ~ dog (curve 1) : 65 px ~ 113 px
Comparison dog ~ phone (curve 2) : 69 px ~ 113 px

Except that instead of measuring the maximum of dog, we can evaluate it the following way: do the average of the values of the curve 1 for dog, the average of the values of the curve 2 for dog. Then deduce a very accurate scale change. Finally multiply the maximum of dog on the curve 1 (that is exactly 113 pixels, no error here) by this scale change!

Another problem now: how to obtain the average of the values of a Google Trends curve? With the CaptuCourbe, of course! Be careful here: some values may not be retrieved by the CaptuCourbe (color problem, for example the curve is cut by a vertical black line hanging from a Google News label bubble). So you have to compute the average of the curves on values you really managed to retrieve!

One more thing, the CaptuCourbe is not very accurate because it gets the values of all pixels of some color from a curve, and computes the average, for each column of retrieved values. I've developed a new version, online soon, which allows to get the maximum of the heights of pixels of some color. I'm using this functionality in my method to compute the max, however it's still the average choice I make to get the average of the curves. This is not a small detail, as you can see on the Britney Spears Google Trends curve, that I extracted in both ways:

A 20% error in the measure of many peaks using the pixels of the same color is really something!

So, to close this series of posts on the vertical scale of Google Trends, I still have some questions left. First, get a "value of the foo" in the number of daily searches. Then I could try to program the whole chain of curve retrieval, measures, and computations, as described in my first post, to provide a utility which would add the vertical scale to a Google Trends curve. Anyway don't expect too much, I'd better wait and see whether the API Google is preparing will provide this data.

Estimating the number of searches for some keyword is still a nice challenge., I've discovered GTrends Made Easy, a freeware which gives some estimations computed with a method similar to mine here (in fact he does only 1 rule of three, comparing the request word with a reference word for which he knows the number of Google searches, approx 500 ; that is words which appear between 5 and 50000 times a day, that is less than 100 foo), which was described on this YouTube video.

This post was originally published in French: Rétroingéniérie de Google Trends (2) : marge d'erreur.

Monday, March 10, 2008

Reverse engineering Google Trends (1)

Last December I started to create a simple program to retrieve the values of a curve from a picture the CaptuCourbe, which is still not translated in English, but has an English tutorial. One of the possible use of this free software is retrieving and comparing Google Trends curves. Except Google Trends curves have a major problem: the vertical scale is not hidden! On top of that there is no zooming tool, so we can't directly compare curves of drastically different heights. The maximum height of a curve is always 113 pixels, so you won't be able to know if a word has been searched 1000 or 10.000 more than another.

Here is a hierarchy of English words, in a decreasing order considering their Google searches according to Google Trends : of, free, sex, car, dog, gun, muscle, knife, torn, filming, separating, fooling.

They can be used to create a scale for Google Trends. It may not be very accurate, but would still be useful to get quantitative values. To compute it, I google-trended pairs of successive words in the hierarchy above. This gives me the scale change for each pair, by measuring the height (in pixels) of the maximum of each curve. Here is a picture to explain what I mean:

As I do that for successive words, I get values like this:
Comparison cat ~ dog : 65 px ~ 113 px
Comparison dog ~ phone : 69 px ~ 113 px
thus I can deduce by a subtle use of the rule of three:
cat ~ dog ~ phone : 65 ~ 113 ~ 113*113/69=185,06
considering the scale of the first line or:
cat ~ dog ~ phone : 69*65/113=39,69 ~ 69 ~ 113
with the scale of the second one.

I did this computation for all 11 words to get the following maximum values, where I defined the reference as the maximum of fooling. Of course, I call this new unit the foo:

fooling : 1 foo

separating : 2,5 foo

filming : 6,3 foo

torn : 18 foo

knife : 58 foo

muscle : 120 foo

gun : 240 foo

dog : 640 foo

car : 1500 foo

sex : 3200 foo

free : 6600 foo

of : 16500 foo

Be careful, what you have to remember is not only those different values, but also the position of the maximum which reaches those values, that's why each word above links to a picture of the curve to localize its maximum value. Indeed if you want to determine the value of a peak for a new word, either you understood this rule of three principle and then you can have fun computing it directly, or you just use the CaptuCourbe, with the reference curve whose max is just above the peak you want to compute:

For example here about 800 foo for Manaudou in December 2007, to compare with the 240 foo of the Bruni peak, the 470 foo reached by Obama, the 1000 foo of Britney the 3200 foo of the tsunami de 2004 and the 5700 foo of... Janet Jackson after Superbowl 2004!

Now, let's get to the bad news:
- the error propagated by applying 10 times the rule of three will be the topic of my next post, quite technical (there will even be a pretty nice equation that neither Maple nor Mathematica can simplify)... just consider that the numbers above must be accurate +/- 10%.
- the Google Trends curves vary a lot (maybe it's just a discretization problem, but in this case it's quite strange that the Google News discretization below is the same), as you can see on this animated gif (created with the great and simple UnFreez) :

So be careful if you use one of those reference words: you have to remember the value of the peak, its position, but you may also want to superimpose the reference curve that I linked to the word, to check that the reference curve in the picture you're using has its max at the same place, and has the same scale. Try to correct it if it's not the case.
- the scale remains relative, to get an absolute one the question would be: how many Google requests is 1 foo? After my post in French, I got some pretty good comments on this idea, I sum them up here. First we have to be careful that the curves don't show the number of searches, but just the proportion of searches for a word among all searches in some period of time. This would explain why the Janet Jackson buzz was so high, it's difficult to compare the number of searches corresponding to 5700 foo in 2004 to 800 foo today. Anyway it's still possible to get an idea of the proportion from the number of searches, by trying to find data on the evolution of the number of Google searches in the past years, this must exist on the web (Alexa, at least...). Let's be more accurate about these values: on the 2004-2008 pictures, as I said I have no idea how the discretization is made, however on the yearly or monthly pictures, it's quite clear that we find, respectively, the weekly and the daily numbers. So what I'm looking for right now is, for some word, the number of searches it corresponds to. Elandrael had the brilliant idea of using Google Adwords stats, to get at least a lower bound on this number. For the moment I only got one Google Adword to apply this idea, which would show that a one foo peak corresponds to more than 20000 searches in a month, that is more than 4000 searches if we look at the weekly value in a yearly curve. So of course I would love to get some other statistics like this to confront the data, just contact me privately if you don't want to write on this blog the Adword you're paying for and its stats. On the same principle, you can also contact me if you have the stats for a common word in which your website appears as the first Google Answer.

This post was originally published in French: Rétroingéniérie de Google Trends.

Source files: the Google Trends curves of each word are linked above, here is the spreadsheet file that I used to compute the values in foo (it's quite a mess though, more details to understand it in my next post).

Sunday, January 27, 2008

Danger: deadly hobbies!

I'm not familiar with the American blogosphere (I hope blogging in English will help discovering it), but there is a blog there I often visit, xkcd, full of witty or funny comics... for a somewhat restricted audience (I mean as geek as their author, Randall Munroe, who also created some other pretty nice stuff).

I especially enjoyed one of his latest drawings which uses Google result numbers, as I've already done for spelling, congressmen celebrity, or the birthdate of the web :

This picture created a slashdotted Google Bomb as the number of Google answers for "died in a blogging accident" exploded. Of course lots of bloggers felt very concerned and cited the picture while adding results of their own Google searches on the same principle. That website, and the xkcd forum show numerous attempts to find unusual dangerous activities.

But couldn't we just submit Google a list of all English verbs, and let it tell us which one creates most deadly accidents? Of course, here comes the method I used, then the results.

Step 1, retrieving a list of all English verbs. Quite painful, as you can see in these 404-ridden Google Answers, or those 5 pages of outdated or useless answers in a forum... I decided to trust my favorite search engine, and sent it a list of all verbs that went through my mind. Too bad, it replied with complete dictionaries, so I had to forbid some noun, hat, and eventually, on page 3 for -hat strike give abandon wipe rub search seek hang eat adjust draw conclude reappear reconsolidate create destroy dream cut put drive, I got to a page of the VerbNet project with more than 3500 files named from verbs. If you have better, just give your link in the comments!

Step 2, generating the present participles. Verb+ing ? Yeah, but not exactly, I'm quite proud of the following spreadsheet formula which generates almost always the correct form (to avoid making mistakes of course I split it into many cells, but it's juste so impressive to read it entirely) :
B1=IF(RIGHT(A1;1)="e";=IF(LEFT(RIGHT(A1;2);1)="i";CONCATENATE(LEFT(A1;LEN(A1)-2);"ying");CONCATENATE(LEFT(A1;LEN(A1)-1);"ing"));=IF(OR(RIGHT(A1;1)="d";RIGHT(A1;1)="g";RIGHT(A1;1)="m";RIGHT(A1;1)="n";RIGHT(A1;1)="p";RIGHT(A1;1)="t");=IF(OR(LEFT(RIGHT(A1;2);1)="a";LEFT(RIGHT(A1;2);1)="e";LEFT(RIGHT(A1;2);1)="i";LEFT(RIGHT(A1;2);1)="o";LEFT(RIGHT(A1;2);1)="u");=IF(OR(LEFT(RIGHT(A1;3);1)="a";LEFT(RIGHT(A1;3);1)="e";LEFT(RIGHT(A1;3);1)="i";LEFT(RIGHT(A1;3);1)="o";LEFT(RIGHT(A1;3);1)="u";AND(LEFT(RIGHT(A1;2);1)="e";RIGHT(A1;1)="n"));CONCATENATE(A1;"ing");CONCATENATE(A1;RIGHT(A1;1);"ing"));CONCATENATE(A1;"ing"));CONCATENATE(A1;"ing")))

Ok, right, a little explanation. If the last letter is an "e" then:

if the letter before is an "i", I transform into "ying" (die -> dying)
otherwise, I delete the "e", and add "ing" (love -> loving)

otherwise:

if the verb ends with "en", I just add "ing" (sharpen -> sharpening)
otherwise, if the next-to-last letter is a "d", "g", "m", "n", "p", "t", I double it if there is a vowel just before, which is not preceded by a vowel (bid -> bidding, put -> putting, but claim -> claiming, feed -> feeding)
otherwise I just add "ing" (speak -> speaking)

I've created those rules intuitively, apparently to double the final consonant one has to check whether the last syllabus is stressed or not, my version has a tiny number of exceptions, I just identified verbs ending with "on" (abandon -> abandonning, d'oh, even if con -> conning is correct).

Step 3, put around each participle "died in a on the left (or "died in an if the verb starts with a vowel) and accident" on the right, and send each of those expressions to Google, using my tool (in French, but it's not as if it wasn't super-intuitive) FuryPopularity. I've just updated the program, because Google changed the style of its results, and apparently its spam detection is tougher: after 200 requests separated by 5 second intervals, it just blacklisted me, I could search back only after a captcha. Apparently 10 second intervals are ok. If you know anything about their detection algorithm I'm very interested: is it just about the frequency (if it is, do they have to identify proxys?) ? Do they carefully check the period?

Here is the tagcloud of the words which happened to get more than one result:
If you check the words which do not appear frequently, you unfortunately do not always find contestants for the Darwin Awards. First, some parasite links from reactions about the xkcd picture, or animal deaths, but also some more annoying things: participial adjectives (amusing, embarrassing, interesting...) and verbs which do not express an activity, rather circumstances (exploding, crushing, choking...). For the latter, I have no solution. But it's quite easy to remove the participial adjectives automatically. Of course you can do it with a syntactic parser, or even a dictionary but I prefer to go on with Google result numbers.

I made a few tries before finding a nice criterion. Comparing the frequency of the participle form with the infinitive form (hoping it will be greater for participial adjectives) or computing the occurrence percentages of the participle just after "a", "more", or "most". On the graph on the left, the first 5 verbs give participial adjectives. We can see that the "a ..." strategy fails, because of the inclusion of participles into nouns: "a frying pan" explains why "a frying" is so frequent. Anyway "most ..." seems to help making the distinction:

Once those participial adjectives have been filtered, one can count not only the number of "died of a ... accident", but also "a ... accident", as well as the number of answers for the participle itself to get things like accident rates (blue) and death rates (red) :If your hobby is not in the list, at least you have a basis to compare it. If it is, well, be careful, especially if you plan on jousting next weekend!

This post was originally published in French: Danger : accidents mortels !
As usual, the source files: list of more than 3000 English verbs and their computed present participle, testing Google detection of participial adjectives, results of Google requests.

Wednesday, January 16, 2008

What does veronising mean?

Well, to get some idea of what veronising is, maybe you should check Jean Veronis's blog. My definition would be "to design and publish on a blog programs or methods able to help analyzing data".

Jean has created a whole bunch of useful tools, which work mainly on texts (he is a researcher in natural language processing) or internet corpuses (search engines results for example). Among the most impressive, the Nébuloscope, which makes tag clouds out of words appearing frequently in the results of a search engine request, or the Chronologue, which used to draw the evolution of a keyword use on the internet (it used the "date" function of a search engin which has now disappeared).

Inspired by his impressive results, I've started to analyze data I find interesting myself, and program some little tools to help me do that. I may translate some of my previous posts, here are some topics I've worked on, I put the links to French posts until they are translated to English.

Phylogenetic trees are used to represent the evolution of species, based on the idea that some species close to each other will appear in a same subtree, and a lot of algorithms exist to build them from biology data. But phylogenetic trees are also an excellent mean of visualizing data, and I've tried building the trees of country votes at the Eurovision song contest, French "députés" (our congressmen) according to their proximity of votes (as well as a DNA chip visualization of those votes), and more recently I've been working on building what I call a "tree cloud" from a text, the same idea than a tag cloud except the order of the words is not alphabetical, but they are displayed as leaves of a tree. Until the program is finished, I still rely on tag clouds (with nice colors and a logarithmic scale, pleaaase, not those ugly and unexpressive ones we often find on the internet !). I've tried using them to analyze one's writing style (with instant messaging logs) or speaking style (with the planned version and the pronounced version of a press conference talk by President Sarkozy).

I like doing some search engine statistics, to help spelling, visualize and date the birth of the web, or send massive requests to compare popularity of people or concepts. Those stats analyzes often make critical use of spreadsheet programs, which also helped me to track the evolution of a petition, which gave me a glance on the time of the day people connect to the internet depending on their job (students, teachers, engineers...).

I could also get nice synthesis pictures of French polls before the first round of the presidential election, in 2002 and 2007. I'm very interested in informative and original visualizations, like Voronoi diagrams (for McDonald's restaurants in Paris) or metro map views (building them from a genuine metro map is a GI-complete problem).

I have also analyzed a blog meme last year, the "Z-list", which in France appeared as "la F-list".

Even if I did not publish my data on the "Z-list", I still have the files, as well as the "infection tree", on my computer somewhere. This year I've created a little utility, the "CaptuCourbe", to put data from the picture of a curve into a spreadsheet file (some "unscan" programs do this but they are quite complicated to use, or expensive), which helps comparing the evolution of a buzz on many buzz tracking systems (Google Trends, Technorati, site stats systems...). Currently the program is in French only, but Jean motivated me to translate it to English, which will soon be done.

And you will never guess the topic of my most visited blog post, which I'm not the most proud of: I had noticed a bug on some French TV channel website which gave access to the channel live on the internet. It lasted about 3 days, but since then Google sends me all people who want to watch "M6" on the web. I've put links to other French channels which can be viewed free anyway, to avoid frustration.

See you soon for some new computer-powered experimentations!

Veronising...