Posts on May 8, 2007
Posted on 2007-05-08 09:40:00
I couldn't help myself and ordered the Google n-gram set. Don't know when it will arrive, but I'd suspect it will take a while. Maybe 6-8 weeks? :-)
In the meantime I'm working on algorithms and running them on the data set Peter Norvig used for his spelling corrector. Did you know "the" is the most common word? It's true! Tonight I hope to compare the distribution of words to Zipf's Law (which is a very cool law!).
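Something like this quick Python sketch (not my actual script, just the idea) is what I have in mind: count word frequencies, then check Zipf's prediction that the n-th most common word shows up roughly 1/n as often as the most common one, so rank × count should stay roughly constant near the top of the list.

```python
import re
from collections import Counter

def zipf_table(text, top=10):
    """Return (rank, word, count, rank*count) for the top-ranked words.

    Under Zipf's Law the rank*count column should be roughly constant.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    table = []
    for rank, (word, count) in enumerate(counts.most_common(top), start=1):
        table.append((rank, word, count, rank * count))
    return table

# Toy text just to show the shape of the output; the real run would use
# the big.txt-style corpus from the spelling corrector.
sample = "the cat sat on the mat and the dog sat on the log " * 3
for rank, word, count, product in zipf_table(sample, top=5):
    print(rank, word, count, product)
```

On a real corpus you'd eyeball the rank × count column (or plot log-rank against log-count and look for a straight line).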
If you like MST3K, you'll probably like RiffTrax (NYT article). We should try this sometime!
As I post, the #3 bestselling book on Amazon is The Secret, which is about the secret to getting everything you want. Apparently it's (spoiler!) just wanting it enough. Oh, and if you get sick or bad things happen to you, it's your fault for not thinking about good things enough, and you should be shunned lest your lack of good thoughts spread! Slate does a nice Human Guinea Pig about it.
No n-grams because of the license
Posted on 2007-05-08 14:13:00
So it turns out that to get the Google n-grams, you have to sign a user license agreement that forbids the more interesting things I'd like to do. Bummer. So I sent them an email:
Firstly, I'd like to thank Google for making this gigantic corpus available to the public. When I first read about it I was excited and immediately thought of a number of neat applications I could do using the data (see my post at http://gregstoll.livejournal.com/114129.html). Unfortunately, it looks like I won't be able to use the data as provided.
The first issue was the price - I would be just doing these projects in my free time, for noncommercial use, but the page at the Linguistic Data Consortium ( http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13) lists the price for nonmembers as $180 (including a $30 shipping fee). A fee for processing and handling is certainly reasonable, but $180 is a bit steep for 6 DVDs. Admittedly, this price may be set by the LDC itself, as most corpora are at least this expensive, but I had to carefully consider whether it was worth the price for a side project.
Eventually I decided it would be worth it to get my hands on such a great resource, and I even sent in the order to LDC. Unfortunately, I sent in the wrong license agreement, and was dismayed to find the agreement specific to this corpus. (http://www.ldc.upenn.edu/Catalog/mem_agree/Web_1T_5gram_V1_User_Agreement.html) Since I wouldn't be allowed to "publish, retransmit, display, redistribute, reproduce or commercially exploit the Data in any form", except for "limited excerpts from the Data in articles, reports and other documents describing the results of User's linguistic education and research", the more interactive ideas I had (a cryptogram solver, an algorithm to calculate the probability of a given sentence, a generator of English-like text) wouldn't be permitted.
So I'm forced to rethink my plans and try to gather my own corpus from the web, which will be undeniably smaller and less accurate. Of course I understand that Google was under no obligation to provide this data in the first place, but it is a little frustrating to have it so tantalizingly close and yet be unable to use it.
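For the "probability of a given sentence" idea, here's a rough sketch of what I'd build from a self-gathered corpus (toy data here, and the smoothing choice is just one option): train bigram counts, then score a sentence with smoothed log probabilities so unseen bigrams don't zero everything out and long sentences don't underflow.

```python
import math
from collections import Counter

def train_bigrams(corpus):
    """Count unigrams and bigrams from a whitespace-tokenized corpus."""
    tokens = corpus.lower().split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_log_prob(sentence, unigrams, bigrams):
    """Sum of log P(word | prev) over the sentence's bigrams,
    with add-one (Laplace) smoothing over the vocabulary."""
    tokens = sentence.lower().split()
    vocab = len(unigrams)
    logp = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        logp += math.log((bigrams[(prev, word)] + 1) /
                         (unigrams[prev] + vocab))
    return logp

# Tiny corpus just to show the idea: word orders seen in training
# should score higher than scrambled ones.
unigrams, bigrams = train_bigrams("the cat sat on the mat the cat ran")
print(sentence_log_prob("the cat sat", unigrams, bigrams))
print(sentence_log_prob("sat the cat", unigrams, bigrams))
```

With a decent-sized corpus the same scoring function would also drive the English-like text generator (sample the next word from the bigram distribution instead of scoring a fixed sentence).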
(crossposted to http://gregstoll.livejournal.com/115202.html)
This backup was done by LJBackup.