Tag ngrams (5)
vi, ngrams, hats
Posted on 2007-05-18 09:42:00
Tags: vi ngrams
I found this great article yesterday on why vi is so awesome, which includes some handy tips I didn't know. (my vi-fu is not particularly strong, although I do use it a lot) On the same site there's a wonderful graphical cheat sheet that I now have at my fingertips. (literally!)
Here's the first little application using the Wikipedia n-grams - a simple interface to count the number of times a word shows up. Problems: it doesn't work in IE (sigh...I'll fix it soonish) and there's no progress indicator, so wait a minute after clicking "Submit". Edit: also, there are some crazy results: "zi" (1841) is more common than "carrie" (1361). Another sigh.
So there's this neat hat problem that's been making the rounds. n people (who are allowed to discuss strategy beforehand) each have either a red or blue hat put on, so each person can see the other n-1 hats but not their own. Then, simultaneously and without communicating, each person has to guess what color their own hat is, or decline to guess. If at least one person guesses and all guesses are right, they all win; otherwise, they all lose. Here's an article that describes the problem and the optimal solution for 3 people. Apparently there are very good solutions for 2^n-1 people (3, 7, 15, ...), but the solution for 7 people, say, is way less elegant than the one for 3.
A coworker and I were talking about this problem yesterday and how to find the optimal solution. For the n-person problem, a strategy for one person can be described as 2^(n-1) characters of "R" (guess red), "B" (guess blue), or "N" (don't guess), since that covers every configuration of hats that person can see. So a complete strategy is 2^(n-1)*n such characters. It's pretty clear how to evaluate what percentage of the time a strategy will work (just try all 2^n hat assignments), so if the number of strategies (3^(2^(n-1)*n)) is small enough you could just try everything. For the 7-person case this works out to 5.6*10^213 strategies - clearly too many! But you could use a genetic algorithm to breed solutions together based on their fitness.
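The evaluation step above is easy to sketch in a few lines of Python. This is just an illustration (the function names are mine, not from any real implementation): a strategy is a function from (person, visible hats) to "R"/"B"/"N", and we score it by trying all 2^n hat assignments. The three_person function is the well-known optimal 3-person strategy, which wins 75% of the time.

```python
from itertools import product

def evaluate(strategy, n):
    """Fraction of the 2^n hat assignments the strategy wins.
    strategy(i, visible) returns 'R', 'B', or 'N' (pass), where
    visible is the tuple of the n-1 hats that person i can see."""
    wins = 0
    for hats in product('RB', repeat=n):
        guesses = [strategy(i, hats[:i] + hats[i + 1:]) for i in range(n)]
        someone_guessed = any(g != 'N' for g in guesses)
        all_correct = all(g in ('N', hats[i]) for i, g in enumerate(guesses))
        if someone_guessed and all_correct:
            wins += 1
    return wins / 2 ** n

def three_person(i, visible):
    # The classic 3-person strategy: if the two visible hats match,
    # guess the opposite color; otherwise pass.
    if visible[0] == visible[1]:
        return 'B' if visible[0] == 'R' else 'R'
    return 'N'

print(evaluate(three_person, 3))  # 0.75
```

This only loses on the two monochrome assignments (everyone guesses and everyone is wrong), which is exactly why exhaustive fitness evaluation is cheap even though exhaustively searching the 3^(2^(n-1)*n) strategy space is hopeless for n=7.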
So anyway, I may take a break from n-grams to code up some stuff relating to this.
obsession #12358: bsg
Posted on 2007-05-14 12:53:00
So destroyerj gave us many seasons of Battlestar Galactica as a present, and we sat down kinda late Saturday night to watch the miniseries. (i.e. Disc 1 of Season 1) I had heard from a few people that the miniseries was kinda slow since it had to introduce all the characters and such. To the contrary, I enjoyed it a lot, and we stayed up until 1:30 or so to finish it. Good stuff, although I liked the first "real" episode less than the miniseries. My mom's coming this week (she has a conference in Columbia...what are the odds?) but I guess we'll pick it up next week...
As one obsession waxes, another wanes. My licensing problem with the Google n-grams led me to try to find my own. I found a corpus based on Wikipedia that I'm currently using, but the parser I wrote to extract words isn't very good, and I'm not sure it can be improved given how inconsistent the data is. So my data is not particularly clean. (sometimes two words are strung together, sometimes a single word is broken into two) I wrote a quick little thingy to do a simple lookup on the popularity of a word, and I'll probably make a web interface for that, but I'm highly doubtful that a cryptogram solver would actually work. (and I've basically given up on trying to extract 2-grams, 3-grams, etc. that would generate English-looking text)
So I'm really losing motivation to work on it, and not having an exciting project to work on leaves me in a weird and unstable condition. I like having a drive to do things like this, but sometimes it's kinda irritating. Maybe a little time off from the n-grams will give me some more inspiration...or maybe I'll give up completely and move on to something else. In a very real way it shouldn't matter (it's just something I'm doing for fun, not for anyone in particular), but I still feel bad starting a project, then giving up and moving on to another one a week later.
No n-grams because of the license
Posted on 2007-05-08 14:13:00
So it turns out that to get the Google n-grams, you have to sign a user license agreement that forbids the more interesting things I'd like to do. Bummer. So I sent them an email:
Firstly, I'd like to thank Google for making this gigantic corpus available to the public. When I first read about it I was excited and immediately thought of a number of neat applications I could do using the data (see my post at http://gregstoll.livejournal.com/114129.html). Unfortunately, it looks like I won't be able to use the data as provided.
The first issue was the price - I would be just doing these projects in my free time, for noncommercial use, but the page at the Linguistic Data Consortium ( http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13) lists the price for nonmembers as $180 (including a $30 shipping fee). A fee for processing and handling is certainly reasonable, but $180 is a bit steep for 6 DVDs. Admittedly, this price may be set by the LDC itself, as most corpora are at least this expensive, but I had to carefully consider whether it was worth the price for a side project.
Eventually I decided it would be worth it to get my hands on such a great resource, and I even sent in the order to LDC. Unfortunately, I sent in the wrong license agreement, and was dismayed to find the agreement specific to this corpus. (http://www.ldc.upenn.edu/Catalog/mem_agree/Web_1T_5gram_V1_User_Agreement.html) Since I wouldn't be allowed to "publish, retransmit, display, redistribute, reproduce or commercially exploit the Data in any form", except for "limited excerpts from the Data in articles, reports and other documents describing the results of User’s linguistic education and research", this means that the more interactive ideas I had (a cryptogram solver, algorithm to calculate the probability of a given sentence, generator of English-like text) wouldn't be allowed.
So I'm forced to rethink my plans and try to gather my own corpus from the web, which will be undeniably smaller and less accurate. Of course I understand that Google was under no obligation to provide this data in the first place, but it is a little frustrating to have it so tantalizingly close and yet be unable to use it.
(crossposted to http://gregstoll.livejournal.com/115202.html)
Posted on 2007-05-08 09:40:00
I couldn't help myself and ordered the Google n-gram set. Don't know when it will arrive, but I'd suspect it will take a while. Maybe 6-8 weeks? :-)
In the meantime I'm working on algorithms and running them on the data set Peter Norvig used for his spelling corrector. Did you know "the" is the most common word? It's true! Tonight I hope to compare the distribution of words to Zipf's Law. (which is a very cool law!)
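That comparison can be roughed out quickly: Zipf's Law predicts that the k-th most common word occurs about 1/k as often as the most common one. Here's a minimal sketch (the function name and toy sentence are made up for illustration; a real corpus would only follow the prediction approximately):

```python
from collections import Counter

def zipf_table(words, top=5):
    """Rank the most common words and compute the Zipf prediction
    f(k) ~ f(1)/k for each rank k."""
    counts = Counter(w.lower() for w in words)
    ranked = counts.most_common(top)
    f1 = ranked[0][1]  # frequency of the most common word
    return [(rank, word, count, f1 / rank)
            for rank, (word, count) in enumerate(ranked, 1)]

# Toy example - and yes, "the" comes out on top here too.
words = "the cat sat on the mat and the dog sat on the log".split()
for rank, word, count, predicted in zipf_table(words, top=3):
    print(rank, word, count, predicted)
```

On a real corpus you'd plot log(frequency) against log(rank) and look for a straight line of slope roughly -1.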
If you like MST3K, you'll probably like RiffTrax. (NYT article) We should try this sometime!
As I post, the #3 bestselling book on Amazon is The Secret, which is about the secret to getting everything you want. Apparently it's (spoiler!) just wanting it enough. Oh, and if you get sick or bad things happen to you, it's your fault for not thinking about good things enough, and you should be shunned lest your lack of good thoughts spread! Slate does a nice Human Guinea Pig about it.
I love data
Posted on 2007-04-25 16:05:00
Tags: programming ngrams
Thanks to Peter Norvig's (he wrote my AI textbook! and is director of research at Google!!) article about writing a spelling corrector, I was reminded that Google released a giiiiant list of n-grams found on the web. Unfortunately, it's restricted to noncommercial use unless you join the LDC and pay thousands of dollars (noncommercial is fine for my purposes), and it costs $180(!) to buy and ship. On the other hand, it's 6 DVDs of compressed data (24 GB of gzipped files). This is soooo tempting.
This backup was done by LJBackup.