
Word Frequency

I need to know how often each word is used. I'd like to answer the question, "How big is the vocabulary of the Internet?" And I'd like the database I generate along the way to help answer a host of other interesting questions. (Related domain names and spelling suggestions come to mind immediately.)

So, would this work?

Link a crawler to a database that stores the words found on each page crawled. Feed the crawler a seed set of pages, and make sure it hits Project Gutenberg, some AP archives, and DMOZ. Increment the count of each word every time we run across it again.
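
A minimal sketch of the counting step, assuming the crawler hands over each page as plain text. The names `tokenize` and `count_words` are made up for illustration, the regular expression is only one rough guess at what counts as a "word", and an in-memory `Counter` stands in for the database table.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of lowercase letters, allowing inner apostrophes.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(text):
    """Split page text into lowercase words."""
    return WORD_RE.findall(text.lower())

def count_words(pages, counts=None):
    """Add the words from each page to a running tally and return it."""
    counts = Counter() if counts is None else counts
    for text in pages:
        counts.update(tokenize(text))
    return counts

# Two stand-in "pages"; a real crawler would supply these.
pages = ["The cat sat.", "The dog didn't sit; the cat did."]
counts = count_words(pages)
print(counts.most_common(3))
```

The size of the vocabulary is then just `len(counts)`, and swapping the `Counter` for a table keyed on the word would be the obvious next step.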

If there are 750,000 words (as suggested by the Oxford Dictionary people), I should be able to store this in something like 23 megs, as long as I cap individual word frequencies at the limit of roughly four billion imposed by an unsigned MySQL INT. (Of course that could be increased, for example with a BIGINT.)
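
The arithmetic behind that estimate, as a rough sketch. The 32 bytes per row is my own assumption (about 28 bytes of word text plus a 4-byte counter) and ignores index and row overhead, so the real table would be somewhat bigger. `bump` is a hypothetical helper showing what capping a count would look like.

```python
WORDS = 750_000                # vocabulary size quoted above
BYTES_PER_ROW = 32             # assumption: ~28 bytes of word text + 4-byte counter
MAX_UNSIGNED_INT = 2**32 - 1   # ceiling of a 4-byte unsigned counter

total_bytes = WORDS * BYTES_PER_ROW
total_mib = total_bytes / 2**20
print(f"{total_mib:.1f} MiB")  # prints "22.9 MiB"

def bump(count):
    """Increment a word count, but never past the 4-byte unsigned ceiling."""
    return min(count + 1, MAX_UNSIGNED_INT)
```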

I could use Content-Language headers (or the charset in the content-type string) to identify languages, couldn't I? Or is that too unreliable? Would I end up indexing Japanese and French pages right along with the English ones?

And actually... That would be *cool*. With a way to store space-data with each word, I wonder if with some simple language seeding I could even automatically generate language-specific indexes...?
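
One way the "language seeding" idea might look, as a sketch only: give each language a handful of very common words and assign a page to whichever seed list it overlaps most. The seed lists, `SEEDS`, and `guess_language` are all invented here for illustration, and real pages would need far bigger seeds and something smarter than a raw overlap count.

```python
# Hypothetical seed vocabularies: a few very common words per language.
SEEDS = {
    "en": {"the", "and", "of", "is", "to"},
    "fr": {"le", "la", "et", "les", "des", "est"},
}

def guess_language(words, seeds=SEEDS):
    """Return the seed language sharing the most words with the page, or None."""
    scores = {
        lang: sum(1 for w in words if w in vocab)
        for lang, vocab in seeds.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("le chat est sur la table".split()))  # prints "fr"
print(guess_language("the cat is on the table".split()))   # prints "en"
```

Each page's words could then go into a per-language tally, and the words that keep turning up on pages already labeled with a language could be added to its seed list.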

What is a language? What defines a language?

Wow wow. I can't believe Google hasn't come out with more stuff than they have. With all that data -- I wonder.


About

This page contains a single entry from the blog posted on October 3, 2005 4:06 PM.
