
Look Like a Search Engine for Fun and Profit

It really bugs me when I see a result on a search engine results page (SERP) that looks perfect -- then I click it and get redirected to a page telling me to "subscribe and see the full article!"

So I figured out a way to get around it, at least in certain conditions.

I got to thinking, as I sometimes do, and wondered: how did the search engine know what article was there if the link just goes to a subscribe page? The search engines aren't signing up for these forums and things.

Fortunately, I'm the director of R&D for an Internet marketing company, and figuring these sorts of things out is basically my job. (What a great job!)

It turns out the websites themselves check who is visiting them. When a crawler from a search engine hits the site, it gets the full text of the article. When a human visits, the site sends the subscription or sign-up page instead.

There are several different things a website can check to try to determine whether you're a search engine or a person. The easiest and most common is the "user agent" -- a short string your browser sends with every page request.

Essentially, when you type "www.webmasterworld.com" into Internet Explorer, the browser sends a message to that site. The message includes a "user agent" string that says, "I'm Internet Explorer 5.5," which in theory lets the site send properly formatted content (since not all browsers are exactly the same).

Search engine crawlers send messages like, "I'm Googlebot!"
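The check on the site's side might look something like the sketch below. This is only a guess at the general shape, not any particular site's code: the function name pick_page, the CRAWLER_MARKERS list, and the two page strings are all made up for illustration.

```python
# Hypothetical sketch of user-agent-based content switching.
# CRAWLER_MARKERS, pick_page, and the page strings are illustrative
# assumptions, not taken from any real site.

CRAWLER_MARKERS = ("googlebot", "yahoo! slurp", "yahooseeker")

FULL_ARTICLE = "full article text"
SIGNUP_PAGE = "subscribe and see the full article!"


def pick_page(user_agent):
    """Return the full article for crawler-looking user agents, else the sign-up page."""
    ua = (user_agent or "").lower()  # a missing header is treated as a normal visitor
    if any(marker in ua for marker in CRAWLER_MARKERS):
        return FULL_ARTICLE
    return SIGNUP_PAGE


if __name__ == "__main__":
    print(pick_page("Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"))
    print(pick_page("Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)"))
```

Notice the site is taking the visitor's word for it. Nothing in this check confirms that whoever says "I'm Googlebot!" really is.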

If you're a happy Firefox user (which you really should be: http://www.firefox.com), getting around this is as easy as pie.

Search for something like "firefox user agent extension" and you'll get a page with the most recent version of a handy tool. Install the extension and you'll have a new drop-down menu that lets you pick which "user agent" you want your browser to send. You can be strictly truthful, and leave it at the default.

Or, more usefully, you can have it tell every site you visit, "I'm Googlebot!"
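If you'd rather do the same thing from a script than from a browser menu, here is a minimal sketch using Python's standard urllib.request module. The helper names build_request and fetch are my own, and the user agent string is the Googlebot one from the reference list at the end of this post. Whether a given site actually returns the full article is up to that site.

```python
import urllib.request

# Googlebot user agent string, copied from the reference list in this post.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url, user_agent=GOOGLEBOT_UA):
    """Build (but do not send) a request carrying the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=GOOGLEBOT_UA, timeout=10):
    """Send the request and return the response body as text."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch("http://www.webmasterworld.com/") would request that page while announcing itself as Googlebot. Pass a different user_agent to compare what a regular browser gets.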

Then you'll be able to see everything* the search engines are allowed to see.

Happy learning!

===========

For reference, here are some user agent strings that search engine crawlers send:

YahooSeeker/1.0 (compatible; Mozilla 4.0; MSIE 5.5; http://search.yahoo.com/yahooseeker.html)

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

YahooFeedSeeker/1.0 (compatible; Mozilla 4.0; MSIE 5.5; http://my.yahoo.com/s/publishers.html)

Yahoo-MMCrawler/3.x (mms dash mmcrawler dash support at yahoo dash inc dot com)

Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)

======================
*Okay, there's an exception. Another thing websites can check is the IP address of the machine you're using. That's a lot harder to fake: the site sends its reply to whatever address the request came from, so if you claim somebody else's address, the page goes to them and not to you. In fact, I haven't found a way to arbitrarily specify a return IP address. But it seems most sites don't really use IP-based detection yet, so this method is still pretty practical. I'll let you know if that changes.
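For the curious, an IP-based check on the site's side might look roughly like this sketch, which uses Python's standard ipaddress module. The function name looks_like_crawler is hypothetical, and the address ranges are placeholders from the blocks reserved for documentation, not any search engine's real crawler addresses.

```python
import ipaddress

# Placeholder ranges taken from the reserved documentation blocks (RFC 5737).
# These are assumptions for illustration, NOT real crawler addresses.
ASSUMED_CRAWLER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def looks_like_crawler(user_agent, remote_ip):
    """Accept a visitor as a crawler only if both the user agent and the IP agree."""
    claims_crawler = "googlebot" in user_agent.lower()
    address = ipaddress.ip_address(remote_ip)
    from_known_range = any(address in network for network in ASSUMED_CRAWLER_NETWORKS)
    return claims_crawler and from_known_range
```

A site checking this way would still show the sign-up page to a browser that says "I'm Googlebot!" from an ordinary home connection, because the address wouldn't match.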


About

This page contains a single entry from the blog posted on August 16, 2005 7:50 AM.
