I ran into a weird problem today, with a client who built a new website on top of an old one. Of course, the standard SEO advice is to keep all your filenames the same, as much as possible, and put 301 redirects when you have to move content. But this client just wanted a new site, and didn't care about (and didn't really have) any meaningful rankings to begin with.
So the old site was deleted and the new site built. The new site had all the magical SEO sparkles, but after a month it wasn't doing as well as we thought it should be. MSN and Yahoo were okay, but Google was lagging. Why?
I checked in Google (site: command) and saw that all the old pages were still listed along with the new. So to Google, this site was about 90% 404 errors and invalid content. Not looking so hot. I recommended to the consultant working on this client that he set up 301 redirects on all the old pages to the index of the new site, just to clean them out of Big G's index.
Problem solved! I'm a genius.
At least, until the consultant and the programmer working with him came back and asked the shocking question: how do you get Apache to serve all 404 error pages as 301 redirects to a new index page?
It had seemed so elegant a solution, I never worried much about the implementation. But now it was back, haunting me.
The easy way, of course, is to set up something in the htaccess file like the example TamingtheBeast.net lists:
redirect 301 /old/old.htm http://www.you.com/new.htm
But we had something like 1000 old URLs that weren't in a simple set of directories or anything, and they all had long querystrings and -- I would have had to create 1000 separate entries in the file, and I couldn't even find all the old names. Not an option.
Some poking around quickly revealed that nobody had really done this before, for this purpose. I found some sites that actually talked about serving 404s with redirects, but that didn't include the 301 I needed for search engines. And I found a couple of places that talked about using mod_rewrite with a 301, but that didn't include the pointer back to the index! (So Google would see that the page had moved permanently, to exactly where it already was, but now with duplicate content from the index -- which could get the site penalized. Not ideal.)
As I thought about this, I also realized that we didn't really want every 404 page to 301 back to the index. That can get really confusing for people. So I started looking at the missing pages, to see whether they shared some consistent attribute that I could match with mod_rewrite before a 404 error was thrown.
Fortunately, the old site had been written in ColdFusion. Every old page had a ".cfm" in it, and the new site was all php. So I wrote a mod_rewrite in the htaccess file that would use regular expressions to find any ".cfm" and load up a new php script I wrote instead. The php script sent a standard 301 redirect with the index target.
Here are the files I wrote, mostly for my own future reference:
====================
.htaccess modification:
====================
RewriteEngine On
# Internally hand any URL containing ".cfm" to cleaner.php, which sends the 301.
RewriteRule \.cfm /cleaner.php [L]
====================
cleaner.php
====================
<?php
header( "HTTP/1.1 301 Moved Permanently" );
header( "Location: http://www.NEWSITE.com/" );
exit;
How cool is that? I'm a genius.
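For what it's worth, mod_rewrite can also send the 301 by itself, with no PHP helper in between. This is only a sketch, assuming every old URL has a path ending in ".cfm" and that the host allows mod_rewrite in .htaccess; NEWSITE is the same placeholder as above.

```apache
RewriteEngine On
# Any path ending in .cfm gets a permanent (301) redirect to the new index.
# The trailing "?" drops the old query string, NC ignores case,
# and L stops further rule processing for this request.
RewriteRule \.cfm$ http://www.NEWSITE.com/? [R=301,NC,L]
```

The PHP version still earns its keep if you want to log or inspect the old URLs before redirecting.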
Comments (2)
At the Utah PHP Users group last night, Mac Newbold talked about using the 404 error document for this very purpose, among others. You can put "ErrorDocument 404 /cleaner.php" in .htaccess and check for the .cfm extension in the PHP code (using $_SERVER['REQUEST_URI']). Then send the appropriate headers with header(). Since you'd then be in PHP instead of htaccess, you could send yourself an email to know which URL was being requested or who the referrer was. This might also be easier for some people than writing regular expressions, and most hosts support ErrorDocument even if they don't support RewriteRule.
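A minimal sketch of that ErrorDocument approach, assuming cleaner.php is the script named in the ErrorDocument line and that the old pages are the ones whose path ends in ".cfm". The helper name is_legacy_cfm_url is made up for illustration, and NEWSITE is the same placeholder used in the post.

```php
<?php
// Hypothetical cleaner.php, used as the 404 handler via
// "ErrorDocument 404 /cleaner.php" in .htaccess.

// Returns true when the requested path ends in ".cfm" (an old ColdFusion page).
function is_legacy_cfm_url($requestUri)
{
    // Look only at the path, so ".cfm" inside a query string does not count.
    $path = parse_url($requestUri, PHP_URL_PATH);
    return is_string($path) && preg_match('/\.cfm$/i', $path) === 1;
}

// The originally requested URL; empty when run outside a web request.
$requestUri = isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '';

if (is_legacy_cfm_url($requestUri)) {
    // Old ColdFusion URL: send visitors and crawlers to the new index for good.
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.NEWSITE.com/');
    exit;
}

// Anything else is a genuine miss, so keep the 404 status.
header('HTTP/1.1 404 Not Found');
echo 'Page not found.';
```

Only the .cfm requests get the 301 here; every other missing page still returns a normal 404, which avoids bouncing confused visitors to the index.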
Posted by Richard K Miller | February 17, 2006 6:26 PM
Richard -- Thanks for the post. A classic example of the value of networking that I ought to be participating in more. :o)
I saw instructions for doing this with the 404 when I was searching, but I wasn't sure if the PHP would actually override the server-level 404 header setting. I guess it does! If I'd known that, I'd probably have gone this route. Next time!
Of course, learning regular expressions has its own benefits, too. And sending a response email (highly undesirable in this particular case) would also be available through the cleaner.php I ended up using.
Posted by Tom | February 20, 2006 4:13 PM