Sitemap URL Redirecting

I recently encountered an issue where a client with a web server cluster was having difficulty keeping every node's copy of their sitemap up to date. They wanted to know whether all of the search engines' sitemap bots would follow a 301 redirect on the sitemap itself to a master server on a different domain than the site for which the sitemap is authored.

I doubted we could trust that a redirect on a sitemap would be followed by every service that supports sitemaps. I asked Google's Matt Cutts about it, and he offered the following suggestions:

Suggestion #1: Write a script in your favorite server-side scripting language that requests the master sitemap URL and returns its contents as output.

Example sitemap.php script (PHP 4+, with allow_url_fopen enabled):

<?php readfile('http://master.server/real-sitemap.xml'); ?>

In PHP, readfile() not only reads the specified file or URL, it also outputs it in the same step. So if you put this script in each server's document root, all you have to do is give the various Webmaster consoles http://yoursite.com/sitemap.php as the sitemap URL, and it will take care of the rest rather elegantly.
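If you want something slightly more defensive than that one-liner (so a hiccup on the master doesn't hand a crawler an empty or missing sitemap), a small local cache helps. The following is a sketch only, assuming PHP 5 with allow_url_fopen enabled; the master URL, cache path, and TTL are placeholders of mine, not real values:

<?php
// Sketch of a slightly more defensive sitemap proxy (assumes PHP 5
// with allow_url_fopen; the URL, cache path, and TTL are placeholders).
$master = 'http://master.server/real-sitemap.xml';
$cache  = '/tmp/sitemap-cache.xml';
$ttl    = 3600; // seconds; re-fetch from the master at most once an hour

header('Content-Type: application/xml');

// Serve the local cache while it's fresh, sparing the master a hit.
if (file_exists($cache) && (time() - filemtime($cache)) < $ttl) {
    readfile($cache);
    exit;
}

// Cache is stale or missing: try the master.
$xml = @file_get_contents($master);
if ($xml === false) {
    // Master unreachable -- serve a stale cache rather than nothing.
    if (file_exists($cache)) {
        readfile($cache);
    } else {
        header('HTTP/1.1 503 Service Unavailable');
    }
    exit;
}

file_put_contents($cache, $xml);
echo $xml;
?>

The fallback to a stale cache is deliberate: a slightly out-of-date sitemap is far better for a crawler than a 404 or an empty response.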

Suggestion #2: Use cross-submit sitemap files.

This is a very useful solution to a lot of problems, though it only solves the problem for Google, and it requires creating a site in the Google Webmaster Tools console for each possible subdomain (or cluster node, I suppose), each of which will need its own verification file placed in the server's document root.

If you're talking about a lot of sites (and thus a lot of unique verification files), one possible solution may be to wildcard-redirect Google verification file requests so that literally any google*.html returns a status 200, which is all the Google Sitemaps verification looks for.
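For concreteness, the hack might look something like this in an Apache .htaccess file. This is only an illustrative sketch: the filename pattern and the stub script name are my own inventions, not Google's actual scheme.

RewriteEngine On
# Hand any google*.html request to a stub so it always returns a 200.
RewriteRule ^google[0-9a-zA-Z]+\.html$ /google-stub.php [L]

The stub itself can be an empty PHP file, since returning a 200 status is the whole trick:

<?php /* google-stub.php: intentionally empty; a 200 is all that's checked */ ?>

Sound like a cool idea? Think a little harder…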

Imagine you are an unscrupulous practitioner of corporate espionage who wants access to a competitor's Google Webmaster Tools console for a variety of creative reasons, ranging from competitive intelligence gathering (such as their true Google inlink count) to outright sabotage by de-listing a few key URLs for which they outrank you. Note: please do not do this, and if you do, it wasn't my fault!

All you would have to do is look for people who have followed the advice of Reaper-X linked above and wildcard-spoofed their own Google verification URLs. That is easily accomplished by trying a few random Google Webmaster Tools verification URLs on their domain; once you find one that never returns a 404 (always a 200), you have found your mark. Then it's a simple matter of adding their site to your own Webmaster console, verifying ownership immediately, and you're in like Flynn.
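In fact, checking whether your own site is exposed this way is trivial. Here's a quick sketch in PHP 5; the hostname is a placeholder, and the random filename is just an illustration of the probing described above:

<?php
// Request a made-up verification filename; a healthy site should 404 it.
// A 200 for a random name suggests the wildcard hack is in place.
$host    = 'yoursite.com'; // placeholder: put your own domain here
$probe   = 'http://' . $host . '/google' . substr(md5(mt_rand()), 0, 16) . '.html';
$headers = get_headers($probe);
echo $probe . ' => ' . ($headers ? $headers[0] : 'request failed') . "\n";
?>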

I suppose if you really think you need to implement this hack for your own site, you could reduce (but not eliminate) your risk of this sort of attack by adding a rewrite condition which first checks the User-Agent for Googlebot, and perhaps known Googlebot IP addresses as well. But any marginally savvy attacker could write an automated detection script which properly spoofs those HTTP headers, so at best this solution still only provides security through obscurity, and that ain't no kinda security a'tall.
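In mod_rewrite terms, that mitigation might look something like this, layered onto the earlier stub. Again, just a sketch: the IP range shown is only an example of a Googlebot-looking range, and such ranges change over time, so it would need to be kept current.

RewriteEngine On
# Only answer the wildcard rule for requests that at least claim to be
# Googlebot; the User-Agent header is trivially forged, hence "obscurity".
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
# Optionally also require a Googlebot-looking source address (example only).
RewriteCond %{REMOTE_ADDR} ^66\.249\.
RewriteRule ^google[0-9a-zA-Z]+\.html$ /google-stub.php [L]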
