All projects: DIY-IT Project Guide
This project: Migrating a massive legacy CMS system to WordPress
I've recently finished the first production phase of a migration from a very old content management system to a new one, based on WordPress and my own code. When you migrate a very large system (we had about a hundred thousand articles), there are often a lot of loose ends that need cleaning up -- and that can take quite some time.
The other day, working on the ZDNet DIY-IT article, I discovered one of those loose ends.
As it turned out, a directory showed up in Google that I'd rather not have indexed by the search engine. Now, before you go thinking the keys to the kingdom are in there, that's not the issue. It's just that the directory holds many of the intake files the old CMS used to generate articles.
Let me illustrate. In the Camtasia article, I wanted to reference a review I did of SnagIt back in 2007 for Connected Photographer. I remembered I called SnagIt insane, for all the features it had (it actually has more now, believe it or not). So I did a search on "insane snagit gewirtz" -- as you can see in the following screenshot.
Notice the first organic result. That's the one we want Google to index and display. But notice that there's another line, the one that begins with ".FLYINGHEAD". That's a .doc file (not a Word file in this instance, but a CMS text file using the .doc extension). Now, there's nothing special about that file, except that it's very old and shouldn't be showing up in the Google index.
Here's what's happening. On our main server, we have one directory that holds our WordPress install. Let's call that "main". I also have another directory, one that mostly has all the old images from the migration -- and all these .doc intake files. Let's call that "legacy".
Both are on the same level, so "main" is a sibling of "legacy". The thing is, I want browsers to be able to fetch files from the legacy directory, because I have tens of thousands of article images in there. I just don't want Google to index the whole thing.
Like I said, it's not about hiding anything, but as you can see, the search results just don't look clean. And I'm guessing we're probably losing some SEO juice, because there are a lot of extra files online, including ones whose content duplicates what we're serving on the actual site.
For my site, there's not a lot of harm being done. But if you follow the escapades of my ZDNet colleague Stephen Chapman, our very own search ninja and host of the excellent SEO Whistleblower blog, you'll see how some information shouldn't be exposed. I asked Stephen to share with me a couple of stories, so he provided me the following links:
In my case, I simply needed to plug the leak. I will eventually send a program through all those legacy directories and clear out the old files we no longer need. But that's a relatively time-consuming "mere matter of programming," so it can wait.
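To give you an idea of what that cleanup program might look like, here's a minimal sketch. The directory name and extension are assumptions on my part; and note it's a dry run only -- it lists the candidates rather than deleting anything, which is how I'd want any script like this to start.

```python
# Hypothetical cleanup sketch: find the old CMS intake files (here
# assumed to use the .doc extension) under a legacy tree, so they
# can be reviewed before anything is actually deleted.
import os

def find_intake_files(root: str, ext: str = ".doc") -> list[str]:
    """Walk root and return every file whose name ends in ext."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(ext):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)

if __name__ == "__main__":
    # "legacy" is a stand-in for the real directory name.
    for path in find_intake_files("legacy"):
        print(path)  # review this list before removing anything
```

Only once the list looks right would I swap the print for an actual delete.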
In the meantime, I've decided to wield the robots.txt sledgehammer. Robots.txt is a file that sits at the root of a site, and most (not all) search spiders respect it. It tells spiders which folders to explore and which to avoid. I used a directive that, over time, will effectively remove the entire directory from Google's index:
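The two lines -- reconstructed here from the discussion that follows, where the "Disallow: /" line is quoted -- are the standard blanket-disallow pair:

```
User-agent: *
Disallow: /
```

"User-agent: *" addresses every spider, and "Disallow: /" tells them to stay away from everything at and below the location the file governs.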
It should be noted that placing this file, with these two lines, in the wrong place can effectively nuke you off the Internet. Stephen provided the following disturbing tale as illustration:
When I discussed this article with Stephen (it's cool having access to the smartest minds by virtue of being part of ZDNet), he shared with me a few cautions. First, he wanted you to know that just because robots.txt tells search engines to not index a directory, that doesn't mean the directory is protected from Internet access. It's just a map, not a guard dog.
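If a directory genuinely needs protecting, the guard dog is server-side access control, not robots.txt. A minimal sketch, assuming an Apache server (the exact syntax varies by version and setup):

```
# .htaccess in the directory to protect -- requests are refused
# no matter what robots.txt says.
# Apache 2.4 syntax; on Apache 2.2 you'd use "Deny from all" instead.
Require all denied
```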
Second, he warned that robots.txt is, itself, readable by anyone on the Internet. See that line up there that says "Disallow: /". That's a pretty basic robots.txt line. But some people might have a line like:
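For instance, something like this (the path matches the example discussed next):

```
Disallow: /site/confidential/financial
```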
While the search engines would now not index /site/confidential/financial, by virtue of its being listed in robots.txt, some Web site spelunkers could now easily discover that a directory called /site/confidential/financial exists (where they otherwise might never have known) -- and they may go digging in to see what they can find.
I also wouldn't put it past Wikileaks or some other sort of online snooping service to read robots.txt files from all over the Internet and then publish specifically what those files list as disallowed.
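It really is that easy. A few lines of Python are enough to pull every disallowed path out of a robots.txt body -- the sample content below is made up, but in practice you'd just fetch any site's /robots.txt first:

```python
# Sketch: extract the Disallow paths from a robots.txt body.
# The sample content below is hypothetical; a snoop would simply
# fetch https://example.com/robots.txt and feed it in.

def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every path listed after a Disallow: directive."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths

sample = """\
User-agent: *
Disallow: /site/confidential/financial
Disallow: /legacy/
"""

print(disallowed_paths(sample))  # ['/site/confidential/financial', '/legacy/']
```

Which is exactly why listing a sensitive path in robots.txt is closer to advertising it than hiding it.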
So, what's the point of this story? What's my cautionary tale or message? Well, there are a few. First, when you migrate legacy data, you may see a search engine impact, and robots.txt is one way to manage it. Second, robots.txt can also be a way of shooting yourself in the foot, so be careful where you put it and what you list in it.
Good luck, go forth, and DIY something great!