Every so often I get a request along the lines of, "Hey web analyst, run me a broken links report." Although simple enough, this task is not without its nuances. Here are 3 tactics you can use to find broken links on your own site when you (inevitably) get asked:
1) Set forth a spider
You may be the resident expert in whatever web analytics tool your company uses, and when someone asks you about broken links they may expect you to use that tool to find them. However, it's not your only choice! Spiders are worth considering for this task too, because they'll identify a slightly different set of broken links than your web analytics tool does.
Here's how I like to think about the difference: web analytics tools tell you which URLs visitors actually try and fail to view. Spiders tell you which URLs throw errors regardless of whether any visitor clicks the link or not, and regardless of whether the target URL lives on your site or elsewhere on the internet.
Newbie explanation: spiders are software programs that crawl pages on your site and scan for certain things like broken links. If your site employs a lot of JavaScript and forms and Flash and web 2.0 whiz-bang, you'll need to find a spider that can handle the complexity. There are basic free tools (like the W3C Link Checker) and fancy not-so-free tools (like this one from Accenture) and plenty of options in between. If you haven't done so already, go test-drive a spider.
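If you're curious what a spider does under the hood, here is a minimal sketch in Python, not a replacement for a real tool. It assumes plain HTML pages (no JavaScript-generated links), and the names LinkExtractor, extract_links and link_status are my own inventions for illustration. It pulls the links out of one page and asks each target for its status code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
import urllib.request
import urllib.error


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Skip mailto:, javascript: and other non-web schemes.
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the absolute http(s) links found in html, in page order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def link_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response came back.

    Assumption: the server answers HEAD requests sensibly. Some do not,
    in which case a GET request would be the fallback.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 404, 500 and friends arrive as exceptions
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout
```

To use it, you would feed extract_links the HTML of a page plus that page's URL, then call link_status on each result and keep the ones that come back 404 (or None). A real spider also follows links from page to page, respects robots.txt and throttles itself, all of which this sketch leaves out.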
2) Analyze your logs
Back in the old days, when nearly every web analyst dealt with log files, it was actually quite easy to get a broken link report because it came standard with log-processing solutions. These days, with so many of us using JavaScript tags alone for data collection, error tracking takes a bit of customization that you must initiate.
If your web analytics tool uses log files, now is the time to locate your broken link report. If your tool uses JavaScript tags, it's worth checking to see if broken link reporting has been set up. If, for whatever reason, you don't have broken link reporting yet, there's still hope! You can still get something useful from your raw logs.
Normally I'd say don't mess with your logs - it's a slippery slope and before you know it you'll find yourself writing your own web analytics application from scratch. Don't go there! However, logs can actually be useful for error reporting, at least as a once-off project. The basic information about a broken link is contained within a single row, so there's no need to sessionalize multiple rows into visits.
Here are a few guidelines for log handling, should you have the need. Since log files are quite large, your task will be easier if you pick a fairly short, recent date range for analysis. Next, you (or your sys admin) can filter the logs down to just the rows with status code 404. Pop your data into Excel (it fits in Excel, right?) and make sure you can identify what each column contains. The columns you'll be most interested in are the requested URL (the broken link itself) and the referring URL (the page on which the broken link appeared, if applicable). Finally, make a pivot table to identify the most common broken links and their sources.
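If the data doesn't fit in Excel after all, the same filter-and-pivot steps can be sketched in a few lines of Python. This is a rough sketch that assumes your server writes the common Apache/NGINX "combined" log format (IP, identity, user, timestamp, request, status, bytes, referrer, user agent); your own logs may differ, in which case the pattern needs adjusting. The names LOG_LINE and count_broken_links are made up for this example.

```python
import re
from collections import Counter

# Assumed layout ("combined" format):
# IP - user [timestamp] "METHOD /url PROTOCOL" status bytes "referrer" "agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)"'
)


def count_broken_links(lines):
    """Count 404 rows by (requested URL, referring URL), like a pivot table."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or match.group("status") != "404":
            continue  # not parseable, or not a broken link
        referrer = match.group("referrer")
        if referrer in ("", "-"):
            referrer = "(no referrer)"  # typed URL, bookmark, or stripped header
        counts[(match.group("url"), referrer)] += 1
    return counts
```

Passing an open log file to count_broken_links and calling .most_common(20) on the result gives the top twenty broken link and source pairs, most frequent first.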
3) Have a custom 404 page, and tag it
Allow me to demonstrate how to tell if any site has a custom error page: http://june.typepad.com/foo Oh no, a broken link! Now I know what my 404 page looks like, and by viewing source I can see if it's got a web analytics tag on it (as of this writing it does not). Try this on your own site. It will work like a charm unless you actually have a page called /foo.
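The "request a bogus URL and view source" check can also be scripted. Here's a small sketch under a few assumptions: the function names fetch_page and tags_found are hypothetical, and the marker strings you search for depend entirely on your analytics tool (the hostnames in the comment below are only examples of what a tag might contain).

```python
import urllib.request
import urllib.error


def fetch_page(url, timeout=10):
    """Return (status_code, body_text) for url, including error responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        # A 404 arrives as an exception, but the error page body is still readable.
        return error.code, error.read().decode("utf-8", errors="replace")


def tags_found(html, markers):
    """Return the marker strings that appear in the page source (case-insensitive)."""
    lowered = html.lower()
    return [marker for marker in markers if marker.lower() in lowered]


# Example markers only; substitute whatever your own tag contains.
# markers = ["google-analytics.com", "googletagmanager.com"]
```

You would call fetch_page on a URL you know doesn't exist, confirm the status is 404, and then pass the body to tags_found. An empty list back suggests your error page isn't tagged, although a tag loaded indirectly (through a tag manager, say) could slip past a simple string search like this.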
If you discover that you haven't got a custom 404 page, it's time to make one - it's simply good form. For inspiration, here's a practical and amusing post on how to create an effective error page.
As an extra step for those who use a tag-based web analytics tool, you will need to make sure your custom error page is properly tagged. The specifics will depend on the tool you use, so go to your solution's document library and read the fabulous manual. In the end you will get error reporting that's just as good as what log file solutions provide, with a lot less effort than parsing raw logs.
Now, after trying out the 3 tactics mentioned here, take all of the problems you've unearthed and submit them as bugs. It will take far longer to fix the errors than it did for you to find them in the first place. Mission accomplished!
Photo credit: Meteorry