Every so often I get a request along the lines of, "Hey web analyst, run me a broken links report." Although simple enough, this task is not without its nuances. Here are 3 tactics you can use to find broken links on your own site when you (inevitably) get asked:
1) Set forth a spider
You may be the resident expert in whatever web analytics tool your company uses, and when someone asks you about broken links they may expect you to find them with that tool. However, it's not your only choice! Spiders are worth considering for this task as well: they'll identify a slightly different set of broken links than the ones your web analytics tool reports.
Here's how I like to think about the difference: web analytics tools tell you which URLs visitors actually try and fail to view. Spiders tell you which URLs throw errors regardless of whether any visitor clicks the link or not, and regardless of whether the target URL lives on your site or elsewhere on the internet.
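To make the spider idea concrete, here is a minimal sketch in Python of what one does for a single page: pull out every link, request each one, and keep the ones that come back with an error. This is an illustration under assumptions rather than a replacement for a real crawler. The names LinkCollector, http_status and find_broken_links are my own, and a real spider would also follow links from page to page, respect robots.txt and throttle its requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.error
import urllib.request


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def http_status(url):
    """Return the HTTP status code for url, or None if the request fails outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 404, 500, and so on
    except OSError:
        return None  # DNS failure, refused connection, timeout


def find_broken_links(page_url, html, get_status=http_status):
    """Return (link, status) pairs for links on the page that error out.

    get_status is injectable so the logic can be tried without network access.
    """
    collector = LinkCollector()
    collector.feed(html)
    broken = []
    for href in collector.hrefs:
        link = urljoin(page_url, href)  # resolve relative links
        if not link.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript:, and similar
        status = get_status(link)
        if status is None or status >= 400:
            broken.append((link, status))
    return broken
```

Note that the check covers links to other sites as well as your own, which is exactly the part your web analytics tool can't see. One caveat: some servers answer HEAD requests with an error even when the page is fine, so treat the output as a list of suspects to verify.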
2) Analyze your logs
Normally I'd say don't mess with your logs - it's a slippery slope and before you know it you'll find yourself writing your own web analytics application from scratch. Don't go there! However, logs can actually be useful for error reporting, at least as a one-off project. The basic information about a broken link is contained within a single row, so there's no need to sessionize multiple rows into visits.
Here are a few guidelines for log handling, should you have the need. Since log files are quite large, your task will be easier if you pick a fairly short, recent date range for analysis. Next, you (or your sysadmin) can filter the logs down to just the rows with status code 404. Pop your data into Excel (it fits in Excel, right?) and make sure you can identify what each column contains. The columns you'll be most interested in are the requested URL (the broken link itself) and the referring URL (the page on which the broken link appeared, if applicable). Finally, make a pivot table to identify the most common broken links and their sources.
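If the data doesn't fit in Excel after all, the same filter-and-pivot steps can be sketched in a few lines of Python. This assumes your server writes the common "combined" log format, where the request, status code and referring URL appear in that order; your logs may be laid out differently, so check the pattern against a few real rows first. The function name count_broken_links is my own.

```python
import re
from collections import Counter

# Matches the request, status and referrer fields of a combined-format log row:
#   ... "GET /some/page HTTP/1.1" 404 512 "http://referring.page/" "user agent"
LOG_LINE = re.compile(
    r'"(?:\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)"'
)


def count_broken_links(lines):
    """Count 404 rows, keyed by (requested URL, referring URL)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("status") == "404":
            counts[(match.group("url"), match.group("referrer"))] += 1
    return counts
```

Pass it an open log file (or any list of rows) and call .most_common(20) on the result to get the equivalent of the pivot table: the top broken links along with the pages that point to them. A referrer of "-" means the request arrived without one, for example from a bookmark or a typed URL.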
3) Have a custom 404 page, and tag it
Allow me to demonstrate how to tell if any site has a custom error page. Request a URL that shouldn't exist, such as http://june.typepad.com/foo, and... oh no, a broken link! Now I know what my 404 page looks like, and by viewing source I can see whether it has a web analytics tag on it (as of this writing it does not). Try this on your own site. It will work like a charm unless you actually have a page called /foo.
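If you'd rather not view source by hand, the same check can be roughed out in Python. This is only a sketch under assumptions: fetch_error_page and error_page_is_tagged are names I made up, and the tag_marker you search for (typically a fragment of the script URL your analytics vendor has you paste into each page) depends entirely on your tool.

```python
import urllib.error
import urllib.request


def fetch_error_page(url):
    """Request url and return (status, body); HTTP errors are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # A 404 still carries a body: that body is the error page we want to inspect.
        return err.code, err.read().decode("utf-8", errors="replace")


def error_page_is_tagged(status, body, tag_marker):
    """True if the response is a 404 whose HTML contains the given tag marker."""
    return status == 404 and tag_marker.lower() in body.lower()
```

Call fetch_error_page on a URL that shouldn't exist on your site, then hand the status and body to error_page_is_tagged along with your marker. A status of 200 for a nonsense URL is worth investigating too, since it suggests the site serves its "not found" page without telling anyone it's an error.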
If you discover that you haven't got a custom 404 page, it's time to make one - it's simply good form. For inspiration, here's a practical and amusing post on how to create an effective error page.
As an extra step for those who use a tag-based web analytics tool, you will need to make sure your custom error page is properly tagged. The specifics will depend on the tool you use, so go to your solution's document library and read the fabulous manual. In the end you will get error reporting that's just as good as what you'd get from log files, with a lot less effort than parsing raw logs.
Now, after trying out the 3 tactics mentioned here, take all of the problems you've unearthed and submit them as bugs. It will take far longer to fix the errors than it did for you to find them in the first place. Mission accomplished!
Photo credit: Meteorry