Every so often I get a request along the lines of, "Hey web analyst, run me a broken links report." Although simple enough, this task is not without its nuances. Here are 3 tactics you can use to find broken links on your own site when you (inevitably) get asked:
1) Set forth a spider
You may be the resident expert in whatever web analytics tool your company uses, and when someone asks you about broken links they may expect that you'll use your web analytics tool to find broken links for them. However, it's not your only choice! Spiders are worth considering for this task, as well; they'll identify a slightly different set of broken links than you get from your web analytics tool.
Here's how I like to think about the difference: web analytics tools tell you which URLs visitors actually try and fail to view. Spiders tell you which URLs throw errors regardless of whether any visitor clicks the link or not, and regardless of whether the target URL lives on your site or elsewhere on the internet.
Newbie explanation: spiders are software programs that crawl pages on your site and scan for certain things like broken links. If your site employs a lot of JavaScript and forms and Flash and web 2.0 whiz-bang, you'll need to find a spider that can handle the complexity. There are basic free tools (like the W3C Link Checker) and fancy not-so-free tools (like this one from Accenture) and plenty of options in between. If you haven't done so already, go test-drive a spider.
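To make the idea concrete, here's a minimal sketch of what a spider does, using only the Python standard library. Real tools like the ones above recurse through your whole site and handle redirects, robots.txt, and so on; this sketch just checks the outbound links on a single page.

```python
# Minimal link checker: fetch one page, extract its hrefs, and
# report any that return an error status. A real spider would
# recurse into every page it discovers on the site.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def check_links(page_url):
    """Return (url, status) pairs for links on page_url that fail."""
    html = urlopen(page_url, timeout=10).read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    broken = []
    for href in parser.links:
        target = urljoin(page_url, href)  # resolve relative links
        if not target.startswith("http"):
            continue  # skip mailto:, javascript:, etc.
        try:
            urlopen(Request(target, method="HEAD"), timeout=10)
        except HTTPError as e:
            broken.append((target, e.code))   # 404, 500, ...
        except URLError:
            broken.append((target, None))     # DNS failure, timeout
    return broken
```

Note that this checks every link it finds, internal or external - which is exactly the difference from a web analytics tool described above.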
2) Analyze your logs
Back in the old days, when nearly every web analyst dealt with log files, it was actually quite easy to get a broken link report, because one came standard with log-processing solutions. These days, with so many of us relying on JavaScript tags alone for data collection, error tracking takes a bit of customization that you must initiate.
If your web analytics tool uses log files, now is the time to locate your broken link report. If your tool uses JavaScript tags, it's worth checking to see if broken link reporting has been set up. If, for whatever reason, you don't have broken link reporting yet, there's still hope! You can still get something useful from your raw logs.
Normally I'd say don't mess with your logs - it's a slippery slope, and before you know it you'll find yourself writing your own web analytics application from scratch. Don't go there! However, logs can actually be useful for error reporting, at least as a one-off project. The basic information about a broken link is contained within a single row, so there's no need to sessionize multiple rows into visits.
Here are a few guidelines for log handling, should you have the need. Since log files are quite large, your task will be easier if you pick a fairly short, recent date range for analysis. Next, you (or your sys admin) can filter down the logs so you just get the rows with status code 404. Pop your data into Excel (it fits in Excel, right?) and make sure you can identify what each column contains. The columns you'll be most interested in are requested URL (the broken link itself) and referring URL (the page on which the broken link appeared, if applicable). Finally, make a pivot table and identify the most common broken links and their sources.
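If Excel isn't handy, the filter-and-pivot steps above can be sketched in a few lines of Python. This assumes the common Apache/Nginx "combined" log format; adjust the regex if your server logs differently.

```python
# Count 404s in a combined-format access log, grouped by
# (requested URL, referrer) - the two columns you care about.
import re
from collections import Counter

# Pull out the request path, status code, and referrer field.
LINE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+ "(?P<ref>[^"]*)"'
)

def broken_links(log_lines):
    """Return [(url, referrer), count] pairs, most frequent first."""
    counts = Counter()
    for line in log_lines:
        m = LINE.search(line)
        if m and m.group("status") == "404":
            counts[(m.group("url"), m.group("ref"))] += 1
    return counts.most_common()

sample = [
    '1.2.3.4 - - [31/Jul/2008:09:42:00] "GET /old-page HTTP/1.1" 404 512 "http://example.com/index.html" "Mozilla"',
    '1.2.3.4 - - [31/Jul/2008:09:43:00] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla"',
]
print(broken_links(sample))
# -> [(('/old-page', 'http://example.com/index.html'), 1)]
```

In practice you'd feed it the file directly, e.g. `broken_links(open("access.log"))`.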
3) Have a custom 404 page, and tag it
Allow me to demonstrate how to tell if any site has a custom error page: http://june.typepad.com/foo Oh no, a broken link! Now I know what my 404 page looks like, and by viewing source I can see if it's got a web analytics tag on it (as of this writing it does not). Try this on your own site. It will work like a charm unless you actually have a page called /foo.
If you discover that you haven't got a custom 404 page, it's time to make one - it's simply good form. For inspiration, here's a practical and amusing post on how to create an effective error page.
As an extra step for those who use a tag-based web analytics tool, you will need to make sure your custom error page is properly tagged. The specifics will depend on the tool you use, so go to your solution's document library and read the fabulous manual. In the end you will get error reporting that's just as good as log file solutions, and a lot less effort than parsing raw logs.
Now, after trying out the 3 tactics mentioned here, take all of the problems you've unearthed and submit them as bugs. It will take far longer to fix the errors than it did for you to find them in the first place. Mission accomplished!
Photo credit: Meteorry
Nice post, June.
Always surprised that people don't spend time looking at errors (404s not being the only ones).
One you did miss is checking for broken links generated by other sites linking to your site. It's often easy to leverage the known value of where the 404 came from and then simply contact site owners to have them update their links. That way you don't have to spend week after week looking at errors that aren't your fault.
My personal favourite link checking tools:
- XENU's Link Sleuth: http://home.snafu.de/tilman/xenulink.html (Windows)
- Integrity: http://peacockmedia.co.uk/integrity (Mac)
Posted by: benry | July 31, 2008 at 09:42 PM
Good post. If there is one report that provides us with the most actionable data, it's our 404 report - both from a user experience and an SEO link building point of view.
At first we installed the out of the box tagged error tracking code provided by our web analytics vendor on our custom 404 page. Initially we were surprised that it was not part of the standard implementation guide.
Once implemented we saw the reason. It generated two separate reports: one of the broken links and one of the pages containing broken links. Unfortunately there was no correlation between the two. You could navigate to the host page, but you had to click every link to find the broken URL. It only takes one page to see how painful that could be.
Instead, our 404 page now stores both the referrer and the broken URL together within a custom variable. This has helped tremendously to fix internal links quickly and efficiently.
The 404 report also provides us the ability to identify external sites that are linking to us. We can now specify the page and specific link to the webmaster. It has helped improve our link building effort as well as provided an opportunity to optimize the text of the inbound link.
Good point on the spider option. That will definitely be next on the project list.
Regards,
Rob
Posted by: Rob Angley | August 02, 2008 at 06:26 AM
Benry and Rob, thank you for your comments! You both stress the importance of being able to find broken links to your site that exist elsewhere on the internet. Note that this is something you'll need to use your web analytics tool or raw logs for - you won't get it from a spider unless you crawl the entire universe (not recommended).
Also, Rob, thanks for sharing your experience regarding broken link tracking with page tags. The most valuable report - as you point out - is the one that shows the requested URL and the referring URL side-by-side.
Posted by: June Dershewitz | August 04, 2008 at 11:21 AM
Hi June,
Very interesting post - and appreciate the helpful link to the getelastic page.
Question - have you figured out how to set up a 404 page on Typepad? So far have not been able to do this with my own blog. I've a ticket opened with Typepad, so can let you know later if you are interested in the response.
Posted by: Alec Satin | August 24, 2008 at 05:13 PM
Alec: I don't know how to set up a custom 404 page on Typepad, and I'm definitely interested in the response you get from support. I imagine that most broken links on a blog will reference external sites, so an occasional spider run is a good idea.
Posted by: June Dershewitz | August 24, 2008 at 06:32 PM
Hi June - Typepad specifically does not support a custom 404 error page, nor do they allow editing of the default error page.
From their response:
Your account with TypePad is for the weblogging service and
file storage, not for a traditional hosting plan. Features
that are sometimes available with hosting plans such as a
404 page are not available with TypePad.
I'm very happy with TypePad overall - but just wish they had a little more flexibility. Still, am willing to trade that for the certainty of knowing that the technical issues are being managed professionally 24/7.
Posted by: Alec Satin | August 25, 2008 at 03:00 AM
This is a good list, although, realistically, I'd reverse the order. My tendency is to shy away from spidering the site looking for links -- not only does it miss inbound links, but it also doesn't do anything to prioritize the fixes. It's a little bit of a "if a tree falls in the woods and no one is there to hear it, does it make a sound?" situation: "if a link is busted, but no one ever clicks on it, does it need to be fixed?" Sure, given infinite resources, you want to eliminate all broken links. But, in my experience, you can get to a point of diminishing returns pretty quickly.
IF you need to convince other resources to make changes to fix the links, then it's best to be able to say, "Here's a list of bad links, and they're ordered in descending order by how much traffic they are getting."
Posted by: Tim Wilson | September 12, 2008 at 10:40 AM
Thanks for reading and commenting, Tim! You've made a really good point about the need to prioritize fixes based on traffic volume. I realize that I listed my recommendations in order of "level of IT involvement," since that is often a factor in whether or not 404s get tracked.
Posted by: June Dershewitz | September 15, 2008 at 09:48 AM
I wouldn't say broken links are more or less important depending on the traffic they get. Ranking algorithms are unaware of traffic volumes, so surely each one has to be treated equally?
Probably a better statistic would simply be how many times that broken link appears. If you link to a certain page from 90% of your site and that page breaks... that's a lot of broken links.
Another option for finding broken links: http://www.bitbotapp.com
Posted by: Henry T | November 07, 2008 at 04:53 AM
Thank you for this good post. It will help me a lot in my work. Broken links are a sensitive issue: when a visitor follows one they land on a 404, and those pages can't be crawled by Google, Yahoo, or MSN. Thanks again, and I hope you will post more articles like this one.
-faith-
Posted by: website seo services | March 16, 2009 at 06:11 PM
In my experience as an SEO, links are very important. It's through them that you market a specific client's website. If there are broken links in your campaign, you must find a solution immediately so that the link is useful again, not only to you but to the viewers online who want to click on it.
Posted by: seo expert | October 12, 2009 at 04:10 PM
"Have a custom 404 page, and tag it." - I applied this step in some of my websites and this really worked for me. In terms of SEO, we must make sure that every link is working so that search engine crawlers can find them and eventually index the website pages.
http://www.365outsource.com/seo-reseller
Posted by: Alex O | April 01, 2011 at 02:02 AM