
    April 23, 2008

    Comments


    Alex

    Hey June,

    Are you kidding me? This happens all the time! I went through a month of it with one client until we had sorted out all of the bugs.

    I usually recommend a 5% threshold of tolerance, sometimes up to 10% if the gap is reasonably consistent. It all depends on the metrics involved, the confidence needed to make business decisions, and the cost of sorting out the difference.

    -Alex
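
    To make Alex's rule of thumb concrete, here is a minimal sketch of that tolerance check; the visit counts, threshold value, and function name are illustrative assumptions, not figures from the post.

    ```python
    # Hypothetical visit counts for the same period from two systems
    # (all numbers are made up for illustration).
    tag_based_visits = 10_430
    log_based_visits = 11_020

    def percent_delta(reference: float, other: float) -> float:
        """Relative difference, as a percentage of the reference count."""
        return abs(reference - other) / reference * 100

    delta = percent_delta(tag_based_visits, log_based_visits)

    # Alex's suggested tolerance: 5%, stretching to 10% if the gap is consistent.
    THRESHOLD = 5.0
    if delta <= THRESHOLD:
        print(f"Delta of {delta:.1f}% is within tolerance")
    else:
        print(f"Delta of {delta:.1f}% exceeds tolerance -- worth investigating")
    ```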

    Tim Wilson

    One other thing to keep in mind is not to try to use the reconciliation to derive a single "multiplier" that can be used to map data from one system onto the other. We found this out when running a log-based system and a tag-based system in parallel and doing this reconciliation. We got down to the point of test pages and line-by-line comparison of the page tag log file against the server log file. We could explain ~70% of the differences... but were flat-out left scratching our heads over the rest.

    One thing we realized, though, was that while the trends would be the same over time, the % delta would vary based on the number of pages being reported on. Our theory was that there were a lot of spiders, either unfiltered by our log-based system or cloaked by their owners, that would hit random pages on our site but not the entire site. These would boost the "visits to the entire site" number quite a bit. If we looked at a single page, especially a page buried somewhere deep in the site, then fewer of these spiders would hit it.

    We let the systems run in parallel for six months and told people to ask us if they wanted a reconciliation. Typically, that meant we would trend the data from both systems and make sure they were consistent. Then we'd respond with the list of factors that drove the difference, though we didn't quantify each one.

    I tend to shoot for <20%, actually. And that's primarily when dealing with two tag-based systems -- one being a marketing automation platform that "includes web analytics" but is not a true web analytics tool.
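
    A minimal sketch of the kind of parallel-trend check Tim describes, assuming hypothetical monthly visit totals from the two systems; the numbers, the divergence flag, and the function name are illustrative, not from the post.

    ```python
    # Hypothetical monthly visit counts from two systems run in parallel.
    log_based = [52_000, 55_400, 61_200, 58_900, 63_100, 66_800]
    tag_based = [44_100, 46_900, 52_300, 50_000, 53_700, 57_200]

    def month_over_month(series):
        """Percent change from each month to the next."""
        return [(b - a) / a * 100 for a, b in zip(series, series[1:])]

    log_trend = month_over_month(log_based)
    tag_trend = month_over_month(tag_based)

    # The absolute levels differ, but if both systems are healthy the
    # month-over-month changes should move together.
    for month, (lg, tg) in enumerate(zip(log_trend, tag_trend), start=2):
        flag = "" if abs(lg - tg) < 3 else "  <-- diverging, worth a closer look"
        print(f"Month {month}: log {lg:+.1f}%  tag {tg:+.1f}%{flag}")
    ```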

    June Dershewitz

    Alex: Your comment makes it absolutely clear that we have a shared experience here. :)

    Tim: I appreciate your observation that the variation in delta depends on the number of pages in the sample set. I have seen that, too. That's exactly why I recommended, in step #4, isolating a subset of data based on a common attribute like URL. And yeah, spiders can be a major culprit: it's worth learning, at the beginning of a reconciliation project, exactly how each system handles the filtering of spiders and other traffic.
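
    As a rough illustration of the per-URL subset approach June mentions (step #4), here is a sketch that compares the two systems one URL at a time; the URLs, counts, and variable names are hypothetical.

    ```python
    # Hypothetical page-level visit counts keyed by URL, one dict per system.
    log_based = {"/": 8_200, "/products": 3_100, "/blog/deep-page": 240}
    tag_based = {"/": 6_900, "/products": 2_950, "/blog/deep-page": 235}

    # Spider traffic tends to inflate the site-wide totals more than a
    # single buried page, so per-URL deltas can look quite different
    # from the overall delta.
    for url in sorted(log_based):
        log_count = log_based[url]
        tag_count = tag_based.get(url, 0)
        delta = abs(log_count - tag_count) / log_count * 100
        print(f"{url:<20} log={log_count:>6} tag={tag_count:>6} delta={delta:.1f}%")
    ```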

    The comments to this entry are closed.