to assess how reasonable it was for the search engine to return each result, and whether or not the search engine put it in the right place. He recorded an r, n, m, or i for each result in a spreadsheet, as shown in Figure 1-2.
http://www.flickr.com/photos/rosenfeldmedia/5690980818/
Figure 1-2. Each result for each query was rated as Relevant, Near, Misplaced, or Irrelevant.
John then used a few different ways to calculate precision for each query. He came up with three simple standards—strict, loose, and permissive—to reflect a range of tolerances for different levels of precision.
Strict: Only results rated as relevant were acceptable (r).
Loose: Both relevant and near results were counted (r+n).
Permissive: Relevant, near, and misplaced results were counted (r+n+m).
You can see how each query scored differently for each of these three precision standards in Figure 1-3. For example, of the first five search results for the query “reserve room,” two were relevant (r), two were nearly relevant (n), and one was misplaced (m). In strict terms, precision was 40% (two of five results were relevant); in loose terms, 80% (four of five were relevant or nearly relevant); and in permissive terms, 100% (all five were relevant, nearly relevant, or misplaced).
http://www.flickr.com/photos/rosenfeldmedia/5690405259/
Figure 1-3. Each query’s precision scores were then calculated in three different ways: Strict, Loose, and Permissive.
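The three precision standards can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not John's actual spreadsheet: the function name, the dictionary of standards, and the order of the ratings in the sample list are all assumptions made for the example.

```python
from collections import Counter

# Rating codes that count as acceptable under each standard (taken from the text).
STANDARDS = {
    "strict": {"r"},
    "loose": {"r", "n"},
    "permissive": {"r", "n", "m"},
}


def precision_scores(ratings):
    """Return the share of rated results that are acceptable under each standard.

    `ratings` is a list of one-letter codes: r, n, m, or i.
    (Hypothetical helper, not from the original case study.)
    """
    if not ratings:
        raise ValueError("need at least one rated result")
    counts = Counter(ratings)
    total = len(ratings)
    return {
        name: sum(counts[code] for code in accepted) / total
        for name, accepted in STANDARDS.items()
    }


# The "reserve room" example: two relevant, two near, one misplaced.
# The order of the codes is assumed; the text gives only the counts.
reserve_room = ["r", "r", "n", "n", "m"]
scores = precision_scores(reserve_room)
print(scores)
```

For the “reserve room” ratings, this gives 0.4 for strict, 0.8 for loose, and 1.0 for permissive, matching the 40%, 80%, and 100% figures worked out by hand.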
[1] Chris Anderson’s excellent book The Long Tail (Hyperion, 2006) described the long tail phenomenon and its impact on commerce sites like Amazon and Netflix.
[2] In web analytics, these are referred to as accuracy and precision.
The Brake Works—Thanks to Site Search Analytics
John’s two tests of the original search engine—relevancy and precision—yielded two sets of corresponding metrics that helped his team compare the new engine’s performance against the old one (shown in Figure 1-4). The five relevancy metrics above the line were all based on how close to the top position the “ideal search result” placed. So the smaller the number, the better. For the “Target”—the benchmark figures based on the old search engine—the top queries’ ideal results placed, on average, three places below #1, where they ideally would have been displayed. John looked at the same data in different ways, using a median count, and three percentages that showed how often the ideal result was below the #1, #5, and #10 positions, respectively.
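As a rough sketch, the five relevancy metrics could be computed from a list holding the position at which each query's ideal result appeared (1 being the top spot). The function name, the metric labels, and the sample positions below are invented for illustration, and the sketch assumes every query's ideal result shows up somewhere in the results; the text doesn't say how John handled results that never appeared.

```python
from statistics import mean, median


def relevancy_metrics(positions):
    """Summarize where each query's ideal result placed (1 = top position).

    Smaller values are better for every metric returned.
    (Hypothetical helper, not from the original case study.)
    """
    if not positions:
        raise ValueError("need at least one position")
    total = len(positions)

    def share_below(cutoff):
        # Fraction of queries whose ideal result placed lower than `cutoff`.
        return sum(1 for p in positions if p > cutoff) / total

    return {
        "mean_position": mean(positions),
        "median_position": median(positions),
        "share_below_1": share_below(1),
        "share_below_5": share_below(5),
        "share_below_10": share_below(10),
    }


# Made-up positions for five queries, not Vanguard's data.
sample_positions = [1, 1, 2, 4, 12]
metrics = relevancy_metrics(sample_positions)
print(metrics)
```

Looking at the mean and the median side by side is useful because a single badly placed result (position 12 in the sample) pulls the mean down the page much more than it moves the median.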
John used different metrics for precision as well—the strict, loose, and permissive measures described previously. In this case, bigger numbers were better because they meant a higher percentage of the top five results were relevant. As mentioned, the “Target” scores were the benchmark; they showed how the old search engine was performing. And the “Oct 3” scores showed how the new search engine was performing. The verdict, as you can see in Figure 1-4, was not pretty.
http://www.flickr.com/photos/rosenfeldmedia/5690405181/
Figure 1-4. The new search engine (“Oct 3”) performed worse than the old one (“Target”) for each metric.
Ouch. The numbers didn’t lie—the new search engine, first measured on October 3, was performing worse on each metric than the old engine!
John and his project manager now had the proof they needed to convince the IT folks that the new engine’s poor performance wasn’t just something that came to him after a hard night of partying. The problem was real and serious, dire enough that he thought people could lose their jobs if the new search engine launched as is. IT responded accordingly. While the staff were still obligated to meet the same launch deadline, they eliminated some planned features in favor of fixing the problem. Over the coming weeks, they identified the sources of the problems. The primary culprit, an error in a configuration file that Vanguard’s search engine consultant had missed, was fortunately a fairly simple fix. And it wouldn’t have been detected without site search analytics.
You can see how their work progressed in Figure 1-5. By launch, they had brought the new search engine (as of October 16) close to the old one’s performance on each of the eight metrics.
http://www.flickr.com/photos/rosenfeldmedia/5690405199/
Figure 1-5. As the launch date approached, the new search engine’s performance improved dramatically—to the point where it had caught up with the old engine’s performance.
So John’s gut reaction was validated, and he had the numbers to back up his argument that some hard work was in order before the new engine launched. The search experience was measured, a problem was recognized and identified, the search engine was fixed, firings were averted, and egg-on-face was avoided. Since the launch, Vanguard has continued to monitor these metrics and fine-tune the engine’s performance accordingly. It’s now performing much, much better than the original search engine. And that’s where our happy story ends.
Moral of the Story: Be Like John
There’s an important takeaway from this case study: UX practitioners and other designers should not only pay more attention to the numbers; they also have a responsibility to use quantitative approaches to research and evaluate the user experience. If no one at Vanguard had taken on this responsibility, the entire project might have failed miserably.
And though it’s not a lesson, there’s another important point worth remembering: this is just the tip of the SSA iceberg. There’s much more that can and should be done with your site’s search query data.
In this book, I’ll cover many of the ways you can use SSA to better align your site with your business strategy,