log file—something your site’s search engine likely does automatically. Or the search activity gets intercepted, like other analytics data, by a snippet of JavaScript code embedded in each page and template. The intercepted data then gets stored in a database. That’s how Google Analytics, Omniture, Unica, and other analytics applications do it. You really don’t need to know much about how this code works, but now you can at least claim to have seen it.
<script type="text/javascript" src="http://www.google-analytics.com/urchin.js "> </script><script type="text/javascript">_uacct = "UA-xxxxxx-x"urchinTracker(); </script>
Although search engines and your analytics application may gather search data, they’re traditionally and disappointingly remiss at providing reports on site search performance. Even when they do, you still may want to get at the raw data to analyze and learn things that the reports—which tend to be quite generic—won’t tell you.[3] So it’s useful to know the basic anatomy of search data because it will help you understand what can and can’t be analyzed. We’ll cover just the basics here. (See Avi Rappoport’s more extensive coverage of the topic at the end of this chapter.)
Minimally, your data consists of records of queries that were submitted to your site’s search engine. On a good day, your data will also include the number of results each query retrieved. On a really good day, each query will be date/time stamped so you can get an idea of when different searches were happening. On a really, really good day, your data will also include information on who is actually doing the searching: an individual, identified by tracking her cookie, or a segment of users that you determine by their login credentials.
Here’s a tiny sample of query data that must have arrived on one of those really, really good days. It comes from a U.S. state government Web site that uses the Google Search Appliance. It’s really ugly stuff, so to make it more readable, we’ve bolded the critical elements: IP address, time/date stamp, query, and # of results:
XXX.XXX.X.104 - - [10/Jul/2006:10:25:46 -0800] "GET /search?access=p&entqr=0&output=xml_no_dtd&sort=date%3AD%3AL%3Ad1&ud=1&site=AllSites&ie=UTF-8&client=www&oe=UTF-8&proxystylesheet=www&q=lincense+plate&ip=XXX.XXX.X.104 HTTP/1.1" 200 971 0 0.02

XXX.XXX.X.104 - - [10/Jul/2006:10:25:48 -0800] "GET /search?access=p&entqr=0&output=xml_no_dtd&sort=date%3AD%3AL%3Ad1&ie=UTF-8&client=www&q=license+plate&ud=1&site=AllSites&spell=1&oe=UTF-8&proxystylesheet=www&ip=XXX.XXX.X.104 HTTP/1.1" 200 8283 146 0.16
Even with a little bit of data—in this case, two queries—we can learn something about how people search a site. In this case, the searcher from IP address ...104 entered lincense plate at 10:25 a.m. on July 10, 2006, and retrieved zero results (that’s the next-to-last number in each record). No surprise there. Just a couple seconds later, the searcher entered license plate and retrieved 146 results.
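If you’d like to pull those fields out of the raw log yourself, a short script will do it. Here’s a minimal sketch in Python (an alternative to the Perl parser mentioned in the footnote) that assumes Google Search Appliance-style records like the two above; the function name and field layout are ours, and your own engine’s logs will likely differ:

import re
from urllib.parse import parse_qs, urlparse

# Matches GSA-style records like the two shown above: IP, timestamp, request,
# status code, bytes returned, number of results, and response time.
# The layout is assumed from that sample; other engines log differently.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - - \[(?P<timestamp>[^\]]+)\] '
    r'"GET (?P<path>\S+) HTTP/1\.\d" '
    r'(?P<status>\d+) (?P<bytes>\d+) (?P<results>\d+) (?P<seconds>[\d.]+)'
)

def parse_search_log_line(line):
    """Pull the IP, timestamp, query, and result count out of one record."""
    match = LOG_PATTERN.match(line.strip())
    if not match:
        return None
    params = parse_qs(urlparse(match.group("path")).query)
    return {
        "ip": match.group("ip"),
        "timestamp": match.group("timestamp"),
        "query": params.get("q", [""])[0],   # e.g., "license plate"
        "results": int(match.group("results")),
    }

Run over a whole log file, something like this yields one record per query, ready to drop into a spreadsheet or the Excel template linked in the footnote.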
These are just two queries, but they certainly can get you thinking. For example, we might reasonably guess that the first effort was a typo. If, during our analysis, we saw lots more typos, we probably ought to make sure the search engine could handle spellchecking. And we might want to make extra sure that, if license plate was a frequent query, the site contained good content on license plates, and that it always came up at the top of the search results page. There are many more questions and ideas that would come up from reviewing the search data. But most of all, we’d like to know if the users were happy with the experience. In this example, were they?
Heaven knows. The data is good at telling us what happened, but it doesn’t tell us why the session ended there. You’d need to use a qualitative research method if you wanted to learn more. (We’ll get into this what/why dichotomy quite a bit in Chapter 11.)
[3] Once you have the raw data, you’ll need to parse out the good stuff, and then use a spreadsheet or application to analyze it. Here’s a Perl script from the good people at Michigan State University that you can use to parse it: www.rosenfeldmedia.com/books/searchanalytics/content/code_samples/. And here’s a spreadsheet you can use to analyze it: http://rosenfeldmedia.com/books/searchanalytics/blog/free_ms_excel_template_for_ana/

George Kingsley Zipf, Harvard Linguist and Hockey Star
Of course, we’ve just been looking at a tiny slice of a search log. And as interesting as it is, the true power of SSA comes from collectively analyzing the thousands or millions of such interactions that take place on your site during a given period of time. That’s when the patterns emerge, when trends take shape, and when there’s enough activity to merit measuring—and drawing interesting conclusions.
Nowhere is the value of statistical analysis more apparent than when viewing the Zipf Distribution, named for Harvard linguist George Kingsley Zipf, who, as you’d expect from a linguist, liked to count words.[4] He found that a few terms were used quite often, while many were hardly used at all. We find the same thing when tallying up queries from most to least frequent, as in Figure 2-4.
The Zipf distribution—which emerges when tallying just about any site’s search data—shows that the few most common queries account for a surprisingly large portion of all search activity during any given period. (Remember how, in Chapter 1, John Ferrara focused exclusively on those common queries.) You can see how tall and narrow what we’ll call the “short head” is, and how quickly it drops down to the “long tail” of esoteric queries (technically described as “twosies” and “onesies”). In fact, we’re only showing the first 500 or so queries here; in reality, this site’s long tail would extend into the tens of thousands, many meters to the right of where you sit.
http://www.flickr.com/photos/rosenfeldmedia/5690405271/
Figure 2-4. The hockey-stick-shaped Zipf Distribution shows that a few queries are very popular, while most are not. This example is from Michigan State University, but this distribution is true of just about every Web site and intranet.
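If you want to see your own site’s curve, the tally itself is simple. Here’s a rough sketch that builds on the hypothetical parse_search_log_line function above, counting queries from most to least frequent; the log file name is a placeholder:

from collections import Counter

def tally_queries(log_path):
    """Count how often each query was submitted, most frequent first."""
    counts = Counter()
    with open(log_path) as log_file:
        for line in log_file:
            record = parse_search_log_line(line)  # parsing sketch shown earlier
            if record and record["query"]:
                counts[record["query"].lower()] += 1
    return counts.most_common()

# ranked = tally_queries("search.log")
# Plotting the counts in rank order produces the short-head/long-tail curve.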
It’s equally enlightening to examine the same phenomenon when presented textually, as shown in Table 2-1.
The most common query, campus map, accounts for 1.4% of all the search activity during this time period. That number, 1.4%, doesn’t sound like much, but those top queries add up very quickly—the top 14 most common queries account for 10% of all search activity. (Note to MSU.edu webmaster: better make sure that relevant results come up when users search campus map!)
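Those percentages fall straight out of the ranked tally: divide the counts for the top queries by the total number of searches. A rough sketch, assuming the ranked list from the tally sketch above:

def cumulative_share(ranked_counts, top_n):
    """Percentage of all searches accounted for by the top_n most common queries."""
    total = sum(count for _, count in ranked_counts)
    top = sum(count for _, count in ranked_counts[:top_n])
    return 100.0 * top / total if total else 0.0

# On data like MSU's, cumulative_share(ranked, 1) comes out around 1.4%
# and cumulative_share(ranked, 14) around 10%.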
http://www.flickr.com/photos/rosenfeldmedia/5825543717/
Table 2-1. The Zipf Distribution Shown