redesigning lclark.edu

Choosing and designing our site search

  • 9 April 2009
  • Robb Shecter

This month we’ve started planning how to build the best search ever for Lewis & Clark, and we’re strongly considering Google Site Search (GSS). It’s an excellent service, but we’re not sure it’s the right tool for the job. To find out, I ran a few quick comparisons between GSS and Sphinx on one of my websites, a new online version of the Oregon Revised Statutes (OregonLaws.org). When I built that site’s search feature, I went with Sphinx because I had a rich object model stored in a SQL database. It wasn’t much work: excluding look & feel, I implemented the search in a fraction of a day.

Back to the comparison. Right off the bat, I found a few problems with Google’s results. (NB: I’m not concerned here with differences in appearance or with the snippets. I also verified that Google had indexed the pages I’d like it to find.)

Here are my site’s results for “robbery”:

And here are the GSS results:

The problems seen with Google Site Search in this small test

1. A problem of unwanted exclusion: Robbery in the second degree is missing. Notice also that the results are limited to one page, and that the “very similar” remainder can only be seen after clicking the link. Maybe Robbery 2 would appear there.

2. A problem of unwanted inclusion: (I don’t care about the blog hits; those can be filtered out.) Notice that 166.715 Definitions appears in the GSS results but not in OregonLaws.org’s. This page is largely irrelevant to robbery. So why did Google rank it so high? Google is solving a different problem than OregonLaws.org. Google indexes web pages, but it can’t know how important each one is. Its innovation is to look at the number and quality of links to a page, and to treat each link as a “vote” for that page. My theory is that GSS ranks these Definitions pages high because so many other pages on the site link back to them.
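To make the “votes” idea concrete, here’s a toy sketch in Ruby. It is not Google’s algorithm, which also weighs the quality of each link; it only counts inbound links. The page names and the link graph are invented for illustration, and `inbound_votes` and `rank_by_votes` are hypothetical helper names.

```ruby
# Toy illustration only: rank pages purely by how many pages link to them.
# Real link analysis also weighs the quality of each linking page.

# Count inbound links; every link is one "vote" for its target page.
def inbound_votes(links)
  votes = Hash.new(0)
  links.each_value do |targets|
    targets.each { |target| votes[target] += 1 }
  end
  votes
end

# Order pages by vote count (highest first), breaking ties by name.
def rank_by_votes(links)
  votes = inbound_votes(links)
  links.keys.sort_by { |page| [-votes[page], page] }
end

# Hypothetical link graph: each page maps to the pages it links to.
links = {
  "robbery-1"   => ["definitions", "robbery-2"],
  "robbery-2"   => ["definitions", "robbery-1"],
  "theft-1"     => ["definitions"],
  "burglary-1"  => ["definitions"],
  "definitions" => []
}

puts rank_by_votes(links).inspect
```

In this made-up graph the definitions page comes out on top, simply because every other page links to it, regardless of what the searcher was looking for.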

But what about OregonLaws.org’s search? How does it know to rank the Definitions pages so low? Easy. Because I built the site, I know which pieces are important, so I don’t need to look at something as tangentially related as incoming links. Take a look: here’s the index definition I used to implement the search for ORS Sections:

 define_index do
   indexes title
   indexes body
   indexes number
   indexes annotations
   # Relative weight of each field when ranking matches
   set_property :field_weights =>
     {"number" => 10, "title" => 6, "body" => 3, "annotations" => 2}
 end

This is Ruby on Rails code, and it should be easy to see what’s going on. My website assembles pages from “objects”. One object type is “ORS Section”, which has attributes such as title, body, number, and annotations. It has other attributes too, but I don’t want the search to take them into account, so I left them out. Finally, I’ve set relative weights for each of these fields, and those weights produce the relevant search results I want.
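For a rough feel of what field weights do, here’s a simplified sketch in plain Ruby. It is not how Sphinx actually ranks (its ranker is more sophisticated); it only assumes that a match in a heavier field should count for more. The sample sections are invented, and `weighted_score` and `search` are hypothetical helpers, not part of any library.

```ruby
# Simplified sketch: score = sum over fields of (weight * match count).
# This only illustrates the effect of weights, not Sphinx's real ranking.

# Score one document for a search term using per-field weights.
def weighted_score(doc, term, weights)
  weights.sum do |field, weight|
    occurrences = doc.fetch(field, "").to_s.downcase.scan(term.downcase).length
    weight * occurrences
  end
end

# Return the titles of matching documents, best score first.
def search(docs, term, weights)
  docs.map { |doc| [doc, weighted_score(doc, term, weights)] }
      .select { |_, score| score > 0 }
      .sort_by { |_, score| -score }
      .map { |doc, _| doc["title"] }
end

# Same weights as in the index definition.
field_weights = { "number" => 10, "title" => 6, "body" => 3, "annotations" => 2 }

# Hypothetical sections with made-up numbers and text.
sections = [
  { "number" => "000.002", "title" => "Definitions",
    "body" => "Made-up text defining terms such as robbery.", "annotations" => "" },
  { "number" => "000.001", "title" => "Robbery in the first degree",
    "body" => "Made-up text that mentions robbery once.", "annotations" => "" }
]

puts search(sections, "robbery", field_weights).inspect
```

With these made-up sections, the page with “robbery” in its title outranks the definitions page that only mentions the word in its body, which is the behavior the weights are meant to encourage.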

Epilog: Another small problem

Google has “Similar pages” links. OregonLaws.org has “more like this” links. Here are the results from clicking the respective link under Robbery in the first degree:
