thinking sphinx
Choosing and designing our site search
- 9 April 2009
This month we’ve started planning how to build the best search ever for Lewis & Clark, and we’re strongly considering Google Site Search (GSS). It’s an excellent service, but we’re not sure it’s the right tool for the job. I ran a few quick comparisons between GSS and Sphinx on one of my websites, a new online version of the Oregon Revised Statutes. When I built that site’s search feature, I went with Sphinx because I had a rich object model stored in a SQL database. It wasn’t much work: excluding look and feel, I implemented the search in a fraction of a day.
Back to the comparison. Right off the bat, I found a few problems with Google’s results. (NB: I’m not concerned here with differences in appearance or in the snippets. I also verified that Google had indexed the pages I’d like it to find.)
Here are my site’s results for “robbery”:
And here are the GSS results:
The problems seen with Google Site Search in this small test
1. A problem of unwanted exclusion: Robbery in the second degree is missing. Notice also that the results are limited to one page; the remaining “very similar” results appear only after clicking a link. Maybe Robbery 2 would appear there.
2. A problem of unwanted inclusion: (I don’t care about the blog hits; those can be filtered out.) Notice that 166.715 Definitions appears in the GSS results but not in OregonLaws.org’s. This page is actually fairly irrelevant to robbery. So why did Google rank it so high? Google is solving a different problem than OregonLaws.org. Google indexes web pages, but it can’t know how important each one is. Its innovation is to look at the number and quality of links to a page and to count each link as a “vote” for that page. My theory is that GSS ranks these Definitions pages high because so many other pages on the site link back to them.
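To make the “links as votes” idea concrete, here is a minimal Ruby sketch. It is a deliberate simplification (it only counts inbound links and ignores link quality) and is not Google’s actual algorithm. The page names and the link graph are hypothetical, invented for illustration.

```ruby
# Toy "links as votes" ranking: a simplification, NOT Google's real algorithm.
# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
  "robbery-1"   => ["definitions"],
  "robbery-2"   => ["definitions"],
  "robbery-3"   => ["definitions", "robbery-1"],
  "definitions" => []
}.freeze

# Count how many inbound links ("votes") each page receives.
def inbound_votes(links)
  votes = Hash.new(0)
  links.each_value { |targets| targets.each { |target| votes[target] += 1 } }
  votes
end

votes  = inbound_votes(LINKS)
ranked = LINKS.keys.sort_by { |page| -votes[page] }
puts ranked.first
```

In this made-up graph the definitions page collects the most votes simply because every other page links to it, regardless of what the searcher typed.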
But what about OregonLaws.org’s search? How does it know to rank the Definitions pages so low? Easy. Because I made the site, I know which pieces are important. I don’t need to look at something as tangentially related as incoming links. Take a look: here’s the index definition I used to implement the search for ORS Sections:
define_index do
  indexes title
  indexes body
  indexes number
  indexes annotations
  set_property :field_weights =>
    {"number" => 10, "title" => 6, "body" => 3, "annotations" => 2}
end
This is Ruby on Rails code, written with Thinking Sphinx. It should be easy to see what’s going on. My website assembles pages from “objects”. One object type is “ORS Section”, which has attributes such as title, body, number, and annotations. It has other attributes too, but I don’t want the search to account for them, so I left those out. Finally, I’ve set relative weights for each of these fields; a match in the section number counts the most and a match in the annotations counts the least, which produces the relevant search results I want.
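To show why field weights push a page like Definitions down, here is a rough Ruby sketch. It is not how Sphinx actually computes relevance; it just adds up the weight of every field that contains the search term. The `weighted_score` helper and the sample documents (including their body text) are hypothetical placeholders, and only the weights come from the index definition above.

```ruby
# Toy illustration of field weighting; NOT Sphinx's real ranking formula.
FIELD_WEIGHTS = { "number" => 10, "title" => 6, "body" => 3, "annotations" => 2 }.freeze

# Score a document (field name => text) against one search term:
# every field containing the term contributes that field's weight.
def weighted_score(doc, term)
  FIELD_WEIGHTS.sum do |field, weight|
    doc.fetch(field, "").downcase.include?(term.downcase) ? weight : 0
  end
end

# Hypothetical sample documents; the body text is placeholder wording,
# not the actual statute text. Other fields are omitted for brevity.
robbery = {
  "title" => "Robbery in the first degree",
  "body"  => "A person commits robbery in the first degree if ..."
}
definitions = {
  "title" => "Definitions",
  "body"  => "... a list of crimes that includes robbery ..."
}

ranked = [definitions, robbery].sort_by { |doc| -weighted_score(doc, "robbery") }
puts ranked.map { |doc| doc["title"] }
```

Under these assumptions the robbery section matches in both title and body, while the Definitions page matches only in its body, so it sorts lower no matter how many pages link to it.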
Epilog: Another small problem
Google has “similar pages” links. OregonLaws.org has “more like this” links. Here are the results when clicking the respective similar/more link under Robbery in the first degree: