Friday, April 17, 2009

Stem Searching with couchdb-lucene

Next up in my scenarios is "Matching a word stem in the recipe instructions". Stemming reduces related words to a common root, so that searching for the word "whisk" will match documents containing the word "whisking".
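
To make the idea concrete, here is a toy sketch in Ruby. It only strips a few common suffixes, so it is not the Porter algorithm that Lucene uses, and the names toy_stem and stem_match? are made up for this illustration:

```ruby
# Toy suffix-stripping stemmer, for illustration only.
# This is NOT the Porter algorithm; it just drops a few common endings.
SUFFIXES = %w[ing ed s].freeze

# Lowercase the word and remove the first matching suffix, as long as
# at least three characters of the word would remain.
def toy_stem(word)
  word = word.downcase
  suffix = SUFFIXES.find { |s| word.end_with?(s) && word.length > s.length + 2 }
  suffix ? word[0...-suffix.length] : word
end

# True if any word in the text shares a (toy) stem with the query.
def stem_match?(query, text)
  stems = text.split(/\W+/).map { |w| toy_stem(w) }
  stems.include?(toy_stem(query))
end

puts toy_stem("whisking")
puts stem_match?("whisk", "whisking the eggs")
puts stem_match?("whisk", "mixing together dry ingredients")
```

The point is that both the indexed text and the query are reduced the same way before they are compared, which is why "whisk" can find "whisking" but not "mixing".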

The entire scenario:

    Scenario: Matching a word stem in the recipe instructions
      Given a "pancake" recipe with instructions "mixing together dry ingredients"
      And a "french toast" recipe with instructions "whisking the eggs"
      And a 1 second wait to allow the search index to be updated
      When I search for "whisk"
      Then I should not see the "pancake" recipe in the search results
      And I should see the "french toast" recipe in the search results

As with the last scenario, there are relatively few steps that need to be implemented anew. The "Given a recipe with instructions" step can be implemented thusly:

    Given /^a "(.+)" recipe with instructions "(.+)"$/ do |title, instructions|
      date = Date.new(2009, 4, 16)
      permalink = "id-#{title.gsub(/\W/, '-')}"

      recipe = {
        :title => title,
        :date => date,
        :instructions => instructions
      }

      RestClient.put "#{@@db}/#{permalink}",
        recipe.to_json,
        :content_type => 'application/json'
    end

This is really starting to look familiar. My red-green-refactor cycle may need a little more refactor. Another day.

With that in place, I have but one failure remaining:
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Matching a word stem in the recipe instructions
Given a "pancake" recipe with instructions "mixing together dry ingredients"
And a "french toast" recipe with instructions "whisking the eggs"
And a 1 second wait to allow the search index to be updated
When I search for "whisk"
Then I should not see the "pancake" recipe in the search results
And I should see the "french toast" recipe in the search results
expected following output to contain a <a href='/recipes/id-french toast'>french toast</a> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><table><tr>
<th>Name</th>
<th>Date</th>
</tr></table></body></html> (Spec::Expectations::ExpectationNotMetError)
./features/step_definitions/recipe_search.rb:82:in `And /^I should see the "(.+)" recipe in the search results$/'
features/recipe_search.feature:32:in `And I should see the "french toast" recipe in the search results'


1 scenario
5 steps passed
1 step failed
This failure shows that no recipes are showing up in the search results, which means that stemming is not being used in couchdb-lucene. Inspecting src/main/java/com/github/rnewson/couchdb/lucene/Config.java, one can see that it uses the (non-stemming) StandardAnalyzer:

    ...
    final class Config {

      static final Analyzer ANALYZER = new StandardAnalyzer();
      ...
    }

To get it using a custom (stemming) analyzer, create src/main/java/com/github/rnewson/couchdb/lucene/MyAnalyzer.java:

    package com.github.rnewson.couchdb.lucene;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.LowerCaseTokenizer;
    import org.apache.lucene.analysis.PorterStemFilter;

    import java.io.Reader;

    class MyAnalyzer extends Analyzer {
      public final TokenStream tokenStream(String fieldName, Reader reader) {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
      }
    }

There is nothing fancy in there; it is taken directly from the Lucene documentation. Then, change the configuration to use MyAnalyzer:

    ...
    final class Config {

      static final Analyzer ANALYZER = new MyAnalyzer();

      ...
    }

Finally, compile the jar files with maven by invoking mvn. My local development version of CouchDB is already pointing to the compiled jar, so all I need to do is start it up with ./utils/run and re-run cucumber:
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature -n -s "Matching a word stem in the recipe instructions"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Matching a word stem in the recipe instructions
Given a "pancake" recipe with instructions "mixing together dry ingredients"
And a "french toast" recipe with instructions "whisking the eggs"
And a 1 second wait to allow the search index to be updated
When I search for "whisk"
Then I should not see the "pancake" recipe in the search results
And I should see the "french toast" recipe in the search results
expected following output to contain a <a href='/recipes/id-french toast'>french toast</a> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><table>
<tr>
<th>Name</th>
<th>Date</th>
</tr>
<tr class="row0">
<td>
<a href="/recipes/id-french-toast">french toast</a>
</td>
<td>2009-04-16</td>
</tr>
</table></body></html> (Spec::Expectations::ExpectationNotMetError)
./features/step_definitions/recipe_search.rb:82:in `And /^I should see the "(.+)" recipe in the search results$/'
features/recipe_search.feature:32:in `And I should see the "french toast" recipe in the search results'


1 scenario
5 steps passed
1 step failed
Hunh?! The french toast recipe (the one that requires "whisking") is now showing up in the search results, so why is it failing?

Ah nuts, the link being tested for is missing a dash: the step expects "id-french toast", but the permalink is "id-french-toast". Add the same gsub to the step:

    Then /^I should see the "(.+)" recipe in the search results$/ do |title|
      response.should have_selector("a",
        :href => "/recipes/id-#{title.gsub(/\W/, '-')}",
        :content => title)
    end

And we have verified stemming working!
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature -n -s "Matching a word stem in the recipe instructions"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Matching a word stem in the recipe instructions
Given a "pancake" recipe with instructions "mixing together dry ingredients"
And a "french toast" recipe with instructions "whisking the eggs"
And a 1 second wait to allow the search index to be updated
When I search for "whisk"
Then I should not see the "pancake" recipe in the search results
And I should see the "french toast" recipe in the search results


1 scenario
6 steps passed
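
The permalink expression now lives in both the Given step and the Then step, which is how the two drifted apart in the first place. One possible refactoring, sketched here with hypothetical helper names (recipe_permalink and recipe_path are not in the actual step definitions), would be to build the ID in a single place:

```ruby
# Hypothetical helpers; these names are assumptions made for this sketch.

# Build the CouchDB document ID from a recipe title by replacing every
# non-word character (spaces, punctuation) with a dash.
def recipe_permalink(title)
  "id-#{title.gsub(/\W/, '-')}"
end

# Build the path that the search results are expected to link to.
def recipe_path(title)
  "/recipes/#{recipe_permalink(title)}"
end

puts recipe_permalink("french toast")
puts recipe_path("french toast")
```

The Given step could then PUT to the URL built from recipe_permalink(title), and the Then step could check for an href of recipe_path(title), so the two cannot disagree about dashes again.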

Update: I forked couchdb-lucene so that I could continue to use the stemming analyzer, while still tracking changes to master.
