Thursday, April 30, 2009

Down the Sort Hole

With an understanding of couchdb-lucene sorting in hand, I get back to implementation. In the Sinatra application's spec for /recipes/search, I add:
    it "should sort" do
      RestClient.should_receive(:get).
        with(/sort=title/).
        and_return('{"total_rows":30,"skip":0,"limit":20,"rows":[]}')

      get "/recipes/search?q=title:egg&sort=title"
    end
I make this example pass by simply passing the sort parameter through to couchdb-lucene:
data = RestClient.get "#{@@db}/_fti?limit=20&skip=#{skip}&q=#{params[:q]}&sort=#{params[:sort]}"
That spec may pass, but my cucumber scenario no longer does:
cstrom@jaynestown:~/repos/eee-code$ cucumber -n features \
-s "Sorting (name, date, preparation time, number of ingredients)"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes

Scenario: Sorting (name, date, preparation time, number of ingredients)
Given 50 "delicious" recipes with ascending names, dates, preparation times, and number of ingredients
And a 0.5 second wait to allow the search index to be updated
When I search for "delicious"
HTTP status code 400 (RestClient::RequestFailed)
/usr/lib/ruby/1.8/net/http.rb:543:in `start'
./features/support/../../eee.rb:30:in `GET /recipes/search'
(eval):7:in `get'
features/recipe_search.feature:79:in `When I search for "delicious"'
Then I should see 20 results
When I click the "Name" column header
...
The cucumber scenario is not even reaching the sorting steps—it is failing on the simple search-for-a-string step. The cause of the failure is couchdb-lucene's dislike of empty (or non-indexed) sort fields. I have to guard against empty sort parameters:
    it "should not sort when no sort field is supplied" do
      RestClient.stub!(:get).
        and_return('{"total_rows":30,"skip":0,"limit":20,"rows":[]}')

      RestClient.should_not_receive(:get).with(/sort=/)

      get "/recipes/search?q=title:egg&sort="
    end
I can implement this example thusly:
    get '/recipes/search' do
      @query = params[:q]

      page = params[:page].to_i
      skip = (page < 2) ? 0 : ((page - 1) * 20) + 1

      couchdb_url = "#{@@db}/_fti?limit=20" +
        "&q=#{@query}" +
        "&skip=#{skip}"

      if params[:sort] =~ /\w/
        couchdb_url += "&sort=#{params[:sort]}"
      end

      data = RestClient.get couchdb_url

      @results = JSON.parse(data)

      if @results['rows'].size == 0 && page > 1
        redirect("/recipes/search?q=#{@query}")
        return
      end

      haml :search
    end
With that, my Cucumber scenarios are again passing and I am ready to proceed with the view / helper work.
(commit)
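As an aside, the empty-sort guard in the route above can be pulled out into a small function that is easy to poke at in irb. This is only a sketch: `search_url` and the database URL are made-up names for illustration, not code from the application.

```ruby
# Hypothetical helper (not in the app): builds the couchdb-lucene URL and
# appends the sort parameter only when it contains a word character.
def search_url(db, query, skip, sort = nil)
  url = "#{db}/_fti?limit=20&q=#{query}&skip=#{skip}"
  url += "&sort=#{sort}" if sort.to_s =~ /\w/
  url
end

# An empty sort parameter is dropped entirely
puts search_url("http://localhost:5984/eee", "title:egg", 0, "")

# A real sort field is passed through
puts search_url("http://localhost:5984/eee", "title:egg", 0, "sort_title")
```

Keeping the guard in one place like this would also make it simple to tighten later, say to a whitelist of indexed sort fields.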

Shortly after starting work in the Haml template, the sort field gets unwieldy, which is a good indication that it ought to be a helper. I opt for the name of sort_link for the helper and build the following examples to describe how it should work:
    describe "sort_link" do
      it "should link the supplied text" do
        sort_link("Foo", "sort_foo", "query").
          should have_selector("a",
                               :content => "Foo")
      end
      it "should link to the query with the supplied sort field" do
        sort_link("Foo", "sort_foo", "query").
          should have_selector("a",
                               :href => "/recipes/search?q=query&sort=sort_foo")
      end
    end
I implement this code as:
    def sort_link(text, sort_on, query)
      id = "sort-by-#{text.downcase}"
      url = "/recipes/search?q=#{query}&sort=#{sort_on}"
      %Q|<a href="#{url}" id="#{id}">#{text}</a>|
    end
There is no example for the link's id. That is semantic information, having nothing to do with the behavior of the application. The only reason to include it is for styling and, more importantly, the Cucumber scenario.

Speaking of the Cucumber scenario, I am now ready to implement the next step, Then the results should be ordered by name in ascending order, which is aided by some CSS selector fanciness:
    Then /^the results should be ordered by name in ascending order$/ do
      response.should have_selector("tr:nth-child(2) a",
                                    :content => "delicious recipe 1")
      response.should have_selector("tr:nth-child(3) a",
                                    :content => "delicious recipe 10")
    end
The first child of the results table is the header, which is the reason the first selector is looking for the second child. The reason for the second test is that I want to ensure that sorting has taken place. The "delicious recipe 1" was the first recipe entered, so it may show up in the results list first for that reason alone. But "delicious recipe 10" will come before "delicious recipe 2" only if they have been sorted (because the "1" in "10" comes before "2" when performing text sorting).
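The text-sorting behavior that the second expectation relies on is easy to confirm in plain Ruby. This is just an illustration using three of the scenario's titles, not part of the step definitions:

```ruby
# Strings compare character by character, so "10" sorts before "2"
titles = ["delicious recipe 2", "delicious recipe 10", "delicious recipe 1"]
sorted = titles.sort

puts sorted
# delicious recipe 1
# delicious recipe 10
# delicious recipe 2
```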
(commit)

Up next: reversing the sort order.

Wednesday, April 29, 2009

Couchdb-lucene Sorting

To sort on a field in couchdb-lucene (or Lucene proper for that matter), the field cannot be analyzed / tokenized. If you have a recipe with a title of "Chocolate Chip Pancakes", the title field will be indexed with three tokens: "chocolate", "chip" and "pancakes". That way each term can be found in the index and readily associated back to the original recipe / document.

Inverted indexes work well for searching, but not so well for sorting. Which token would be used for sorting? "Chocolate" because it was the first term? "Chip" because it comes first alphabetically? Lucene handles this by simply refusing to sort on such fields. It may not sort this way, but Lucene does support sorting.

It does so by storing fields not-analyzed (without tokenizing). Couchdb-lucene supports this feature via a "not_analyzed" argument to the Document's field method. To get this working with the title and date fields, I need to add this to the lucene design document:
    ret.field('sort_title', doc['title'], 'yes', 'not_analyzed');
    ret.field('sort_date',  doc['date'],  'yes', 'not_analyzed');
Re-indexing (which I do by removing the lucene directory from the couchdb directory), and then searching my development database, I get results back with a "sort_order" attribute:
cstrom@jaynestown:~/repos/couchdb-lucene$ curl http://localhost:5984/eee/_fti?q=ingredient:salt\&sort=sort_date
{"q":"+_db:eee +ingredient:salt",
"etag":"120f4ad1fc3",
"skip":0,
"limit":25,
"total_rows":7,
"search_duration":0,
"fetch_duration":14,
"sort_order":[{"field":"sort_date","reverse":false,"type":"string"},
{"reverse":false,"type":"doc"}],
"rows":[{"_id":"2002-01-13-hollandaise_sauce",
...}
To reverse the sort order in couchdb-lucene, you need to prepend a back-slash to the field being sorted on (double back-slashes to prevent the shell from interpreting it):
cstrom@jaynestown:~/repos/couchdb-lucene$ curl http://localhost:5984/eee/_fti?q=ingredient:salt\&sort=\\sort_date
{"q":"+_db:eee +ingredient:salt",
"etag":"120f4ad1fc3",
"skip":0,
"limit":25,
"total_rows":7,
"search_duration":0,
"fetch_duration":14,
"sort_order":[{"field":"sort_date","reverse":true,"type":"string"},
{"reverse":false,"type":"doc"}],
"rows":[{"_id":"2008-07-21-spinach",
...}
What this will mean for my Sinatra app is that I need to pass sort parameters to couchdb-lucene, but do not need to store them as instance variables. The view and helper can pull the sort information directly from the result set.
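To make that concrete, here is a rough sketch of reading the sort information back out of a result set. The JSON is a trimmed-down copy of the curl output above, and the variable names are only for illustration:

```ruby
require 'json'

# Trimmed-down response, shaped like the couchdb-lucene output shown above
data = '{"skip":0,"limit":25,"total_rows":7,' +
       '"sort_order":[{"field":"sort_date","reverse":true,"type":"string"},' +
       '{"reverse":false,"type":"doc"}],"rows":[]}'

results = JSON.parse(data)

# The first entry describes the field requested via the sort parameter
primary = results['sort_order'].first

puts primary['field']    # sort_date
puts primary['reverse']  # true
```

A helper could use the field and reverse values to decide whether a column header link should request ascending or descending order next.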

I will work on that tomorrow. For today, I update the lucene design document (as described above) and implement this scenario step:
    When /^I click the "([^\"]*)" column header$/ do |link|
      click_link("sort-by-#{link.downcase()}")
    end
by adding an ID attribute to the search results column heading in the Haml template:
    %th
      %a{:href => "/recipes/search?q=foo&sort=name", :id => "sort-by-name"} Name

(commit)

Tuesday, April 28, 2009

Sorting

Up first in the sorting scenario is the creation of sortable, dummy recipes:
    Given /^(\d+) "([^\"]*)" recipes with ascending names, dates, preparation times, and number of ingredients$/ do |count, keyword|
      date = Date.new(2008, 4, 28)

      (1..count.to_i).each do |i|
        permalink = "id-#{i}-#{keyword.gsub(/\W/, '-')}"

        recipe = {
          :title => "#{keyword} recipe #{i}",
          :date  => date + i,
          :prep_time => i,
          :preparations =>
            # recipe i gets i ingredients, one more than its predecessor
            (1..i).
            map {|j| { :ingredient => { :name => "ingredient #{j}"}} }
        }

        RestClient.put "#{@@db}/#{permalink}",
          recipe.to_json,
          :content_type => 'application/json'
      end
    end
There are a few differences between this and the creation of the dummy records for pagination. The date for each recipe exploits date arithmetic by adding a day to each subsequent recipe. Similarly, each recipe has 1 more ingredient than its predecessor.
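For anyone unfamiliar with Ruby's date arithmetic, a quick illustration of what adding an integer to a Date does. The variable names here are only for the example:

```ruby
require 'date'

start = Date.new(2008, 4, 28)

# Adding an integer to a Date advances it by that many days,
# rolling over month boundaries as needed
dates = (1..3).map { |i| (start + i).to_s }
p dates  # ["2008-04-29", "2008-04-30", "2008-05-01"]

# Ascending ingredient counts: recipe i has i ingredients
counts = (1..3).map { |i| (1..i).to_a.size }
p counts  # [1, 2, 3]
```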

With that, there are 4 steps passing (3 were defined by other scenarios) and 6 skipped (also defined by other scenarios, but not reached). This leaves 13 that need to be defined.



Before wading into the code, I think I will implement the next undefined step (clicking the search results header for sorting). Doing so will get my mind right for implementing examples:
    When /^I click the "([^\"]*)" column header$/ do |link|
      click_link("sort-by-#{link.downcase()}")
    end
When clicking the "Name" column heading, for example, I expect to find it in an a tag with an id of "sort-by-name".

Unfortunately, when I start work on the view, I run into a problem implementing this example:
    it "should link to sort by date" do
      assigns[:query] = "foo"
      render("/views/search.haml")
      response.should have_selector("th > a",
                                    :href => "/recipes/search?q=foo&sort=name",
                                    :content => "name")
    end
The error that occurs is:
cstrom@jaynestown:~/repos/eee-code$ spec ./spec/views/search.haml_spec.rb
.....F

1)
'search.haml should link to sort by date' FAILED
expected following output to contain a <th > a href='/recipes/search?q=foo&sort=name'>name</th > a> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<table>
<tr>
<th>
<a href="/recipes/search?q=foo&amp;sort=name">Name</a>
</th>
...
Initially, I think the problem is that Haml is escaping the ampersand that is separating the query parameters. I am not expecting it to do so.

I spent a lot of time Googling to see how to prevent Haml from escaping the href attribute. Ultimately I was unsuccessful in that pursuit. What I did find was that it is apparently OK to escape ampersands in the query parameters of an href. I knew about the semi-colon separators from my XML/XSL days, but being able to use &amp; is news to me. Something for #standup tomorrow!
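The escaping in question can be reproduced with Ruby's standard library. This is only an illustration of the HTML escaping itself, not of how Haml does it internally:

```ruby
require 'cgi'

href = "/recipes/search?q=foo&sort=name"

# In an HTML attribute the ampersand is written as &amp;
escaped = CGI.escapeHTML(href)
puts escaped  # /recipes/search?q=foo&amp;sort=name

# Unescaping, as an HTML parser does when reading the attribute,
# gives back the original URL
puts CGI.unescapeHTML(escaped) == href  # true
```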

Ultimately, this was not the problem. My mistake was that I was expecting lowercase "name" in the a tag, but am generating an upper case "Name". Once fixed, all specs pass for tonight.

Really could have used a pair to save me from wasting time tonight. Ah well, there's always tomorrow.
(commit)

Monday, April 27, 2009

Sorting Stories

With the pagination work behind me, I need to get sorting working. Sorting will default to ascending order, except for dates (most recent date should be first). When I am on a page after the first and sort by a different field, pagination should start back on page 1. When I am on a page after the first and reverse the sort order, I should also be taken back to page 1. The scenario that describes this is:
    Scenario: Sorting (name, date, preparation time, number of ingredients)

Given 50 "delicious" recipes with ascending names, dates, preparation times, and number of ingredients
And a 0.5 second wait to allow the search index to be updated
When I search for "delicious"
Then I should see 20 results
When I click the "Name" column header
Then the results should be ordered by name in ascending order
When I click the "Name" column header
Then the results should be ordered by name in descending order
When I click the next page
Then I should see page 2
And the results should be ordered by name in descending order
When I click the "Date" column header
Then I should see page 1
And the results should be ordered by date in descending order
When I click the next page
Then I should see page 2
When I click the "Date" column header
Then the results should be ordered by date in ascending order
And I should see page 1
When I click the "Prep" column header
Then the results should be ordered by preparation time in ascending order
When I click the "Ingredients" column header
Then the results should be ordered by the number of ingredients in ascending order
The final two scenarios in the search feature are boundary conditions: no matching results and invalid search parameters. The fully described scenarios:
    Scenario: No matching results

Given 5 "Yummy" recipes
And a 0.5 second wait to allow the search index to be updated
When I search for "delicious"
Then I should see no results
And no result headings


Scenario: Invalid search parameters

Given 5 "Yummy" recipes
And a 0.5 second wait to allow the search index to be updated
When I search for ""
Then I should see no results
When I search for a pre-ascii character "%1F"
Then I should see no results
And an empty query string
When I search for an invalid lucene search term like "title:ingredient:egg"
Then I should see no results
And an empty query string
(commit)

Whoops

While verifying the Cucumber format of the new scenarios, I notice that I have broken my first search scenario (note to self, you're not done with new scenarios if old ones are broken):
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature \
> -n -s "Matching a word in the ingredient list in full recipe search"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes

Scenario: Matching a word in the ingredient list in full recipe search
Given a "pancake" recipe with "chocolate chips" in it
And a "french toast" recipe with "eggs" in it
And a 0.5 second wait to allow the search index to be updated
When I search for "chocolate"
Then I should see the "pancake" recipe in the search results
expected following output to contain a <a href='/recipes/id-pancake'>pancake</a> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<table><tr>
<th>Name</th>
<th>Date</th>
</tr></table>
<div class="pagination">
<span class="inactive">« Previous</span><a href="/recipes/search?q=chocolate&page=2">Next »</a>
</div>
</body></html>
(Spec::Expectations::ExpectationNotMetError)
features/recipe_search.feature:13:in `Then I should see the "pancake" recipe in the search results'
And I should not see the "french toast" recipe in the search results

1 scenario
1 failed step
1 skipped step
4 passed steps
It turns out that this failure has uncovered a bug that I have introduced in the code. Specifically, I am not including ingredients in the default, 'all' field of the couchdb-lucene index. To resolve, I need to add it to the couchdb-lucene _design/lucene transform design document:
    ret.field('ingredient', ingredients.join(', '), 'yes');
    ret.field('all', ingredients.join(', '));
With that, I am back in the green:



(commit)

Sunday, April 26, 2009

Pagination, Page 5

The next few steps in the search pagination scenario ought to be easy to implement—I believe they are already functional:



I already have the "should see X number of results" implemented (shown in blue). While working inside the actual code, I am reasonably sure that I got the next / previous buttons working as well. Still it is best to make sure...

The next and previous links are rendered as spans when it is not possible for the user to follow them. Therefore, I verify that they are span tags:
    Then /^I should not be able to go to a (.+) page$/ do |link_text|
      response.should have_selector(".pagination span",
                                    :content => link_text.capitalize())
    end
Similarly, I implement the "click the next / previous page" steps as:
    When /^I click the (.+) page$/ do |link_text|
      click_link(link_text)
    end
This is very similar to the step that describes clicking on the numbered pages (e.g. when I click page 3):
    When /^I click page (\d+)$/ do |page|
      click_link(page)
    end
I do not mind duplication in specs / tests. I much prefer it to awkwardly written text.

With that, I have everything but the boundary conditions passing:



Before working through the internals, I can implement the remaining steps:
    When /^I visit page "?(.+)"?$/ do |page|
      visit(@query + "&page=#{page}")
    end

    Then /^I should see page (.+)$/ do |page|
      response.should have_selector(".pagination span.current",
                                    :content => page)
    end
When I run this I get a big old failure, which is to be expected since I am testing boundary conditions:



To handle boundary conditions, I exploit that "foo".to_i, nil.to_i, and "-1".to_i all evaluate to an integer. The example that I use to describe the desired behavior:
    it "should display page 1 when passing a bad page number" do
      RestClient.should_receive(:get).
        with(/skip=0/).
        and_return('{"total_rows":30,"skip":0,"limit":20,"rows":[]}')

      get "/recipes/search?q=title:eggs&page=foo"
    end
I implement this by refactoring the page and skip calculation in /recipes/search:
    get '/recipes/search' do
      page = params[:page].to_i
      skip = (page < 2) ? 0 : ((page - 1) * 20) + 1
      data = RestClient.get "#{@@db}/_fti?limit=20&skip=#{skip}&q=#{params[:q]}"
      @results = JSON.parse(data)
      @query = params[:q]

      if @results['rows'].size == 0 && page > 1
        redirect("/recipes/search?q=#{@query}")
        return
      end

      haml :search
    end
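The to_i behavior that the refactoring leans on can be checked directly. In this sketch `first_page?` is a hypothetical helper of my own naming, standing in for the `page < 2` test in the route:

```ruby
# What to_i returns for the kinds of values that can arrive in params[:page]
pages = ["foo", nil, "-1", "3"].map { |raw| raw.to_i }
p pages  # [0, 0, -1, 3]

# Hypothetical helper: anything that converts to less than 2 is page 1
def first_page?(raw)
  raw.to_i < 2
end

p first_page?("foo")  # true
p first_page?(nil)    # true
p first_page?("-1")   # true
p first_page?("3")    # false
```

Because every bad value collapses to something below 2, a single comparison covers missing, non-numeric, and negative page numbers.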
One final thing that I need to do is mark the current page (so that the scenario knows that it is staying on page 1 when moving past boundaries). In the helper spec, the example that describes this behavior is:
    it "should mark the current page" do
      pagination(@query, @results).
        should have_selector("span.current", :content => "1")
    end
The code block inside the pagination helper that implements this is:
    links << (1..last_page).map do |page|
      if page == current_page
        %Q|<span class="current">#{page}</span>|
      else
        %Q|<a href="#{link}&page=#{page}">#{page}</a>|
      end
    end
Back out in the feature, I am now greeted with this:



Yay! The whole scenario done!
(commit)

Next up, sorting and some additional boundary conditions.

Saturday, April 25, 2009

Pagination, Page 4

I find myself struggling quite a bit of late against Sinatra. Perhaps I am trying to do more than it is really meant to do. Perhaps I am foisting my Rails mindset too much upon it. Probably a little bit of both, but I begin to suspect it is more of the latter.

I continue to see strict MVC where it may not apply. I also try to perform unit tests in isolation rather than taking a more holistic approach. So I take a step back and refactor some of my work from yesterday.

So instead of a pagination method signature like this:
pagination(query, skip, limit, total)
I stick with the results that come directly from the couchdb-lucene JSON results:
pagination(query, results)
With that done (and all the specs updated accordingly), I move from the helper back into the Sinatra app itself. I specify that pagination options passed to the Sinatra application should drive couchdb-lucene:
    it "should paginate" do
      RestClient.should_receive(:get).
        with(/skip=21/).
        and_return('{"total_rows":30,"skip":0,"limit":20,"rows":[]}')

      get "/recipes/search?q=title:eggs&page=2"
    end
To get that passing without breaking any of the previous examples, I calculate the skip offset thusly:
    get '/recipes/search' do
      skip = (((params[:page] ? params[:page].to_i : 1) - 1) * 20) + 1
      data = RestClient.get "#{@@db}/_fti?limit=20&skip=#{skip}&q=#{params[:q]}"
      @results = JSON.parse(data)

      @query = params[:q]

      haml :search
    end
With that, I have all of my examples passing:
cstrom@jaynestown:~/repos/eee-code$ rake
(in /home/cstrom/repos/eee-code)
.......

Finished in 0.095031 seconds

7 examples, 0 failures
................

Finished in 0.014986 seconds

16 examples, 0 failures
...............................

Finished in 0.178643 seconds

31 examples, 0 failures
So it's back out to the feature. I wrote the step implementation last night describing what should happen on the last page of the results. That is what clued me in to the fact that the Sinatra app was not driving couchdb-lucene pagination. Tonight, I can use that same implementation to verify that this is now working:

The next steps in that scenario ought to all be working, only in need of step definitions to verify. Then it is on to a few boundary conditions.

Tomorrow.

Friday, April 24, 2009

Pagination, Page 3

Continuing pagination work, I need to get previous and next links working. I also never actually put links in the a tags, so I will get that working as well.

To get the href working, I add an href expectation to the first pagination expectation from last night:
    it "should have a link to other pages" do
      pagination('foo', 0, 20, 41).
        should have_selector("a",
                             :content => "2",
                             :href => "/recipes/search?q=foo&page=2")
    end
To drive development of the next / previous link, I write the following examples:
    it "should have a link to the next page if before the last page" do
      pagination('foo', 20, 20, 41).
        should have_selector("a", :content => "Next »")
    end
    it "should not have a link to the next page if on the last page" do
      pagination('foo', 40, 20, 41).
        should have_selector("span", :content => "Next »")
    end
    it "should have a link to the previous page if past the first page" do
      pagination('foo', 20, 20, 41).
        should have_selector("a", :content => "« Previous")
    end
    it "should not have a link to the previous page if on the first page" do
      pagination('foo', 0, 20, 41).
        should have_selector("span", :content => "« Previous")
    end
Working through each of these examples, I end up with the following, longish implementation:
    def pagination(query, skip, limit, total)
      last_page = (total + limit - 1) / limit
      current_page = skip / limit + 1

      link = "/recipes/search?q=#{query}"

      links = []

      links <<
        if current_page == 1
          %Q|<span class="inactive">« Previous</span>|
        else
          %Q|<a href="#{link}&page=#{current_page - 1}">« Previous</a>|
        end

      links << (1..last_page).map do |page|
        %Q|<a href="#{link}&page=#{page}">#{page}</a>|
      end

      links <<
        if current_page == last_page
          %Q|<span class="inactive">Next »</span>|
        else
          %Q|<a href="#{link}&page=#{current_page + 1}">Next »</a>|
        end

      %Q|<div class="pagination">#{links.join}</div>|
    end
I can DRY that up some, especially the conditionals around the previous / next links. For now, I will get it done and worry about doing it right another day. At least the repetition is small and all contained within a single method.
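For the record, one possible way to DRY up the two conditionals would be a tiny helper along these lines. This is a sketch of an idea, not what is in the code today, and `nav_link` is a name I am making up here:

```ruby
# Hypothetical helper: renders a real link when the target page exists,
# otherwise an inactive span with the same text
def nav_link(text, url, active)
  if active
    %Q|<a href="#{url}">#{text}</a>|
  else
    %Q|<span class="inactive">#{text}</span>|
  end
end

link = "/recipes/search?q=foo"
current_page = 1
last_page = 3

puts nav_link("Previous", "#{link}&page=#{current_page - 1}", current_page > 1)
puts nav_link("Next", "#{link}&page=#{current_page + 1}", current_page < last_page)
```

The previous and next cases then differ only in the text, the target page, and the condition passed in.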

The addition of the query parameter to the pagination helper requires a bunch of clean-up in both specs and code. Once that is done, it is back on out to the cucumber feature:
    Scenario: Paginating results

Given 50 yummy recipes
And a 0.5 second wait to allow the search index to be updated
When I search for "yummy"
Then I should see 20 results
And I should see 3 pages of results
And I should not be able to go to a previous page
When I click page 3
Then I should see 10 results
And I should not be able to go to a next page
When I click the previous page
Then I should see 20 results
And I should be able to go to a previous page
When I click the next page
Then I should see 10 results
When I visit page -1
Then I should see page 1
When I visit page "foo"
Then I should see page 1
When I visit page 4
Then I should see page 1
To verify that I should not be able to go to a previous page, I use the following:
    Then /^I should not be able to go to a previous page$/ do
      response.should have_selector(".pagination span", :content => "« Previous")
    end
Running the scenario, I find my final failure for today:
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature -n -s "Paginating results"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes

Scenario: Paginating results
Given 50 yummy recipes
And a 0.5 second wait to allow the search index to be updated
When I search for "yummy"
Then I should see 20 results
And I should see 3 pages of results
And I should not be able to go to a previous page
When I click page 3
Then I should see 10 results
expected following output to contain a <table a/> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<table>
<tr>
<th>Name</th>
<th>Date</th>
</tr>
<tr class="row0">
<td>
<a href="/recipes/id-0-yummy">yummy recipe 0</a>
</td>
<td>2009-04-22</td>
</tr>
<tr class="row1">
<td>
<a href="/recipes/id-1-yummy">yummy recipe 1</a>
</td>
<td>2009-04-22</td>
</tr>
<tr class="row0">
...
Still on recipe 0? Oops, I have yet to connect the page parameters to the RestClient calls to couchdb-lucene.

I do not mind stopping with a failing test. Quite the opposite, I know exactly where to start tomorrow.

Thursday, April 23, 2009

Pagination, Page 2

Continuing work on pagination, I need to get page links working. The place to do this is in a pagination helper. I will stick close to the couchdb-lucene API by passing in three arguments to the pagination helper: the record offset, the page size (called limit by couchdb-lucene), and the total number of records.

My first two examples:
    describe "pagination" do
      it "should have a link to other pages" do
        pagination(0, 20, 41).
          should have_selector("a", :content => "2")
      end
      it "should have 3 pages, when results.size > 2 * page size" do
        pagination(0, 20, 41).
          should have_selector("a", :content => "3")
      end
    end
Writing just enough code to implement these two examples, I end up with this:
    def pagination(skip, limit, total)
      total_pages = (total + limit) / limit

      links = (1..total_pages).map do |page|
        %Q|<a href="">#{page}</a>|
      end

      %Q|<div class="pagination">#{links.join}</div>|
    end
The next example I write explores the boundary condition of the total number of results being exactly divisible by the page size:
    it "should have only 2 pages, when results.size == 2 * page size" do
      pagination(0, 20, 40).
        should_not have_selector("a", :content => "3")
    end
When I run it, I get a failure indicating that I don't quite have my boundaries set correctly:
cstrom@jaynestown:~/repos/eee-code$ spec ./spec/eee_helpers_spec.rb
...........F

1)
'pagination should have only 2 pages, when results.size == 2 * page size' FAILED
expected following output to omit a <a>3</a>:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div class="pagination">
<a href="">1</a><a href="">2</a><a href="">3</a>
</div></body></html>
./spec/eee_helpers_spec.rb:82:

Finished in 0.014629 seconds

12 examples, 1 failure
Ah, a fence post is being counted. It can be removed by subtracting 1:
    def pagination(skip, limit, total)
      total_pages = (total + limit - 1) / limit

      links = (1..total_pages).map do |page|
        %Q|<a href="">#{page}</a>|
      end

      %Q|<div class="pagination">#{links.join}</div>|
    end
I notice that I have no links in those hrefs. That will have to wait until tomorrow though.
(commit)
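The page-count arithmetic is worth a quick check around the boundary. In this sketch `total_pages` wraps the corrected formula, and `naive_pages` is a hypothetical version without the subtraction, included only for comparison:

```ruby
# Integer ceiling division: rounds up without using floats
def total_pages(total, limit)
  (total + limit - 1) / limit
end

# Hypothetical variant without the "- 1", shown only for comparison
def naive_pages(total, limit)
  (total + limit) / limit
end

[39, 40, 41].each do |total|
  puts "#{total} results: #{total_pages(total, 20)} pages " +
       "(naive: #{naive_pages(total, 20)})"
end
# 39 results: 2 pages (naive: 2)
# 40 results: 2 pages (naive: 3)
# 41 results: 3 pages (naive: 3)
```

The two only disagree when the total is an exact multiple of the page size, which is exactly the case the new example pins down.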

Wednesday, April 22, 2009

Pagination, Page 1

I still have several stray scenarios in need of some Given/When/Thens. First to receive the full Cucumber treatment is "Paginating results", which I give form as:
    Scenario: Paginating results

Given 50 yummy recipes
And a 0.5 second wait to allow the search index to be updated
When I search for "yummy"
Then I should see 20 results
And 3 pages of results
And I should not be able to go to a previous page
When I visit page 3
Then I should see 10 results
And I should not be able to go to a next page
When I visit the previous page
Then I should see 20 results
And I should be able to go to a previous page
When I visit the next page
Then I should see 10 results
When I visit page -1
Then I should see page 1
When I visit page "foo"
Then I should see page 1
When I visit page 4
Then I should see page 1
(commit)

That is a pretty complete description of pagination, including navigating via the next / previous link, the page number and even some exploration of boundary conditions.

Implementing the Given 50 yummy recipes step is a breeze using Ruby's range operator:
    Given /^(\d+) (.+) recipes$/ do |count, keyword|
      date = Date.new(2009, 4, 22)

      (0..count.to_i).each do |i|
        permalink = "id-#{i}-#{keyword.gsub(/\W/, '-')}"

        @pancake_recipe = {
          :title => "#{keyword} recipe #{i}",
          :date  => date
        }

        RestClient.put "#{@@db}/#{permalink}",
          @pancake_recipe.to_json,
          :content_type => 'application/json'
      end
    end
That step will create recipes numbered zero through count (inclusive, since the two-dot range includes its endpoint) with titles containing the keyword specified in the Cucumber text. Since each document contains that text, searching for that keyword will return all of them—perfect for pagination testing.
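Since the exact number of records matters for pagination, here is a quick reminder of how Ruby's two range forms differ. It is purely illustrative and uses a small count:

```ruby
count = 3

# Two dots: the endpoint is included
inclusive = (0..count).to_a
p inclusive  # [0, 1, 2, 3]

# Three dots: the endpoint is excluded
exclusive = (0...count).to_a
p exclusive  # [0, 1, 2]

puts inclusive.size  # 4
puts exclusive.size  # 3
```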

Running the scenario should now get through the first 3 steps, but...
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature -n \
-s "Paginating results"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes

Scenario: Paginating results
Given 50 yummy recipes
And a 0.5 second wait to allow the search index to be updated
When I search for "yummy"
No such file or directory - /home/cstrom/.gem/ruby/1.8/gems/polyglot-0.2.5/lib/views/search.haml (Errno::ENOENT)
./features/support/../../eee.rb:23:in `GET /recipes/search'
(eval):7:in `get'
features/recipe_search.feature:56:in `When I search for "yummy"'
...
This is what I get for upgrading gems. Fortunately I remember which gem I upgraded today: webrat. That makes it relatively easy to track down this thread and the fix, adding this to features/support/env.rb:
# Force the application name because polyglot breaks the auto-detection logic.
Sinatra::Application.app_file = File.join(File.dirname(__FILE__), *%w[.. .. eee.rb])
With the first three steps passing, next up is the Then I should see 20 results step, which can be defined as:
    Then /^I should see (\d+) results$/ do |count|
      response.should have_selector("table a", :count => count.to_i)
    end
When run, this step fails with the less than helpful error message:
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature -n \
-s "Paginating results"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes

Scenario: Paginating results
Given 50 yummy recipes
And a 0.5 second wait to allow the search index to be updated
When I search for "yummy"
Then I should see 25 results
expected following output to contain a <table a/> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><table>
<tr>
<th>Name</th>
<th>Date</th>
</tr>
<tr class="row0">
<td>
<a href="/recipes/id-0-yummy">yummy recipe 0</a>
</td>
<td>2009-04-22</td>
</tr>
...
Clearly, the HTML does contain "a" tags inside a "table". The expectation string representation is the first argument of have_selector ("table a"), but the actual expectation in this case is more specific—a specific number of those selectors are expected. It is that specific case that is not being met here.

To rectify, I need to make a code change. This means that it is time to work my way inside. The couchdb-lucene add-on specifies the page size with the limit query option. The change that I want is to ensure that the couchdb-lucene query string matches limit=20:
    it "should have pages sizes of 20 records" do
      RestClient.should_receive(:get).
        with(/limit=20/).
        and_return('{"total_rows":1,"rows":[]}')

      get "/recipes/search?q=title:eggs"
    end
Writing the code that implements this example is simple enough—add limit=20 to the RestClient query of couchdb-lucene:
    get '/recipes/search' do
      data = RestClient.get "#{@@db}/_fti?limit=20&q=#{params[:q]}"
      @results = JSON.parse(data)

      haml :search
    end
That breaks some older examples that were a bit too specific. Loosening their expectations (e.g. matching "q=title:eggs" instead of the exact string "#{@@db}/_fti?q=eggs") resolves those failures.

This also fixes the cucumber failure:
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature -n \
-s "Paginating results"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes

Scenario: Paginating results
Given 50 yummy recipes
And a 0.5 second wait to allow the search index to be updated
When I search for "yummy"
Then I should see 20 results
And 3 pages of results
And I should not be able to go to a previous page
When I click page 3
Then I should see 10 results
And I should not be able to go to a next page
When I click the previous page
Then I should see 20 results
And I should be able to go to a previous page
When I click the next page
Then I should see 10 results
When I visit page -1
Then I should see page 1
When I visit page "foo"
Then I should see page 1
When I visit page 4
Then I should see page 1

1 scenario
3 skipped steps
13 undefined steps
4 passed steps
(commit)

Four steps down, 13 to go. That is as good a stopping point as any for today. I will pick things up with the next pending spec tomorrow.
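The remaining paging steps in the scenario above (page -1, page "foo", and page 4 all falling back to page 1) hint at the kind of parameter handling that will be needed. A minimal sketch, assuming couchdb-lucene's skip option is used alongside limit; the names PAGE_SIZE, normalize_page and fti_query are hypothetical and are not part of the application yet:

```ruby
PAGE_SIZE = 20 # same value as the limit=20 sent to couchdb-lucene

# Hypothetical helper: turn a raw "page" parameter into a valid 1-based page.
# Anything that is not a plain positive integer within range falls back to 1.
def normalize_page(raw, total_rows)
  page = raw.to_s =~ /\A\d+\z/ ? raw.to_i : 1
  last_page = [(total_rows.to_f / PAGE_SIZE).ceil, 1].max
  (1..last_page).include?(page) ? page : 1
end

# Hypothetical helper: build the couchdb-lucene query string for a page.
def fti_query(q, page)
  skip = (page - 1) * PAGE_SIZE
  "_fti?limit=#{PAGE_SIZE}&skip=#{skip}&q=#{q}"
end
```

With 50 matching recipes, normalize_page("3", 50) stays at 3 (the 10 leftover results), while "-1", "foo" and "4" all collapse to page 1, matching the scenario text.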

Tuesday, April 21, 2009

Searching Deep Data Structures

‹prev | My Chain | next›

On tap for tonight, the "Searching Ingredients" scenario:
    Scenario: Searching ingredients

Given a "pancake" recipe with "chocolate chips" in it
And a "french toast" recipe with "eggs" in it and a summary of "does not go well with chocolate"
And a 0.5 second wait to allow the search index to be updated
When I search ingredients for "chocolate"
Then I should see the "pancake" recipe in the search results
And I should not see the "french toast" recipe in the search results
This is another fielded search like yesterday's Searching titles, so it ought to be easy, right?

Turns out not so much.

The cause of the trouble can be seen in the (new) step definition that implements Given a "french toast" recipe with "eggs" in it and a summary of "does not go well with chocolate":
Given /^a "(.+)" recipe with "(.+)" in it and a summary of "(.+)"$/ do |title, ingredient, summary|
  date = Date.new(2009, 4, 21)
  permalink = "id-#{title.gsub(/\W/, '-')}"

  @pancake_recipe = {
    :title => title,
    :date => date,
    :summary => summary,
    :preparations => [
      {
        'ingredient' => {
          'name' => ingredient
        }
      }
    ]
  }

  RestClient.put "#{@@db}/#{permalink}",
    @pancake_recipe.to_json,
    :content_type => 'application/json'
end
The ingredient ("eggs" as set in the scenario text) is buried inside a single "preparation". The current indexing algorithm would index the ingredient name not in the 'ingredient' field, but in the 'name' field (because it is indexed by key). Just to double-check, I implement the When I search ingredients for "chocolate" step as:
When /^I search ingredients for "(.+)"$/ do |keyword|
  visit("/recipes/search?q=ingredient:#{keyword}")
end
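To see why a key-based indexer would file the ingredient under 'name', here is a rough Ruby sketch of that kind of traversal. It is only an assumption about the general approach (the real indexer is JavaScript running inside couchdb-lucene), and leaf_fields is a hypothetical name:

```ruby
# Walk a nested document and record each scalar value under the key
# that immediately contains it, ignoring any parent keys.
def leaf_fields(obj, fields = Hash.new { |h, k| h[k] = [] }, key = nil)
  case obj
  when Hash  then obj.each { |k, v| leaf_fields(v, fields, k.to_s) }
  when Array then obj.each { |v| leaf_fields(v, fields, key) }
  else fields[key] << obj.to_s
  end
  fields
end

doc = {
  'title' => 'french toast',
  'preparations' => [{ 'ingredient' => { 'name' => 'eggs' } }]
}
fields = leaf_fields(doc)
# "eggs" is recorded under 'name'; no 'ingredient' field exists at all.
```

Because only the innermost key survives, a search on ingredient:eggs has nothing to match against.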
Running cucumber fails, as expected:
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature -n \
-s "Searching ingredients"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Searching ingredients
Given a "pancake" recipe with "chocolate chips" in it
And a "french toast" recipe with "eggs" in it and a summary of "does not go well with chocolate"
And a 0.5 second wait to allow the search index to be updated
When I search ingredients for "chocolate"
Then I should see the "pancake" recipe in the search results
expected following output to contain a <a href='/recipes/id-pancake'>pancake</a> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><table><tr>
<th>Name</th>
<th>Date</th>
</tr></table></body></html> (Spec::Expectations::ExpectationNotMetError)
./features/step_definitions/recipe_search.rb:127:in `Then /^I should see the "(.+)" recipe in the search results$/'
features/recipe_search.feature:49:in `Then I should see the "pancake" recipe in the search results'
And I should not see the "french toast" recipe in the search results


1 scenario
4 steps passed
1 step failed
1 step skipped
I do not want to search for "name:eggs", I want to search for "ingredient:eggs", so I have some work to do with the couchdb-lucene indexer. Fortunately, it is not too much work. All that is needed is a special case to handle ingredients:
          /* Handle ingredients as a special case */
          if (key == 'preparations') {
            var ingredients = [];
            for (var i=0; i<obj[key].length; i++) {
              ingredients.push(obj[key][i]['ingredient']['name']);
            }
            ret.field('ingredient', ingredients.join(', '), 'yes');
          }
This block builds up the ingredients array with each of the ingredient names. It then takes that array, joins it with commas and indexes the result in the 'ingredient' field. For good measure it stores this value in the index for easy retrieval.
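For comparison, the same flattening can be sketched in Ruby against a parsed recipe document. This is an illustration of the logic only, not code from the application; ingredient_field is a hypothetical name, and unlike the JavaScript above it quietly skips preparations that lack an ingredient:

```ruby
# Collect every ingredient name from the recipe's preparations and join
# them into the single string that would be indexed as 'ingredient'.
def ingredient_field(recipe)
  preparations = recipe['preparations'] || []
  names = preparations.map do |prep|
    prep['ingredient'] && prep['ingredient']['name']
  end
  names.compact.join(', ')
end

recipe = {
  'title' => 'pancake',
  'preparations' => [
    { 'ingredient' => { 'name' => 'chocolate chips' } },
    { 'ingredient' => { 'name' => 'flour' } }
  ]
}
ingredient_field(recipe) # => "chocolate chips, flour"
```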

Stepping all the way back out to the cucumber test, everything now passes:
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature -n -s "Searching ingredients"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Searching ingredients
Given a "pancake" recipe with "chocolate chips" in it
And a "french toast" recipe with "eggs" in it and a summary of "does not go well with chocolate"
And a 0.5 second wait to allow the search index to be updated
When I search ingredients for "chocolate"
Then I should see the "pancake" recipe in the search results
And I should not see the "french toast" recipe in the search results


1 scenario
6 steps passed
(commit)

I have the feeling I am missing another field or two, but that does it for the defined search scenarios. There are four more search scenarios that need Given/When/Then text descriptions (and subsequent implementation) so I will get started on that tomorrow.

Monday, April 20, 2009

A Better Default Search Field

‹prev | My Chain | next›

When you search for documents containing the word "chocolate" with Google, you enter "chocolate" as the search term. When you use Google to find documents containing the word "chocolate" on a particular site, say http://eeecooks.com, you enter "site:eeecooks.com chocolate".

Because this is how Google works, this is how search works.

But this is not how the current search in eee-code works. To search for a recipe with "chocolate" in it and a title that contains "pancake", I currently have to query couchdb-lucene with a search of "title:pancake all:chocolate". Yesterday, I started down the path of trying to pre-process the search query. Today, I think better of it.

Lucene's QueryParser supports a default field argument in its constructor. If we supply "all" as the default field, which is possible in couchdb-lucene in src/main/java/com/github/rnewson/couchdb/lucene/Config.java:
    static final QueryParser QP = new QueryParser("all", ANALYZER);
Then the QueryParser interprets "title:pancake chocolate" to be identical to "title:pancake all:chocolate".
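The effect of a default field can be illustrated with a few lines of Ruby. This is only a toy that handles simple whitespace-separated terms; Lucene's QueryParser does far more (phrases, boolean operators, grouping), and with_default_field is a hypothetical name:

```ruby
# Illustration only: give every term that lacks an explicit field the
# default field, leaving already-fielded terms untouched.
def with_default_field(query, default_field = "all")
  query.split.map { |term|
    term.include?(":") ? term : "#{default_field}:#{term}"
  }.join(" ")
end

with_default_field("title:pancake chocolate")
# => "title:pancake all:chocolate"
```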

Just to be sure, give curl a try with the old standby of "wheatberries" (and "all:wheatberries"):
cstrom@jaynestown:~/repos/eee-code$ curl http://localhost:5984/eee/_fti?q=all:wheatberries
{"q":"+_db:eee +all:wheatberri",
"etag":"120c60536a7",
"skip":0,
"limit":25,
"total_rows":1,
"search_duration":1,
"fetch_duration":1,
"rows":[
{"_id":"2008-07-19-oatmeal",
"date":"2008/07/19",
"title":"Multi-grain Oatmeal",
"score":0.5710114240646362
}]
}
cstrom@jaynestown:~/repos/eee-code$ curl http://localhost:5984/eee/_fti?q=wheatberries
{"q":"+_db:eee +all:wheatberri",
"etag":"120c60536a7",
"skip":0,"limit":25,
"total_rows":1,
"search_duration":0,
"fetch_duration":1,
"rows":[
{"_id":"2008-07-19-oatmeal",
"date":"2008/07/19",
"title":"Multi-grain Oatmeal",
"score":0.5710114240646362
}]
}
Note that both queries are interpreted as "+_db:eee +all:wheatberri". Both use the "all" field to scope the search even though the second does not explicitly include it.

Also of note is that "wheatberri" is the Porter stem of "wheatberries" (this stemming was explicitly set a few days ago). The "_db" field is how couchdb-lucene works with multiple databases. All documents from all databases (e.g. the recipe documents in the development and test databases) are all stored in the same index. Couchdb-lucene automatically infers the db parameter from the database being queried ("eee" in the above examples). Using this parameter, couchdb-lucene only searches for documents in the current database, effectively limiting search even though the search index is not similarly limited.
(commit)

With that in place, I can back out the workaround from yesterday, leaving the search action much simpler:
get '/recipes/search' do
  data = RestClient.get "#{@@db}/_fti?q=#{params[:q]}"
  @results = JSON.parse(data)

  haml :search
end
Next up: searching ingredients and then on to paginating and sorting (which couchdb-lucene supports out of the box).
(commit)

Sunday, April 19, 2009

Field Search: Searching on titles

‹prev | My Chain | next›

The next scenario up is "Searching titles", which is described in Cucumber as:
    Scenario: Searching titles

Given a "pancake" recipe
And a "french toast" recipe with a "not a pancake" summary
And a 0.25 second wait to allow the search index to be updated
When I search titles for "pancake"
Then I should see the "pancake" recipe in the search results
And I should not see the "french toast" recipe in the search results
The Given a-recipe-with-summary step already has a step definition. The Given a-recipe-with-a-title step needs a definition:
Given /^a "(.+)" recipe$/ do |title|
  date = Date.new(2009, 4, 19)
  permalink = "id-#{title.gsub(/\W/, '-')}"

  recipe = {
    :title => title,
    :date => date,
  }

  RestClient.put "#{@@db}/#{permalink}",
    recipe.to_json,
    :content_type => 'application/json'
end
The next step is When I search titles for "pancake", which can be defined as:
When /^I search titles for "(.+)"$/ do |keyword|
  visit("/recipes/search?q=title:#{keyword}")
end
The only difference between this and the already defined When I search for "foo" is the addition of the title query parameter. Attempting to run this query, however, results in a brutal RestClient failure:
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature -n \
-s "Searching titles"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Searching titles
Given a "pancake" recipe
And a "french toast" recipe with a "not a pancake" summary
And a 0.25 second wait to allow the search index to be updated
When I search titles for "pancake"
HTTP status code 400 (RestClient::RequestFailed)
/home/cstrom/.gem/ruby/1.8/gems/rest-client-0.9.2/lib/restclient/request.rb:144:in `process_result'
/home/cstrom/.gem/ruby/1.8/gems/rest-client-0.9.2/lib/restclient/request.rb:106:in `transmit'
/usr/lib/ruby/1.8/net/http.rb:543:in `start'
/home/cstrom/.gem/ruby/1.8/gems/rest-client-0.9.2/lib/restclient/request.rb:103:in `transmit'
/home/cstrom/.gem/ruby/1.8/gems/rest-client-0.9.2/lib/restclient/request.rb:36:in `execute_inner'
/home/cstrom/.gem/ruby/1.8/gems/rest-client-0.9.2/lib/restclient/request.rb:28:in `execute'
/home/cstrom/.gem/ruby/1.8/gems/rest-client-0.9.2/lib/restclient/request.rb:12:in `execute'
/home/cstrom/.gem/ruby/1.8/gems/rest-client-0.9.2/lib/restclient.rb:57:in `get'
./features/support/../../eee.rb:20:in `GET /recipes/search'
/home/cstrom/.gem/ruby/1.8/gems/sinatra-0.9.1.1/lib/sinatra/base.rb:696:in `call'
...
(continues for quite a while)
RestClient errors warrant a peek in the CouchDB log, where I find:
[info] [<0.3573.3>] 127.0.0.1 - - 'GET' /eee-test/_fti?q=all:title:pancake 400
We are getting an HTTP 400 / Bad Request response because the search itself is invalid. Lucene does fielded searches by prepending the field name to the search term, separated by a colon. Similar to how Google does it (e.g. "site:eeecooks.com spinach"), a Lucene search for a recipe with the word "pancake" in the title would be written as "title:pancake". It makes no sense to smush two fields together as we have, "all:title:pancake". Hence the 400 response.

It is probably a good thing that an invalid search returns an invalid (400) HTTP response code as opposed to some other code. Still, I should investigate a bit more later, so I make a note for myself to do so in the form of a step-less scenario:
    Scenario: Invalid search parameters
Getting back to the current failing step, it is time to move inside the feature.

A second example for "/recipes/search" will describe the new, desired behavior:
    it "should not include the \"all\" field when performing fielded searches" do
      RestClient.should_receive(:get).
        with("#{@@db}/_fti?q=title:eggs").
        and_return('{"total_rows":1,"rows":[]}')

      get "/recipes/search?q=title:eggs"
    end
The original example is only slightly different, defaulting to the "all" field that we are using to index entire documents:
    it "should retrieve search results from couchdb-lucene" do
      RestClient.should_receive(:get).
        with("#{@@db}/_fti?q=all:eggs").
        and_return('{"total_rows":1,"rows":[]}')

      get "/recipes/search?q=eggs"
    end
The first time I run the spec, the new example fails:
cstrom@jaynestown:~/repos/eee-code$ spec ./spec/eee_spec.rb
....F

1)
Spec::Mocks::MockExpectationError in 'eee GET /recipes/search should not include the "all" field when performing fielded searches'
RestClient expected :get with ("http://localhost:5984/eee-test/_fti?q=title:eggs") but received it with ("http://localhost:5984/eee-test/_fti?q=all:title:eggs")
./eee.rb:20:in `GET /recipes/search'
/home/cstrom/.gem/ruby/1.8/gems/sinatra-0.9.1.1/lib/sinatra/base.rb:696:in `call'
...
The easiest way to fix the error is to remove the double fields:
get '/recipes/search' do
  query = "all:#{params[:q]}".sub(/(\w+):(\w+):/, "\\2:")
  data = RestClient.get "#{@@db}/_fti?q=#{query}"
  @results = JSON.parse(data)

  haml :search
end
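Pulled out of the Sinatra action, the substitution is easy to try in isolation. The wrapper name lucene_query is hypothetical; the expression inside it is the one from the action above:

```ruby
# Prefix the query with "all:", then collapse a doubled field such as
# "all:title:" down to just the second field ("title:").
def lucene_query(q)
  "all:#{q}".sub(/(\w+):(\w+):/, "\\2:")
end

lucene_query("eggs")        # => "all:eggs"
lucene_query("title:eggs")  # => "title:eggs"
```

An unfielded term keeps the "all" prefix because the pattern only matches when two field-like words with colons appear back to back.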
Now the specification passes:
cstrom@jaynestown:~/repos/eee-code$ spec ./spec/eee_spec.rb
.....

Finished in 0.079155 seconds

5 examples, 0 failures
With the inside, detailed specification passing, I try the outside specification and it works:
  So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Searching titles
Given a "pancake" recipe
And a "french toast" recipe with a "not a pancake" summary
And a 0.25 second wait to allow the search index to be updated
When I search titles for "pancake"
Then I should see the "pancake" recipe in the search results
And I should not see the "french toast" recipe in the search results


1 scenario
6 steps passed
I have some reservations about this particular simplest solution. The edge cases of parsing search queries are many. I will worry about that another day. Maybe even tomorrow.
(commit)

Saturday, April 18, 2009

One Step Back

‹prev | My Chain | next›

Looking for the next scenario on which to work, I run all scenarios:
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature -n
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Matching a word in the ingredient list in full recipe search
Given a "pancake" recipe with "chocolate chips" in it
And a "french toast" recipe with "eggs" in it
And a 1 second wait to allow the search index to be updated
When I search for "chocolate"
Then I should see the "pancake" recipe in the search results
And I should not see the "french toast" recipe in the search results

Scenario: Matching a word in the recipe summary
Given a "pancake" recipe with a "Yummy!" summary
And a "french toast" recipe with a "Delicious" summary
And a 1 second wait to allow the search index to be updated
When I search for "yummy"
Then I should see the "pancake" recipe in the search results
expected following output to contain a <a href='/recipes/id-pancake'>pancake</a> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><table><tr>
<th>Name</th>
<th>Date</th>
</tr></table></body></html> (Spec::Expectations::ExpectationNotMetError)
./features/step_definitions/recipe_search.rb:82:in `Then /^I should see the "(.+)" recipe in the search results$/'
features/recipe_search.feature:22:in `Then I should see the "pancake" recipe in the search results'
And I should not see the "french toast" recipe in the search results

Scenario: Matching a word stem in the recipe instructions
Given a "pancake" recipe with instructions "mixing together dry ingredients"
And a "french toast" recipe with instructions "whisking the eggs"
And a 1 second wait to allow the search index to be updated
When I search for "whisk"
Then I should not see the "pancake" recipe in the search results
And I should see the "french toast" recipe in the search results
expected following output to contain a <a href='/recipes/id-french-toast'>french toast</a> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><table><tr>
<th>Name</th>
<th>Date</th>
</tr></table></body></html> (Spec::Expectations::ExpectationNotMetError)
./features/step_definitions/recipe_search.rb:82:in `And /^I should see the "(.+)" recipe in the search results$/'
features/recipe_search.feature:32:in `And I should see the "french toast" recipe in the search results'

...
7 scenarios
2 scenarios pending
16 steps passed
2 steps failed
7 steps skipped
5 steps pending (5 with no step definition)
Hunh? I just got those two scenarios passing, why are they failing?

After much trial-and-error investigation, I find that couchdb-lucene (or CouchDB itself) requires some settle time in between database tear-down and puts. I am not happy with that as an explanation, but it serves as an accurate observation. As time permits, I will go back and dig up a satisfactory explanation.

Back when I first got cucumber to drive Sinatra / CouchDB tests, I added the database puts and tear-downs as Before and After blocks to features/support/env.rb. Experimentation proves that sleeping for half a second after a tear-down resolves the problem:
After do
  RestClient.delete @@db
  sleep 0.5
end
Lower sleep times work sporadically, while half a second always seems to work.

I did try to add a sleep to the beginning of the scenarios, something along the lines of "Given a 0.5 second wait for the newly created database to settle". That always failed. This leads me to believe that there is some thread in CouchDB that is not awakening in between rapid deletes and recreates of the database. The end result is that couchdb-lucene is completely unaware that it needs to re-index.

To ease the blow of having to wait 0.5 seconds after every tear-down (after every scenario), I also tweak the sleep that I added in the search scenarios (e.g. "Given a 1 second wait to allow the search index to be updated"). That delay was added to allow couchdb-lucene to index newly added documents (as opposed to recognizing newly created databases). Rather than wait a full second for the newly created documents to be indexed, I change them to wait 0.25 seconds. To accommodate this change, the step definition needs to honor floats rather than integers:
Given /^a ([.\d]+) second wait/ do |seconds|
  sleep seconds.to_f
end
Et voilà, all the finished scenarios pass when run together:
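The pattern and the to_f conversion can be checked outside of cucumber. A small sketch, where WAIT_STEP and wait_seconds are hypothetical names wrapping the same regular expression:

```ruby
WAIT_STEP = /^a ([.\d]+) second wait/

# Extract the wait time from the step text as a Float, or nil if the
# text is not a wait step.
def wait_seconds(step_text)
  match = WAIT_STEP.match(step_text)
  match && match[1].to_f
end

wait_seconds("a 0.25 second wait to allow the search index to be updated") # => 0.25
wait_seconds("a 1 second wait to allow the search index to be updated")    # => 1.0
```

The character class [.\d]+ accepts both "1" and "0.25", so the older whole-second steps keep working.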
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Matching a word in the ingredient list in full recipe search
Given a "pancake" recipe with "chocolate chips" in it
And a "french toast" recipe with "eggs" in it
And a 0.25 second wait to allow the search index to be updated
When I search for "chocolate"
Then I should see the "pancake" recipe in the search results
And I should not see the "french toast" recipe in the search results

Scenario: Matching a word in the recipe summary
Given a "pancake" recipe with a "Yummy!" summary
And a "french toast" recipe with a "Delicious" summary
And a 0.25 second wait to allow the search index to be updated
When I search for "yummy"
Then I should see the "pancake" recipe in the search results
And I should not see the "french toast" recipe in the search results

Scenario: Matching a word stem in the recipe instructions
Given a "pancake" recipe with instructions "mixing together dry ingredients"
And a "french toast" recipe with instructions "whisking the eggs"
And a 0.25 second wait to allow the search index to be updated
When I search for "whisk"
Then I should not see the "pancake" recipe in the search results
And I should see the "french toast" recipe in the search results

Scenario: Searching titles
...
7 scenarios
2 scenarios pending
19 steps passed
6 steps skipped
5 steps pending (5 with no step definition)

(commit)

Fork couchdb-lucene

I forked Robert Newson's excellent couchdb-lucene to include my stemming analyzer. I wanted to be able to track his changes (of which there are many), while keeping my changes in there. Forking makes this easy.

Forking may also afford a chance to investigate how to get it working better with quick turnaround tear-down / database puts that are needed in cucumber scenarios. But that is for another day. Maybe.

Friday, April 17, 2009

Stem Searching with couchdb-lucene

‹prev | My Chain | next›

Next up in my scenarios is Matching a word stem in the recipe instructions. Word stems reduce words to their lowest common denominator so that searching for the word "whisk" will match documents containing the word "whisking".
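To make the idea concrete, here is a deliberately crude Ruby stand-in for a stemmer. It only strips a handful of English suffixes and is not the Porter algorithm; crude_stem and crude_tokens are hypothetical names used purely for illustration:

```ruby
# Toy stemmer: lowercase the word and strip one common suffix.
# This is NOT the Porter algorithm.
def crude_stem(word)
  word.downcase.sub(/(ing|ed|es|s)\z/, "")
end

# Split text into alphabetic tokens and stem each one.
def crude_tokens(text)
  text.scan(/[A-Za-z]+/).map { |word| crude_stem(word) }
end

crude_tokens("whisking the eggs")
# => ["whisk", "the", "egg"]
```

Because both the indexed text and the search term pass through the same reduction, "whisk" and "whisking" meet at the same token.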

The entire scenario:
    Scenario: Matching a word stem in the recipe instructions

Given a "pancake" recipe with instructions "mixing together dry ingredients"
And a "french toast" recipe with instructions "whisking the eggs"
And a 1 second wait to allow the search index to be updated
When I search for "whisk"
Then I should not see the "pancake" recipe in the search results
And I should see the "french toast" recipe in the search results
As with the last scenario, there are relatively few steps that need to be implemented anew. The Given a recipe with instructions step can be implemented thusly:
Given /^a "(.+)" recipe with instructions "(.+)"$/ do |title, instructions|
  date = Date.new(2009, 4, 16)
  permalink = "id-#{title.gsub(/\W/, '-')}"

  recipe = {
    :title => title,
    :date => date,
    :instructions => instructions
  }

  RestClient.put "#{@@db}/#{permalink}",
    recipe.to_json,
    :content_type => 'application/json'
end
This is really starting to look familiar. My red-green-refactor cycle may need a little more refactor. Another day.

With that in place, I have but one failure remaining:
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Matching a word stem in the recipe instructions
Given a "pancake" recipe with instructions "mixing together dry ingredients"
And a "french toast" recipe with instructions "whisking the eggs"
And a 1 second wait to allow the search index to be updated
When I search for "whisk"
Then I should not see the "pancake" recipe in the search results
And I should see the "french toast" recipe in the search results
expected following output to contain a <a href='/recipes/id-french toast'>french toast</a> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><table><tr>
<th>Name</th>
<th>Date</th>
</tr></table></body></html> (Spec::Expectations::ExpectationNotMetError)
./features/step_definitions/recipe_search.rb:82:in `And /^I should see the "(.+)" recipe in the search results$/'
features/recipe_search.feature:32:in `And I should see the "french toast" recipe in the search results'


1 scenario
5 steps passed
1 step failed
This failure shows that no recipes are showing up in the search results, which means that stemming is not being used in couchdb-lucene. Inspecting src/main/java/com/github/rnewson/couchdb/lucene/Config.java, one can see that it uses the (non-stemming) StandardAnalyzer:
...
final class Config {

    static final Analyzer ANALYZER = new StandardAnalyzer();
    ...
}
To get it using a custom (stemming) analyzer, create src/main/java/com/github/rnewson/couchdb/lucene/MyAnalyzer.java:
package com.github.rnewson.couchdb.lucene;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;

import java.io.Reader;

class MyAnalyzer extends Analyzer {
    public final TokenStream tokenStream(String fieldName, Reader reader) {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}
There is nothing fancy in there; it is taken directly from the Lucene documentation. Then, change the configuration to use MyAnalyzer:
...
final class Config {

    static final Analyzer ANALYZER = new MyAnalyzer();

    ...
}
Finally, compile the jar files with maven by invoking mvn. My local development version of CouchDB is already pointing to the compiled jar, so all I need to do is start it up with ./utils/run and re-run cucumber:
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature -n -s "Matching a word stem in the recipe instructions"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Matching a word stem in the recipe instructions
Given a "pancake" recipe with instructions "mixing together dry ingredients"
And a "french toast" recipe with instructions "whisking the eggs"
And a 1 second wait to allow the search index to be updated
When I search for "whisk"
Then I should not see the "pancake" recipe in the search results
And I should see the "french toast" recipe in the search results
expected following output to contain a <a href='/recipes/id-french toast'>french toast</a> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><table>
<tr>
<th>Name</th>
<th>Date</th>
</tr>
<tr class="row0">
<td>
<a href="/recipes/id-french-toast">french toast</a>
</td>
<td>2009-04-16</td>
</tr>
</table></body></html> (Spec::Expectations::ExpectationNotMetError)
./features/step_definitions/recipe_search.rb:82:in `And /^I should see the "(.+)" recipe in the search results$/'
features/recipe_search.feature:32:in `And I should see the "french toast" recipe in the search results'


1 scenario
5 steps passed
1 step failed
Hunh?! The french toast recipe (that requires "whisking") is now showing up in the search results, why is it failing?

Ah nuts, the link being tested for is missing a dash. Add a gsub to the step:
Then /^I should see the "(.+)" recipe in the search results$/ do |title|
  response.should have_selector("a",
    :href => "/recipes/id-#{title.gsub(/\W/, '-')}",
    :content => title)
end
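The gsub is the same one the Given steps use to build permalinks, so the expectation and the stored document ID now agree. A quick check in plain Ruby, with permalink_for as a hypothetical helper name:

```ruby
# Replace every non-word character in the title with a dash, mirroring
# how the step definitions build recipe IDs.
def permalink_for(title)
  "id-#{title.gsub(/\W/, '-')}"
end

permalink_for("pancake")      # => "id-pancake"
permalink_for("french toast") # => "id-french-toast"
```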
And we have verified stemming working!
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature -n -s "Matching a word stem in the recipe instructions"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Matching a word stem in the recipe instructions
Given a "pancake" recipe with instructions "mixing together dry ingredients"
And a "french toast" recipe with instructions "whisking the eggs"
And a 1 second wait to allow the search index to be updated
When I search for "whisk"
Then I should not see the "pancake" recipe in the search results
And I should see the "french toast" recipe in the search results


1 scenario
6 steps passed

(commit)

Update: I forked couchdb-lucene so that I could continue to use the stemming analyzer, while still tracking changes to the master.

Thursday, April 16, 2009

A Quicky Scenario

‹prev | My Chain | next›

Up next in my chain is the search for a keyword in the recipe summary. The scenario:
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature \
-n -s "Matching a word in the recipe summary"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Matching a word in the recipe summary
Given a "pancake" recipe with a "Yummy!" summary
And a "french toast" recipe with a "Delicious" summary
When I search for "yummy"
Then I should see the "pancake" recipe in the search results
And I should not see the "french toast" recipe in the search results


1 scenario
2 steps skipped
3 steps pending (3 with no step definition)

You can use these snippets to implement pending steps which have no step definition:

Given /^a "pancake" recipe with a "Yummy!" summary$/ do
end

Given /^a "french toast" recipe with a "Delicious" summary$/ do
end

When /^I search for "yummy"$/ do
end
The two Given steps (for the "pancake" and "french toast" recipes) look very similar. I implement the "pancake" one first, then generalize it to work with both:
Given /^a "(.+)" recipe with a "(.+)" summary$/ do |title, keyword|
  date = Date.new(2009, 4, 12)
  permalink = "id-#{title.gsub(/\W/, '-')}"

  recipe = {
    :title => title,
    :date => date,
    :summary => "This is #{keyword}"
  }

  RestClient.put "#{@@db}/#{permalink}",
    recipe.to_json,
    :content_type => 'application/json'
end
The next Given step, the 1 second wait, is already complete.

The When step looks suspiciously like the When step from last night's work (When I search for "chocolate"). To make that step definition work for both, it can be quickly reworked as:
When /^I search for "(.+)"$/ do |keyword|
  visit("/recipes/search?q=#{keyword}")
end
I get the first Then step for free thanks to my violation of YAGNI yesterday. Luckily I did need it this time!

Last up for this scenario is specifying that the search results should not include the french toast recipe (because it does not have the word "yummy" in it). I implement this step similarly to the should have step:
Then /^I should not see the "(.+)" recipe in the search results$/ do |title|
  response.should_not have_selector("a", :content => title)
end
This is slightly less specific than the positive match (not matching the href). When matching things, you should do everything possible to find the right element—be as specific as possible. When verifying the absence of something, you should also do everything possible to verify that it is not there—by being a little less specific.

That does it. An entire scenario finished without adding any application code, which means that the couchdb-lucene indexing code from the other night already implements these steps. The lack of unit tests covering that indexing code makes these scenario steps that much more important for verification of the functionality.
(commit)

Wednesday, April 15, 2009

Inside-out with couchdb-lucene

‹prev | My Chain | next›

With couchdb-lucene returning data along with results, I get my red-green-refactor on tonight to finish implementing the first Recipe Search scenario.

My initial effort on this ended with the search action responding with a simple string. To get full output, a template is needed. The spec doc that I end up implementing is:
cstrom@jaynestown:~/repos/eee-code$ spec ./spec/views/search.haml_spec.rb  -cfs

search.haml
- should display the recipe's title
- should display a second recipe
- should display zebra strips
- should link the title to the recipe
- should display the recipe's date

Finished in 0.031609 seconds

5 examples, 0 failures
Check the commit if you are interested in the details of the individual specs. The Haml template that implements these 5 examples is still relatively simple at this point:
%table
  %tr
    %th= "Name"
    %th= "Date"
  - @results['rows'].each_with_index do |result, i|
    %tr{:class => "row#{i % 2}"}
      %td
        %a{:href => "/recipes/#{result['_id']}"}= result['title']
      %td= result['date']
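The zebra striping in the template comes entirely from the row index modulo 2. A tiny sketch of that calculation on its own, with zebra_classes as a hypothetical name:

```ruby
# Alternate between "row0" and "row1" based on each row's position.
def zebra_classes(rows)
  rows.each_with_index.map { |_row, i| "row#{i % 2}" }
end

zebra_classes(%w[pancake oatmeal waffles])
# => ["row0", "row1", "row0"]
```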
Finally, working my way back out to the scenario I perform some accidental refactoring. The scenario that I need to implement is:
    Scenario: Matching a word in the ingredient list in full recipe search

Given a "pancake" recipe with "chocolate chips" in it
And a "french toast" recipe with "eggs" in it
And a 1 second wait to allow the search index to be updated
When I search for "chocolate"
Then I should see the "pancake" recipe in the search results
And I should not see the "french toast" recipe in the search results
The accidental refactoring took place in the first Then's definition. The original implementation was:
Then /^I should see the "pancake" recipe in the search results$/ do
  response.should have_selector("a", :href => "/recipes/#{@pancake_permalink}")
end
The accidental refactoring took place when I misread the last Then statement to be in the same format as the first (I missed the addition of the word "not"). To work with both forms, the block-with-argument step definition works:
Then /^I should see the "(.+)" recipe in the search results$/ do |title|
  response.should have_selector("a",
                                :href => "/recipes/id-#{title}",
                                :content => title)
end
Chagrined to see that the final step was still not implemented, I correct my omission, but leave the refactored definition in place. This is not a simple violation of YAGNI, because I am going to need it. Upcoming scenarios can use the generalized format. Still, I must be more careful before refactoring.
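To see why the generalized definition captures the title but still leaves the "should not see" step unmatched, here is a minimal sketch in plain Ruby, outside of Cucumber. The `SEE_STEP` constant and `title_from` helper are illustrative names, not part of the application:

```ruby
# The same regular expression used by the generalized step definition.
SEE_STEP = /^I should see the "(.+)" recipe in the search results$/

# Hypothetical helper: returns the captured title, or nil when the
# step text does not match the pattern.
def title_from(step_text)
  match = SEE_STEP.match(step_text)
  match && match[1]
end

puts title_from('I should see the "pancake" recipe in the search results').inspect
puts title_from('I should not see the "french toast" recipe in the search results').inspect
```

The first call yields the captured title. The second yields nil because the word "not" breaks the match, which is why Cucumber still reported that step as lacking a definition.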
(commit)

Tuesday, April 14, 2009

Recipe Search Results

‹prev | My Chain | next›

Currently, the search results from couchdb-lucene are of the format:
{
"q":"+_db:eee +all:wheatberries",
"etag":"1209e596ea8",
"skip":0,
"limit":25,
"total_rows":1,
"search_duration":2,
"fetch_duration":1,
"rows":[
{
"_id":"2008-07-19-oatmeal",
"score":0.5951423645019531
}
]
}
The _id is enough information to retrieve the recipe. To display things like the recipe title or the date on which it was added to the cookbook, a separate request for each recipe will be needed. On a search results page with 25, 50, or 100 results, that is 25, 50, or 100 separate requests of the CouchDB server. Luckily, that overhead is not necessary with lucene—it is possible to store computed values in the index.

The couchdb-lucene API for accomplishing this is doc.field('key', value, 'yes'), where doc is the lucene document instance, 'key' is the key for the field, value is the value associated with that key, and 'yes' indicates that the value should be stored in addition to indexed.

In the lucene design document that I am using, the local variable doc is the CouchDB record and ret is an instance of a lucene document. To add date and title to the search results, I add the following code:
  ret.field('date',  doc['date'],  'yes');
ret.field('title', doc['title'], 'yes');
Searching with that design document in place returns:
{"q":"+_db:eee +all:wheatberries",
"etag":"1209e596eaa",
"skip":0,
"limit":25,
"total_rows":1,
"search_duration":1,
"fetch_duration":1,
"rows":[
{
"_id":"2008-07-19-oatmeal",
"date":"2008/07/19",
"title":"Multi-grain Oatmeal",
"score":0.6080121994018555
}
]
}
Now I can display the date, title, and link to the recipe in the recipe search results without the need for a separate request.
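As a rough sketch of what that buys, the stored fields can be turned into result links with nothing more than the search response itself. The `SAMPLE_RESPONSE` constant and `result_lines` helper are illustrative assumptions; the JSON is a trimmed copy of the response above:

```ruby
require 'json'

# Trimmed copy of the couchdb-lucene response shown above.
SAMPLE_RESPONSE = <<JSON_DOC
{"total_rows":1,
 "rows":[{"_id":"2008-07-19-oatmeal",
          "date":"2008/07/19",
          "title":"Multi-grain Oatmeal",
          "score":0.6080121994018555}]}
JSON_DOC

# Hypothetical helper: builds one link per row using only the fields
# stored in the index, so no per-recipe request to CouchDB is needed.
def result_lines(body)
  JSON.parse(body)['rows'].map do |row|
    %Q|<a href="/recipes/#{row['_id']}">#{row['title']}</a> (#{row['date']})|
  end
end

puts result_lines(SAMPLE_RESPONSE)
```

This sketch does no HTML escaping of the title, which a real template would want.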
(commit, adding this to the test DB)

Monday, April 13, 2009

Cucumber and couchdb-lucene

‹prev | My Chain | next›

Yesterday, I ended work seemingly unable to search a couchdb-lucene index via Cucumber. At the time, I suspected the design document of being the source of the trouble. In my spike, I had created the design document via the futon interface to CouchDB. That is not an option in an automated test, so I used RestClient to put the design document.

Brief investigation revealed that this was not the problem. Creating the design document in the development database via RestClient worked as expected.

The source of the trouble turns out to be the frequency with which couchdb-lucene updates its index. Fortunately, couchdb-lucene exposes configuration for the time it will wait after updates before re-indexing. The configuration value that controls this is couchdb.lucene.commit.min. This is set in the couchdb.ini file, or, in the case of running locally, the local_dev.ini file:
;; Do NOT set couchdb.lucene.commit.min this low in production!!!
[update_notification]
indexer=/usr/bin/java -Dcouchdb.lucene.commit.min=50 -jar /home/cstrom/repos/couchdb-lucene/target/couchdb-lucene-0.3-SNAPSHOT-jar-with-dependencies.jar -index
This sets the delay between an update and re-indexing to 50 milliseconds, which is much better for testing purposes than the default of 10 seconds.

There is a similar couchdb.lucene.commit.max configuration value. Unlike the min setting, it will not wait for updates to complete once this setting is reached—after couchdb.lucene.commit.max milliseconds, couchdb-lucene will re-index regardless. In practice, this setting was not as effective as the min.

A delay is going to be needed in the cucumber scenarios. Something like this ought to do:
      And a 5 second wait to allow the search index to be updated
I may end up tweaking that, so I make the amount of time to wait configurable:
Given /^a (\d+) second wait to allow the search index to be updated$/ do |seconds|
  sleep seconds.to_i
end
With that in place, my first search scenario now passes:
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature -n \
-s "Matching a word in the ingredient list in full recipe search"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Matching a word in the ingredient list in full recipe search
Given a "pancake" recipe with "chocolate chips" in it
And a "french toast" recipe with "eggs" in it
And a 5 second wait to allow the search index to be updated
When I search for "chocolate"
Then I should see the "pancake" recipe in the search results
And I should not see the "french toast" recipe in the search results


1 scenario
5 steps passed
1 step pending (1 with no step definition)

You can use these snippets to implement pending steps which have no step definition:

Then /^I should not see the "french toast" recipe in the search results$/ do
end
I do tweak the wait down to 1 second without introducing race-condition-like failures. At some point I may need a sub-second wait, but I can live with a one-second delay for now.
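If a sub-second wait does become necessary, one possible approach is to let the step's pattern accept a decimal and convert it with `to_f` rather than `to_i`. A minimal sketch of just the matching and conversion, in plain Ruby (`WAIT_STEP` and `wait_seconds` are illustrative names):

```ruby
# Variant of the wait step's pattern that also accepts values like 0.5.
WAIT_STEP = /^a (\d+(?:\.\d+)?) second wait to allow the search index to be updated$/

# Hypothetical helper: returns the wait as a Float, or nil if the
# step text does not match.
def wait_seconds(step_text)
  match = WAIT_STEP.match(step_text)
  match && match[1].to_f
end

puts wait_seconds('a 0.5 second wait to allow the search index to be updated')
puts wait_seconds('a 1 second wait to allow the search index to be updated')
```

Inside the step definition itself the body would then be `sleep seconds.to_f`, since `sleep` accepts fractional seconds.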

Sunday, April 12, 2009

Implementing Recipe Search, Part 1

‹prev | My Chain | next›

With recipe show / details done, next up is recipe search. A pair of spikes helped me to understand how I might implement full text searching. Now it is time to do it for real.

The feature, as described in Cucumber format:
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
The first scenario, also in cucumber format:
    Scenario: Matching a word in the ingredient list in full recipe search

Given a "pancake" recipe with "chocolate chips" in it
And a "french toast" recipe with "eggs" in it
When I search for "chocolate"
Then I should see the "pancake" recipe in the search results
And I should not see the "french toast" recipe in the search results
It is a good thing I finally figured out full document indexing the other day, otherwise I might have to defer this particular scenario to another day.

The first two Given steps are easy enough to implement at this point, since several similar ones were needed during the recipe details feature work. By way of illustration, the a "french toast" recipe with "eggs" in it step is implemented in the newly created features/step_definitions/recipe_search.rb as:
Given /^a "french toast" recipe with "eggs" in it$/ do
  @date = Date.new(2009, 4, 12)
  @title = "French Toast"
  @permalink = @date.to_s + "-" + @title.downcase.gsub(/\W/, '-')

  recipe = {
    :title => @title,
    :date => @date,
    :preparations => [
      {
        'quantity' => '1',
        'ingredient' => { 'name' => 'egg'}
      }
    ]
  }

  RestClient.put "#{@@db}/#{@permalink}",
    recipe.to_json,
    :content_type => 'application/json'
end
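The permalink line in that step is worth a closer look, since the document ID doubles as the URL. A minimal sketch of the same expression pulled out into a helper (`permalink_for` is a hypothetical name, not something in the application):

```ruby
require 'date'

# Hypothetical helper wrapping the permalink expression from the step:
# ISO date, a dash, then the title lowercased with every non-word
# character replaced by a dash.
def permalink_for(date, title)
  date.to_s + "-" + title.downcase.gsub(/\W/, '-')
end

puts permalink_for(Date.new(2009, 4, 12), "French Toast")
# => 2009-04-12-french-toast
```

Because `\W` replaces each non-word character individually, a title containing punctuation followed by a space would produce consecutive dashes.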
Next up is the When I search for "chocolate" step. I need to move into the application in order to implement search. Before moving into the nitty-gritty of the application, I realize that my test DB does not have the full document, full text search definition that I added to my development DB the other night.

Uh-oh.

I am adding something to my test DB that I already added to my development DB. That same something will need to be in my production DB. Sounds like database migrations to me. Ugh. Something for another day. For now I will add it to the Before block of features/support/env.rb, but will have to address this in the very near future. The Before block with the new code:
Before do
  RestClient.put @@db, { }

  # TODO need to accomplish this via CouchDB migrations
  lucene_index_function = <<_JS
function(doc) {
  var ret = new Document();

  function idx(obj) {
    for (var key in obj) {
      switch (typeof obj[key]) {
      case 'object':
        idx(obj[key]);
        break;
      case 'function':
        break;
      default:
        ret.field(key, obj[key]);
        ret.field('all', obj[key]);
        break;
      }
    }
  }

  idx(doc);

  return ret;
}
_JS

  doc = { 'transform' => lucene_index_function }

  RestClient.put "#{@@db}/_design/lucene",
    doc.to_json,
    :content_type => 'application/json'
end
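To make the effect of the recursive idx function concrete, here is a rough Ruby analogue that walks a recipe hash the same way and collects the key/value pairs that would be indexed. This is only an illustration of the traversal, not the couchdb-lucene API; `index_fields` and `SAMPLE_RECIPE` are hypothetical names:

```ruby
# Ruby analogue of the JavaScript idx() function above (illustration only).
# Every leaf value is recorded twice: once under its own key and once
# under the catch-all 'all' key.
def index_fields(obj, fields = [])
  pairs = obj.is_a?(Array) ? obj.each_with_index.map { |v, i| [i.to_s, v] } : obj.to_a
  pairs.each do |key, value|
    case value
    when Hash, Array
      index_fields(value, fields)   # descend into nested structures
    when nil
      next                          # the JavaScript version also skips null
    else
      fields << [key.to_s, value]
      fields << ['all', value]
    end
  end
  fields
end

SAMPLE_RECIPE = {
  'title' => 'French Toast',
  'preparations' => [
    { 'quantity' => '1', 'ingredient' => { 'name' => 'egg' } }
  ]
}

index_fields(SAMPLE_RECIPE).each { |key, value| puts "#{key}: #{value}" }
```

The deeply nested ingredient name ends up in the 'all' field alongside the title, which is what lets a single query term match anywhere in the document.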
To try this out, I need to implement my When search step. In order to do that, I need to work my way into the code so that I can implement the search.

For the target scenario, the user should be able to search for a term anywhere (title, summary, ingredients) in the recipe document. This is the "all" field that was created the other night. The API that I would like to expose is that if I query for "eggs" in the Sinatra app, it should be passed on as an "all" search to couchdb. The example that describes this behavior:
describe "GET /recipes/search" do
  it "should retrieve search results from couchdb-lucene" do
    RestClient.should_receive(:get).
      with("#{@@db}/_fti?q=all:eggs").
      and_return('{"total_rows":1}')

    get "/recipes/search?q=eggs"
  end
end
The code that implements this is:
get '/recipes/search' do
  data = RestClient.get "#{@@db}/_fti?q=all:#{params[:q]}"
  @results = JSON.parse(data)

  ["results:", @results['total_rows'].to_s]
end
This example does not render any results. The current step that is being implemented is the "When I search...". This example and solution are the simplest things that work. I will worry about the results when I get to the Then steps.
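One thing the simplest-thing-that-works version glosses over is that the query term is interpolated into the URL as-is. A minimal sketch of building the same URL with the term escaped, assuming the search server decodes standard query-string escaping (the `search_url` helper and the database URL are hypothetical):

```ruby
require 'cgi'

# Hypothetical helper: builds the couchdb-lucene query URL with the
# user's term escaped for use in a query string.
def search_url(db, term)
  "#{db}/_fti?q=all:#{CGI.escape(term)}"
end

# The database URL here is a placeholder, not the application's real setting.
puts search_url('http://localhost:5984/eee-test', 'eggs')
puts search_url('http://localhost:5984/eee-test', 'chocolate chips')
```

A single word such as "eggs" passes through unchanged, so the example above would still match. Only terms with spaces or reserved characters come out differently.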

After working my way out to implement the "When I search" step with a simple Webrat visit, I am ready to give the Then steps a try. Without getting into too many details with regard to the output format, I try to implement this example:
it "should include a link to a match" do
  RestClient.should_receive(:get).
    with("#{@@db}/_fti?q=all:eggs").
    and_return('{"total_rows":1,"rows":[{"_id":"007"}]}')

  get "/recipes/search?q=eggs"
  response.should have_selector("a", :href => "/recipes/007")
end
With this code:
get '/recipes/search' do
  data = RestClient.get "#{@@db}/_fti?q=all:#{params[:q]}"
  @results = JSON.parse(data)

  ["results: #{@results['total_rows']}<br/>"] +
    @results['rows'].map do |result|
      %Q|<a href="/recipes/#{result['_id']}">title</a>|
    end
end
The search results iterate over the "rows" in the results, mapping to the desired recipe link. Without even bothering to get the title included in the output, I pop back out to the cucumber scenario to see if this attempt at a Then step might succeed:
Then /^I should see the "pancake" recipe in the search results$/ do
  response.should have_selector("a", :href => "/recipes/#{@pancake_permalink}")
end
Unfortunately, it does not:
cstrom@jaynestown:~/repos/eee-code$ cucumber features/recipe_search.feature -n \
-s "Matching a word in the ingredient list in full recipe search"
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes
Scenario: Matching a word in the ingredient list in full recipe search
Given a "pancake" recipe with "chocolate chips" in it
And a "french toast" recipe with "eggs" in it
When I search for "chocolate"
Then I should see the "pancake" recipe in the search results
expected following output to contain a <a href='/recipes/2009-04-12-buttermilk-chocolate-chip-pancakes'/> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>results: 0<br></p></body></html> (Spec::Expectations::ExpectationNotMetError)
./features/step_definitions/recipe_search.rb:51:in `Then /^I should see the "pancake" recipe in the search results$/'
features/recipe_search.feature:12:in `Then I should see the "pancake" recipe in the search results'
And I should not see the "french toast" recipe in the search results


1 scenario
3 steps passed
1 step failed
1 step pending (1 with no step definition)

You can use these snippets to implement pending steps which have no step definition:

Then /^I should not see the "french toast" recipe in the search results$/ do
end
Hmm. I am not getting any search results back. My best guess is that my lucene design document is not working as expected when uploading via RestClient.

It is late, so I will have to test that guess tomorrow. I do not mind stopping at Red during the Red-Green-Refactor cycle—I know exactly where to pick up tomorrow.
(commit)