Friday, April 10, 2009

Oooh! Updated couchdb-lucene


I started worrying about indexing whole documents as soon as I got couchdb-lucene working with my prototype database. By default, couchdb-lucene indexes attributes individually. This means that the index can be searched for recipes whose titles contain the word "chilled", but there is no way to search for documents that contain the word "chilled" anywhere.

That nagging concern has remained in the intervening weeks, so I was quite excited to see an example of how to do this on Robert Newson's fork of couchdb-lucene.

To get that up and running against my local copy of CouchDB 0.9, I need to update my local clone of couchdb-lucene and rebuild:
cd ~/repos/couchdb-lucene
git pull
mvn
To point my local copy of CouchDB to the new version of the lucene indexer, I update etc/couchdb/local_dev.ini:
; CouchDB Configuration Settings

; Custom settings should be made in this file. They will override settings
; in default.ini, but unlike changes made to default.ini, this file won't be
; overwritten on server upgrade.

[couchdb]
;max_document_size = 4294967296 ; bytes

[httpd]
; port = 5985
;bind_address = 127.0.0.1

[log]
; level = debug

[update_notification]
;unique notifier name=/full/path/to/exe -with "cmd line arg"

[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /home/cstrom/repos/couchdb-lucene/target/couchdb-lucene-0.3-SNAPSHOT-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /home/cstrom/repos/couchdb-lucene/target/couchdb-lucene-0.3-SNAPSHOT-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
To choose a different indexing algorithm, a new design document is needed. In Futon, choose "Design documents" from the "Select view" drop-down:



Next, choose "Create Document ..." from the top of the UI, and enter "_design/lucene" for the name:



Then add the following code (from the couchdb-lucene documentation), which indexes the entire document in the all field:
function(doc) {
  var ret = new Document();

  function idx(obj) {
    for (var key in obj) {
      switch (typeof obj[key]) {
        case 'object':
          idx(obj[key]);
          break;
        case 'function':
          break;
        default:
          ret.field(key, obj[key]);
          ret.field('all', obj[key]);
          break;
      }
    }
  }

  idx(doc);

  return ret;
}
This code goes into a new transform attribute (make sure to enclose it in quotes):



Finally, make sure to save the document! I always seem to forget this step which causes all sorts of confusion.
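The quotes are needed because JSON has no function type, so the transform has to be stored as a string. As a rough sketch of what ends up in the design document, the following builds the same structure in JavaScript rather than in Futon. The buildDesignDoc helper is hypothetical (it is not part of CouchDB or couchdb-lucene), and Document is never called here because only the function's source text is needed:

```javascript
// The transform from the couchdb-lucene documentation, held as an anonymous
// function expression so that toString() yields "function(doc) {...}".
// `Document` is supplied by couchdb-lucene at index time; it is not defined
// or called in this sketch.
var transform = function(doc) {
  var ret = new Document();

  function idx(obj) {
    for (var key in obj) {
      switch (typeof obj[key]) {
        case 'object':
          idx(obj[key]);
          break;
        case 'function':
          break;
        default:
          ret.field(key, obj[key]);
          ret.field('all', obj[key]);
          break;
      }
    }
  }

  idx(doc);

  return ret;
};

// Hypothetical helper: store the function source as a string under the
// "transform" attribute of the _design/lucene document.
function buildDesignDoc(fn) {
  return { _id: '_design/lucene', transform: fn.toString() };
}

var designDoc = buildDesignDoc(transform);
var designDocJson = JSON.stringify(designDoc);
console.log(designDocJson);
```

JSON.stringify takes care of escaping the newlines and any quotes inside the function body, which is easy to get wrong when pasting by hand.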

To test the full-document search on the all field, use curl:
cstrom@jaynestown:~$ curl http://localhost:5984/eee/_fti?q=all:wheatberries
{"q":"+_db:eee +all:wheatberries",
"etag":"1202191c3d1",
"skip":0,
"limit":25,
"total_rows":1,
"search_duration":1,
"fetch_duration":0,
"rows":[{"_id":"2008-07-19-oatmeal",
"score":0.5710114240646362}]}
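The response is plain JSON, so it is easy to pick apart in a script. A minimal sketch, using the response body from the curl output above as a literal string; matchedIds is a hypothetical helper name, not something couchdb-lucene provides:

```javascript
// Response body copied from the curl output above.
var body = '{"q":"+_db:eee +all:wheatberries",' +
  '"etag":"1202191c3d1","skip":0,"limit":25,"total_rows":1,' +
  '"search_duration":1,"fetch_duration":0,' +
  '"rows":[{"_id":"2008-07-19-oatmeal","score":0.5710114240646362}]}';

// Hypothetical helper: reduce a search response to the matched document IDs.
function matchedIds(responseText) {
  var result = JSON.parse(responseText);
  return result.rows.map(function (row) {
    return row._id;
  });
}

var ids = matchedIds(body);
console.log(ids);
```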
Just to be sure that something has not broken after the change, I check the same search on the instructions field, which pulls back the same result ("wheatberries" was mentioned in the instructions of the oatmeal recipe):
cstrom@jaynestown:~$ curl http://localhost:5984/eee/_fti?q=instructions:wheatberries
{"q":"+_db:eee +instructions:wheatberries",
"etag":"1202191c3d1",
"skip":0,
"limit":25,
"total_rows":1,
"search_duration":1,
"fetch_duration":1,
"rows":[{"_id":"2008-07-19-oatmeal",
"score":0.6242526769638062}]}
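Both queries match because the transform adds every value twice: once under its own key and once under all. A small sketch makes this visible outside of CouchDB. The Document class below is a stand-in that only records field() calls (an assumption for the demo, not couchdb-lucene's real class), and the sample recipe is made up:

```javascript
// Stand-in for the Document class that couchdb-lucene provides at index
// time. It only records the calls made to field().
function Document() {
  this.fields = [];
}
Document.prototype.field = function (name, value) {
  this.fields.push([name, value]);
};

// The transform from the post, unchanged apart from being given a name.
function transform(doc) {
  var ret = new Document();

  function idx(obj) {
    for (var key in obj) {
      switch (typeof obj[key]) {
        case 'object':
          idx(obj[key]);
          break;
        case 'function':
          break;
        default:
          ret.field(key, obj[key]);
          ret.field('all', obj[key]);
          break;
      }
    }
  }

  idx(doc);

  return ret;
}

// Made-up recipe document with one nested object to exercise the recursion.
var sample = {
  _id: '2008-07-19-oatmeal',
  title: 'Oatmeal',
  instructions: 'Stir in the wheatberries.',
  prep: { minutes: 10 }
};

var indexed = transform(sample).fields;
console.log(indexed);
```

Each leaf value shows up as a pair of entries, so a term in the instructions is findable through either instructions: or all:. Note that nested values are indexed under their innermost key only (minutes here, not prep.minutes).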

5 comments:

  1. How did this work without having to specify the design document name?

  2. CouchDB documents do not need names -- just an _id ("_design/lucene" in this case).

  3. This comment has been removed by the author.

  4. A note for couchdb-lucene-0.4: after some hours of rookie trial and error, I found a combination of design docs and queries that worked for me:


    _design/lucene:

    {
      "_id": "_design/lucene",
      "_rev": "9-1003076217",
      "transform": "function(doc) { var ret = new Document(); function idx(obj) { for (var key in obj) { switch (typeof obj[key]) { case 'object': idx(obj[key]); break; case 'function': break; default: ret.field(key, obj[key]); ret.field('all', obj[key]); break; } } } idx(doc); return ret; }",
      "fulltext": {
        "by_title": {
          "defaults": {
            "store": "yes"
          },
          "index": "function(doc) { var ret=new Document(); ret.add(doc.title); return ret }"
        },
        "by_description": {
          "defaults": {
            "store": "no"
          },
          "index": "function(doc) { var ret=new Document(); ret.add(doc.description); return ret }"
        }
      }
    }


    A Query:

    curl http://127.0.0.1:5984/notes_development/_fti/lucene/by_title?q=pop*

  5. Here is an example for a newer version of couchdb-lucene:

    http://iphylo.blogspot.com/2010/11/couchdb-and-lucene.html
