Friday, August 7, 2009

Dumping CouchDB Documents with couch_design_docs


I had planned on setting up my VPS tonight, but I forgot the name of the provider that I was going to use (yup, getting old). While struggling to remember, I got to thinking about what I would do once I set up the server. After setting up the CouchDB server, I would definitely want to play with some real data. But how to get that real data?

I do not want to deal with the hassle of binary compatible versions of CouchDB (I think I am still using an old trunk version of CouchDB locally). I would also prefer not to have to do my legacy app dump—it is quite slow locally. Perhaps couch_design_docs can help?

Let's see... If it can solve this problem, that gem ought to be able to iterate over each document in my local database and dump them to the file system. Describing that iteration, in RSpec format:
    it "should be able to load each document" do
      Store.stub!(:get).
        with("uri/_all_docs").
        and_return({ "total_rows" => 2,
                     "offset" => 0,
                     "rows" => [{"id"=>"1", "value"=>{}, "key"=>"1"},
                                {"id"=>"2", "value"=>{}, "key"=>"2"}]})

      Store.stub!(:get).with("uri/1")
      Store.should_receive(:get).with("uri/2")

      @it.each { }
    end
The first stubbed method handles pulling back all documents in the database. The second stub handles the get of the first record. The actual expectation here is that iterating over each document in the store should retrieve the second record. Yah, I could have used two expectations ("should_receives"), but I really prefer one expectation per example.

At any rate, to get that example to pass:
    def each
      Store.get("#{url}/_all_docs")['rows'].each do |rec|
        yield Store.get("#{url}/#{rec['id']}")
      end
    end
If I have an each method, I might as well mix in Enumerable to get access to sweet methods like reject, select, all?, etc.:
    module CouchDesignDocs
      class Store
        include Enumerable
        #...
      end
    end
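To see what the mix-in buys, here is a minimal, self-contained sketch. SketchStore, the injected fetcher lambda, and the canned responses are all made up for illustration (the real Store talks to CouchDB over HTTP); only the shape of the each method follows the code above.

```ruby
# Hypothetical stand-in for the gem's Store: documents come from an
# injected fetcher (a lambda) instead of a live CouchDB server.
class SketchStore
  include Enumerable

  def initialize(url, fetcher)
    @url = url
    @fetcher = fetcher
  end

  # Same shape as the each method above: list the IDs via _all_docs,
  # then fetch and yield each document in turn.
  def each
    @fetcher.call("#{@url}/_all_docs")['rows'].each do |row|
      yield @fetcher.call("#{@url}/#{row['id']}")
    end
  end
end

# Canned responses keyed by URL (made-up data, for illustration only).
CANNED_RESPONSES = {
  'uri/_all_docs'   => { 'rows' => [{ 'id' => '1' }, { 'id' => '_design/foo' }] },
  'uri/1'           => { '_id' => '1' },
  'uri/_design/foo' => { '_id' => '_design/foo' }
}

store = SketchStore.new('uri', lambda { |url| CANNED_RESPONSES.fetch(url) })

# Enumerable builds reject, map, all?, and friends on top of each.
regular = store.reject { |doc| doc['_id'] =~ /^_design/ }
puts regular.map { |doc| doc['_id'] }.inspect
puts store.all? { |doc| doc.key?('_id') }
```

Every Enumerable call re-runs each, so in the real store each reject or select would mean another full round of HTTP requests.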
As I iterate over each document, I will need to store each on the file system. This seems to be a reasonable responsibility of the DocumentDirectory class. An example of this, in RSpec:
    it "should be able to save a document as JSON" do
      file = mock("File", :close => true)
      File.stub!(:new).and_return(file)

      file.should_receive(:write).with(%Q|{"_id":"foo"}|)

      @it.store_document({'_id' => 'foo'})
    end
To make this example pass:
    module CouchDesignDocs
      class DocumentDirectory

        attr_accessor :couch_doc_dir
        #...
        def store_document(doc)
          file = File.new("#{couch_doc_dir}/#{doc['_id']}.json", "w+")
          file.write(doc.to_json)
          file.close
        end
      end
    end
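For a quick check outside of the mocks, the following sketch writes a document into a temporary directory and reads it back. SketchDirectory is a hypothetical stand-in, not the gem's DocumentDirectory, and it swaps File.new plus an explicit close for File.open with a block, so the spec above (which stubs File.new) would not pass against it unchanged.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical stand-in for DocumentDirectory, for illustration only.
class SketchDirectory
  attr_accessor :couch_doc_dir

  def initialize(path)
    @couch_doc_dir = path
  end

  # File.open with a block closes the file even if the write raises,
  # which File.new followed by an explicit close does not guarantee.
  # Returns the path that was written.
  def store_document(doc)
    path = File.join(couch_doc_dir, "#{doc['_id']}.json")
    File.open(path, 'w') { |file| file.write(doc.to_json) }
    path
  end
end

# Write a made-up document to a throwaway directory and round-trip it.
Dir.mktmpdir do |tmp|
  dir = SketchDirectory.new(tmp)
  path = dir.store_document({ '_id' => 'foo', 'title' => 'Example' })
  puts File.read(path)
  puts JSON.parse(File.read(path)) == { '_id' => 'foo', 'title' => 'Example' }
end
```

Note that an ID containing a slash (a design document such as _design/foo) would point into a subdirectory that does not exist yet, so this sketch only suits plain IDs.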
Putting the CouchDB store and the directory store together, I need to create an instance of each, iterate over the documents in the store, and expect to store the documents in the directory. In RSpec format:
    it "should be able to store all CouchDB documents on the filesystem" do
      store = mock("Store")
      store.stub!(:each).and_yield({'_id' => 'foo'})
      Store.stub!(:new).and_return(store)

      dir = mock("Document Directory")
      DocumentDirectory.stub!(:new).and_return(dir)

      dir.
        should_receive(:store_document).
        with({'_id' => 'foo'})

      CouchDesignDocs.dump("uri", "fixtures")
    end
To make that example pass, I add the following class method to CouchDesignDocs (I really need to change the name at this point):
    # Dump all documents located at <tt>db_uri</tt> into the directory
    # <tt>dir</tt>
    #
    def self.dump(db_uri, dir)
      store = Store.new(db_uri)
      dir = DocumentDirectory.new(dir)
      store.each do |doc|
        dir.store_document(doc)
      end
    end
For good measure, I reject design documents (good thing I mixed in Enumerable):
    it "should be able to store all CouchDB documents on the filesystem" do
      @store.stub!(:map).and_yield([{'_id' => '_design/foo'}])
      @dir.
        should_not_receive(:store_document)

      CouchDesignDocs.dump("uri", "fixtures")
    end
This example passes with this code:
    def self.dump(db_uri, dir)
      store = Store.new(db_uri)
      dir = DocumentDirectory.new(dir)
      store.
        map.
        reject { |doc| doc['_id'] =~ /^_design/ }.
        each { |doc| dir.store_document(doc) }
    end
I am not entirely thrilled with the added map in there. It adds no functionality and only serves to supply the reject with something that it can, uh, reject. It is effectively code solely to support the test, which is just icky. Still, I will not push the issue here. The code is functional, and the extra map should cost next to nothing: called without a block, it just returns an enumerator (in Ruby 1.8.7 and later).
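For comparison, here is a sketch of the same pipeline with the map dropped, which works on anything that includes Enumerable. ListStore, RecordingDirectory, and dump_without_map are hypothetical names used only here, and this has not been run against the gem. The spec would also need to change, since a bare mock that stubs only each has no reject of its own.

```ruby
# Hypothetical stand-in for Store: an in-memory list with Enumerable mixed in.
class ListStore
  include Enumerable

  def initialize(docs)
    @docs = docs
  end

  def each(&block)
    @docs.each(&block)
  end
end

# Hypothetical stand-in for DocumentDirectory that records what it is asked to store.
class RecordingDirectory
  attr_reader :stored

  def initialize
    @stored = []
  end

  def store_document(doc)
    @stored << doc
  end
end

# Same pipeline as dump, minus the map: reject is called on the store itself.
def dump_without_map(store, dir)
  store.
    reject { |doc| doc['_id'] =~ /^_design/ }.
    each { |doc| dir.store_document(doc) }
end

store = ListStore.new([{ '_id' => '_design/foo' }, { '_id' => 'bar' }])
dir = RecordingDirectory.new
dump_without_map(store, dir)
puts dir.stored.inspect
```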

I install my gem locally and add a rake task to my application:
    DB = "http://localhost:5984/eee"
    require 'restclient'

    namespace :couchdb do
      desc "Dump seed data from the database"
      task :dump_docs do
        CouchDesignDocs.dump(DB, "couch/seed")
      end
    end
Running that (with timing information because I am curious), I find:
    cstrom@jaynestown:~/repos/eee-code$ time rake couchdb:dump_docs
    (in /home/cstrom/repos/eee-code)

    real 0m4.886s
    user 0m2.404s
    sys 0m0.440s
That is not bad—less than 5 seconds to dump 1000+ documents.

Examining the filesystem, I see that the documents do, indeed, exist and that they contain JSON.

I do, however, note that I am missing the attachments (need to append ?attachments=true to my RestClient request). I may want to strip the CouchDB revision information from the dumped documents. I definitely want to test uploading the seed data. It may be time to rename the couch_design_docs gem since it does much more than design documents at this point.
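A rough sketch of the first two follow-ups, kept free of any network calls. Both helpers (document_url and strip_revision) are hypothetical and not part of couch_design_docs, the revision value is made up, and the ID is not URL-escaped here, which real IDs may require.

```ruby
# Hypothetical helpers; neither is part of couch_design_docs.

# Build the URL for a single document, asking CouchDB to inline its attachments.
def document_url(db_url, id)
  "#{db_url}/#{id}?attachments=true"
end

# Return a copy of the document without its revision, leaving the original intact.
def strip_revision(doc)
  doc.reject { |key, _value| key == '_rev' }
end

# Made-up document for illustration.
doc = { '_id' => 'foo', '_rev' => '1-abc', 'title' => 'Example' }
puts document_url('http://localhost:5984/eee', doc['_id'])
puts strip_revision(doc).inspect
```

Stripping the revision on the way out would let the seed files load into an empty database without carrying over revision values from the source database.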

These are all things I can do tomorrow.
