Monday, September 28, 2009

Prototyping CouchDB Views (Take 2)


I have CouchDB recipe documents that look like:
{
  "_id": "2008-07-10-choccupcake",
  "_rev": "1-1305119212",
  "prep_time": 10,
  "title": "Mini Chocolate Cupcakes with White Chocolate Chips",
  "published": true,
  //...

  "date": "2008-07-10",
  "type": "Recipe",
  //...
  "preparations": [
    {
      "brand": "",
      "quantity": 1,
      "unit": "cup",
      "order_number": 1,
      "description": "",
      "ingredient": {
        "name": "flour",
        "kind": "all-purpose"
      }
    },
    {
      "brand": "",
      "quantity": 0.5,
      "unit": "cup",
      "order_number": 2,
      "description": "",
      "ingredient": {
        "name": "sugar",
        "kind": "white, granulated"
      }
    },
    //...
  ],
  //...
}
With hundreds of recipes of this form, I would like a CouchDB view to give me a list of recipes by ingredient. Ultimately, I would like to generate a page like this (from the legacy Rails app):



I spiked this approach a while back. Looking back on that spike, I realize that I missed a couple of things: (1) the map function includes recipes that have not been published and (2) the reduce function does not deal with re-reduce (combining intermediate reduced steps).

The map that I used previously was:
function (doc) {
  for (var i in doc['preparations']) {
    var ingredient = doc['preparations'][i]['ingredient']['name'];
    var value = [doc['_id'], doc['title']];
    emit(ingredient, value);
  }
}
That map function reads: for each preparation instruction, pull out the ingredient name and the document ID/title. The former is the key for the map-reduce; the latter will be used to create the links on the web page. As I mentioned, this map function includes unpublished recipes. To exclude them, I need to add a simple conditional:
function (doc) {
  if (doc['published']) {
    for (var i in doc['preparations']) {
      var ingredient = doc['preparations'][i]['ingredient']['name'];
      var value = [doc['_id'], doc['title']];
      emit(ingredient, value);
    }
  }
}
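As a rough sketch, the map function can be exercised outside of CouchDB by stubbing `emit()`. The `rows` array, the stubbed `emit`, the name `map`, and the `hypothetical-draft` document are my own assumptions for illustration; the published document is trimmed from the recipe shown earlier.

```javascript
// Hypothetical harness: CouchDB normally supplies emit(); here rows are collected by hand.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// The map function from the post, given a name so it can be called directly.
function map(doc) {
  if (doc['published']) {
    for (var i in doc['preparations']) {
      var ingredient = doc['preparations'][i]['ingredient']['name'];
      var value = [doc['_id'], doc['title']];
      emit(ingredient, value);
    }
  }
}

// Trimmed copy of the recipe document shown earlier.
var published = {
  _id: '2008-07-10-choccupcake',
  title: 'Mini Chocolate Cupcakes with White Chocolate Chips',
  published: true,
  preparations: [
    { ingredient: { name: 'flour', kind: 'all-purpose' } },
    { ingredient: { name: 'sugar', kind: 'white, granulated' } }
  ]
};

// Made-up unpublished document; the conditional should skip it.
var draft = {
  _id: 'hypothetical-draft',
  title: 'Unpublished Draft',
  published: false,
  preparations: [{ ingredient: { name: 'flour', kind: 'all-purpose' } }]
};

map(published);
map(draft);

// Only the published recipe contributes rows, one per preparation.
console.log(JSON.stringify(rows, null, 2));
```

Run under Node, this should print one row per ingredient of the published recipe and nothing for the draft.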
That produces results like this for the ingredient "apples":
"apples":
["2002-09-18-gingerapple", "Ginger Apple Crisps"],
"apples":
["2003-03-11-pecan", "Pecan Apple Tart"],
"apples":
["2003-12-07-applesauce", "Applesauce"],
"apples":
["2004-11-25-apple", "Apple Pie"]
To reduce that to a list of recipe IDs/titles grouped by the "apples" ingredient, I had been using this function:
function(keys, values, rereduce) {
  return values;
}
This works for ingredients that are only in a few recipes (like "apples"):
"apples":[["2004-11-25-apple", "Apple Pie"],
["2003-12-07-applesauce", "Applesauce"],
["2003-03-11-pecan", "Pecan Apple Tart"],
["2002-09-18-gingerapple", "Ginger Apple Crisps"]]
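To make the grouping step concrete, here is a minimal in-memory sketch. I am assuming the view is queried with grouping by key (CouchDB's `group=true` query option); the post does not say how the view was queried. The helper `groupedReduce` is a hypothetical stand-in for what CouchDB does, not its actual implementation, and the row data comes from the output above plus one "butter" recipe from this post.

```javascript
// Rows in the shape the map function emits.
var rows = [
  { key: 'apples', value: ['2002-09-18-gingerapple', 'Ginger Apple Crisps'] },
  { key: 'apples', value: ['2003-03-11-pecan', 'Pecan Apple Tart'] },
  { key: 'apples', value: ['2003-12-07-applesauce', 'Applesauce'] },
  { key: 'apples', value: ['2004-11-25-apple', 'Apple Pie'] },
  { key: 'butter', value: ['2002-03-12-cinnamon_toast', 'Cinnamon Toast'] }
];

// The reduce function from the post, given a name so it can be passed around.
function reduce(keys, values, rereduce) {
  return values;
}

// Hypothetical stand-in for a grouped view query: bucket values by key,
// then run the reduce once per bucket (no rereduce for such small buckets).
function groupedReduce(rows, reduceFn) {
  var groups = {};
  rows.forEach(function (row) {
    (groups[row.key] = groups[row.key] || []).push(row.value);
  });
  var result = {};
  Object.keys(groups).forEach(function (key) {
    // keys is passed as null because this reduce ignores it.
    result[key] = reduceFn(null, groups[key], false);
  });
  return result;
}

var grouped = groupedReduce(rows, reduce);
console.log(JSON.stringify(grouped, null, 2));
```

The order of values within each group here simply follows the input order; CouchDB's own ordering may differ.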
For ingredients that are in lots of recipes, a single pass through the reduce function is not sufficient. Instead, CouchDB reduces the values in several batches and then calls the reduce function again with the intermediate results (arrays of arrays in this case) and the rereduce flag set. Since I am doing nothing different when CouchDB calls my reduce function with the rereduce flag set, I end up returning those intermediate arrays as-is, nested inside another array:
"butter": 
[[["2002-02-08-buffalo_chicken", "Buffalo Chicken Sandwich"],
["2002-02-13-mushroom_pasta", "Pasta with Mushroom Gruyere Sauce"],
["2002-02-17-sausage_pie", "Chicken Sausage Pot Pie"],
["2002-03-12-asparagus_omelet", "Asparagus Omelet"],
["2002-03-12-cinnamon_toast", "Cinnamon Toast"],
["2002-03-14-veal_cutlets", "Breaded Veal Scallopini"],
["2002-03-20-mushroom_pasta", "Mushroom Chicken Pasta"],
["2002-04-09-pasta_primavera", "Pasta Primavera"],
["2002-04-13-chicken", "Breaded Baked Chicken"],
["2002-04-15-cajun_shrimp", "Cajun Shrimp"],
["2002-04-30-chicken_stew", "Chicken Andouille Stew"],
["2002-05-19-crabcake", "Maryland Crab Cakes"],
["2002-05-20-eggs", "Scrambled Eggs with Spinach and Bacon"]],
[[["2003-04-14-sandwich", "Batter-Dipped Ham and Cheese Sandwich"],
["2003-04-25-chicken", "Mustard Seed Chicken with Ginger Orange Sauce"],
//...
To avoid this undesirable outcome, I need to handle re-reduces when CouchDB calls my reduce function a second time:
function(keys, values, rereduce) {
  if (rereduce) {
    var ret = [];
    for (var i=0; i<values.length; i++) {
      ret = ret.concat(values[i]);
    }
    return ret;
  }

  else {
    return values;
  }
}
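The difference between the two reduce functions can be sketched outside of CouchDB by simulating two passes by hand. The batch boundaries, the helper `runTwoPass`, and the names `naiveReduce` and `flatteningReduce` are assumptions for illustration; CouchDB decides the real batching internally.

```javascript
// The original reduce: ignores the rereduce flag.
function naiveReduce(keys, values, rereduce) {
  return values;
}

// The reduce that flattens intermediate results on rereduce.
function flatteningReduce(keys, values, rereduce) {
  if (rereduce) {
    var ret = [];
    for (var i = 0; i < values.length; i++) {
      ret = ret.concat(values[i]);
    }
    return ret;
  }
  else {
    return values;
  }
}

// Two made-up batches of mapped values for "butter" (recipes taken from the post).
var batchA = [
  ['2002-02-08-buffalo_chicken', 'Buffalo Chicken Sandwich'],
  ['2002-03-12-cinnamon_toast', 'Cinnamon Toast']
];
var batchB = [
  ['2003-04-14-sandwich', 'Batter-Dipped Ham and Cheese Sandwich']
];

// Hypothetical two-pass run: reduce each batch, then rereduce the partial results.
// keys is passed as null because neither function uses it.
function runTwoPass(reduceFn) {
  var partials = [
    reduceFn(null, batchA, false),
    reduceFn(null, batchB, false)
  ];
  return reduceFn(null, partials, true);
}

var nested = runTwoPass(naiveReduce);    // one entry per batch, each an array of pairs
var flat = runTwoPass(flatteningReduce); // one entry per recipe

console.log(JSON.stringify(nested));
console.log(JSON.stringify(flat));
```

With these inputs the naive version keeps one nested entry per batch, while the flattening version returns a single list of ID/title pairs.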
This produces the results that I desire, a flat array of arrays:
"butter":
[["2002-02-08-buffalo_chicken", "Buffalo Chicken Sandwich"],
["2002-02-13-mushroom_pasta", "Pasta with Mushroom Gruyere Sauce"],
["2002-02-17-sausage_pie", "Chicken Sausage Pot Pie"],
["2002-03-12-asparagus_omelet", "Asparagus Omelet"],
["2002-03-12-cinnamon_toast", "Cinnamon Toast"],
["2002-03-14-veal_cutlets", "Breaded Veal Scallopini"],
["2002-03-20-mushroom_pasta", "Mushroom Chicken Pasta"],
["2002-04-09-pasta_primavera", "Pasta Primavera"],
["2002-04-13-chicken", "Breaded Baked Chicken"],
["2002-04-15-cajun_shrimp", "Cajun Shrimp"],
["2002-04-30-chicken_stew", "Chicken Andouille Stew"],
["2002-05-19-crabcake", "Maryland Crab Cakes"],
["2002-05-20-eggs", "Scrambled Eggs with Spinach and Bacon"],
["2003-04-14-sandwich", "Batter-Dipped Ham and Cheese Sandwich"],
["2003-04-25-chicken", "Mustard Seed Chicken with Ginger Orange Sauce"],
...
I will stop there for the day. Now that I know the format of the output (and that it will work as I desire), I can drive the implementation of the ingredient index tomorrow.

2 comments:

  1. Your new reduce function collects values in the reduce. The reduce function must actually reduce its input values. Read about it here: http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#reduced_value_sizes

  2. Interesting. I must be under the smallish values / large data set threshold mentioned in that link.

    I only have 500 recipe documents in the DB, which map to 5000 ingredients, which are what is actually getting reduced. I was doing all of this work in Futon / temporary views and there was no noticeable lag when updating the view (~2 seconds).

    Definitely an O(N) reduce operation. I wonder what it would take to see significant performance degradation...
