japh(r) by Chris Strom: Hacking out a Mobi Table of Contents from Epub

Thursday, August 11, 2011

Hacking out a Mobi Table of Contents from Epub


My current git-scribe state:

PDF: decent output, hack implementation (mine)
epub: decent output, decent implementation
mobi: sigh

In a day or so, I need to revisit my PDF implementation, but tonight I hope to finalize the mobi output.

The latest SPDY Book mobis have been generated with asciidoc's HTML output plus additional git-scribe decoration. There are two problems, however: (1) the start page is not correct and (2) the implementation is hackish. I am beginning to suspect (2) is just the nature of mobi, but that may just be me.

Anyhow, I was finally able to solve (1) by taking the epub version of The SPDY Book, adding a toc.html file to it, and feeding the result to the kindlegen command. This appeals to me because it shortens the support chain (re-use the same epub plus a file or two), which should also eliminate some of the hacks that I have added to git-scribe so far.

The downside is that I am straying even further from git-scribe's latest release. But I think I have to. I think that gives me the best shot at producing a solid second book. So...

I change the do_epub method to keep the working epub directory around via the -k argument:

def do_epub
  return true if @done['epub']

  info "GENERATING EPUB"

  generate_docinfo
  # TODO: look for custom stylesheets
  # -k tells a2x to keep its working files (book.epub.d) around
  cmd = "#{a2x_wss('epub')} -a docinfo -k -v #{BOOK_FILE}"
  return false unless ex(cmd)

  @done['epub'] = true
end

I then make do_epub a prerequisite of do_mobi and include a call to a new method that will add the toc.html file:

def do_mobi
  return true if @done['mobi']

  do_epub

  info "GENERATING MOBI"
  add_epub_toc
  zip_epub_with_toc

  cmd = "kindlegen -verbose book_with_toc.epub -o book.mobi"
  return false unless ex(cmd)

  @done['mobi'] = true
end
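
The zip_epub_with_toc method is not shown in this post. A minimal sketch of what such a step might look like, assuming the Info-ZIP zip command is on the PATH and that a2x left its working files in book.epub.d. The epub container wants the mimetype file first and uncompressed, so it gets its own zip call. The names epub_zip_commands and zip_epub_with_toc_sketch are hypothetical, not git-scribe code:

```ruby
# Hypothetical stand-in for zip_epub_with_toc (the real method is not shown).
# Builds the two zip invocations needed to repack an epub working directory.
def epub_zip_commands(target = '../book_with_toc.epub')
  [
    # mimetype goes in first, stored uncompressed (-0), without extra
    # file attributes (-X)
    "zip -X0 #{target} mimetype",
    # everything else: recurse (-r), compress (-9), no directory entries (-D)
    "zip -Xr9D #{target} META-INF OEBPS"
  ]
end

# Assumes the working directory name used elsewhere in this post.
def zip_epub_with_toc_sketch(epub_dir = 'book.epub.d')
  Dir.chdir(epub_dir) do
    target = '../book_with_toc.epub'
    # zip updates an existing archive in place, so start from scratch
    File.delete(target) if File.exist?(target)
    epub_zip_commands(target).all? { |cmd| system(cmd) }
  end
end
```

Keeping the command strings in their own method makes them easy to inspect before anything touches the filesystem.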

The add_epub_toc method is something of a hack:

def add_epub_toc
  Dir.chdir('book.epub.d/OEBPS') do
    ncx = File.read('toc.ncx')

    # Chapter-level entries are picked out by their indentation in toc.ncx
    titles = ncx.scan(%r{^          <ncx:text>(.+?)</ncx:text>}m).flatten
    urls = ncx.scan(%r{^        <ncx:content src="(.+?)"/>}m).flatten

    File.open("toc.html", 'w') do |f|
      f.puts('<?xml version="1.0" encoding="UTF-8"?>')
      f.puts('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Table of Contents</title></head><body>')
      # One linked entry per chapter: entry[0] is the title, entry[1] the URL
      titles.zip(urls).each do |entry|
        f.puts '<div>'
        f.puts '<span class="chapter">'
        f.puts "<a href=\"#{entry[1]}\">"
        f.puts entry[0]
        f.puts '</a></span>'
        f.puts '</div>'
      end
      f.puts('</body></html>')
    end
  end
end

First up, I slurp up the entire NCX (Navigation Control file for XML) file from the epub:

ncx = File.read('toc.ncx')

Next I scan the ncx for chapter titles. For now, I identify chapter titles by indentation level:

titles = ncx.scan(%r{^          <ncx:text>(.+?)</ncx:text>}m).flatten

That ain't exactly pretty, but it'll do for tonight. After doing the same for URLs, I zip the two together and then iterate over them to build up the contents of toc.html:

titles.zip(urls).each do |entry|
  f.puts '<div>'
  f.puts '<span class="chapter">'
  f.puts "<a href=\"#{entry[1]}\">"
  f.puts entry[0]
  f.puts '</a></span>'
  f.puts '</div>'
end
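
Matching on indentation breaks as soon as the NCX is formatted differently. One possible alternative, sketched here with REXML from Ruby's standard distribution, is to ask for the navPoint elements that sit directly under navMap. This assumes the toc.ncx uses the standard NCX namespace and that chapters are the top-level navPoints; top_level_entries is a hypothetical helper name, not part of git-scribe:

```ruby
require 'rexml/document'

# Standard NCX namespace; the prefix used in this hash is arbitrary.
NCX_NS = { 'ncx' => 'http://www.daisy.org/z3986/2005/ncx/' }

# Returns [title, url] pairs for the navPoints directly under navMap,
# ignoring any nested (section-level) navPoints.
def top_level_entries(ncx_xml)
  doc = REXML::Document.new(ncx_xml)
  points = REXML::XPath.match(doc.root, 'ncx:navMap/ncx:navPoint', NCX_NS)
  points.map do |point|
    title = REXML::XPath.first(point, 'ncx:navLabel/ncx:text', NCX_NS).text
    src = REXML::XPath.first(point, 'ncx:content', NCX_NS).attributes['src']
    [title.strip, src]
  end
end
```

The result has the same shape as titles.zip(urls), so the loop above would not need to change. One difference: REXML hands back unescaped text, so titles containing & or < would need escaping (CGI.escapeHTML, for example) before being written into toc.html.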

A hack, maybe, but it seems to work. The toc.html:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Table of Contents</title></head><body>
<div>
<span class="chapter">
<a href="pr01.html">
Copyright
</a></span>
</div>
<div>
<span class="chapter">
<a href="pr02.html">
History
</a></span>
</div>
<div>
<span class="chapter">
<a href="pr03.html">
Dedications and Acknowledgments
</a></span>
</div>
<div>
<span class="chapter">
<a href="pr04.html">
Introduction
</a></span>
...

That looks about right. I call it a night there. I will pick back up by cleaning this up a bit and then adding the toc.html to the epub manifest. Then, I'll actually look at it on the Kindle.
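
For the manifest step, a rough sketch of one way it might go, again with REXML. It assumes the package file uses the standard OPF namespace and has a manifest element directly under the root; the helper name add_toc_to_manifest and the id value html-toc are made up for illustration:

```ruby
require 'rexml/document'

# Standard OPF namespace; the prefix used in this hash is arbitrary.
OPF_NS = { 'opf' => 'http://www.idpf.org/2007/opf' }

# Returns the OPF XML with an <item> for the TOC page added to the manifest.
# Does nothing if an item with the same href is already listed.
def add_toc_to_manifest(opf_xml, href = 'toc.html', id = 'html-toc')
  doc = REXML::Document.new(opf_xml)
  manifest = REXML::XPath.first(doc.root, 'opf:manifest', OPF_NS)
  existing = REXML::XPath.first(manifest, "opf:item[@href='#{href}']", OPF_NS)
  unless existing
    # Reuse the manifest's own prefix (if any) so the new item lands
    # in the same namespace as its siblings.
    name = manifest.prefix.empty? ? 'item' : "#{manifest.prefix}:item"
    manifest.add_element(name,
                         'id' => id,
                         'href' => href,
                         'media-type' => 'application/xhtml+xml')
  end
  doc.to_s
end
```

Working on strings rather than files keeps the helper easy to try out in irb; the caller would read the package file from the epub working directory and write the result back.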

