I ended last night with no fewer than 3 Tables of Contents in the mobi version of SPDY Book as generated by git-scribe. Today, I hope to get at least one of them in shape to serve as the TOC that is recognized by the Kindle.
To be fair, one of the TOCs generated, the NCX (Navigation Control file for XML applications), is generated and read correctly by the Kindle. The Kindle does not use this file when you "Go to Table of Contents". Rather, it uses this to draw the chapter markers in the progress meter at the bottom of the display.
The TOC that readers actually see is stored in a file named toc.html (the filename is described in a separate book.opf file). Git-scribe 0.0.9 is fairly adept at generating this file, although preface material (introduction, copyright notice, acknowledgements) confuses the chapter numbering. My switch to a2x for generating the HTML has further confused git-scribe's toc.html generation: the chapters are already numbered, so I end up with things like "Chapter 8: 4. SPDY Push".
The a2x command produces a pretty nice TOC, albeit directly embedded in the book HTML. So I extract that TOC out of the book HTML and into a toc.html file:

    def extract_toc
      # Slurp the whole book, then write it back without the embedded TOC.
      content = File.read("book.html")
      File.open("book.html", 'w') do |f|
        f.write content.sub(%r|<div class="toc">.+?</dl></div>|m, '')
      end

      # Regexp.last_match is a MatchData; [0] is the matched TOC markup.
      # Point the anchors at book.html, since the TOC now lives in its own file.
      toc = Regexp.last_match[0].
        gsub(/href="#/, 'href="book.html#')

      File.open("toc.html", 'w') do |f|
        f.puts('<?xml version="1.0" encoding="UTF-8"?>')
        f.puts('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">')
        f.puts('<html xmlns="http://www.w3.org/1999/xhtml">')
        f.puts('<head><title>Table of Contents</title></head><body>')
        f.puts toc
        f.puts('</body></html>')
      end
    end

The bit at the beginning about slurping the entire book into memory and replacing the TOC, via regular expression, with nothing is a bit on the oogy side:

    content = File.read("book.html")
    File.open("book.html", 'w') do |f|
      f.write content.sub(%r|<div class="toc">.+?</dl></div>|m, '')
    end

I live with this for now because, for my 150+ page SPDY Book, I am not seeing a significant performance hit. The actual contents of the TOC are now available from Regexp.last_match (the matched text itself is Regexp.last_match[0]). I need to adjust the location of the URL now that the TOC will be in a separate file from the actual book:

    toc = Regexp.last_match[0].
      gsub(/href="#/, 'href="book.html#')

The rest is just a matter of writing to the toc.html file.
Sure, this could use some improving, but I think it is already a step in the right direction for git-scribe. Instead of scanning the entire HTML document for H3 tags, I am now using the TOC as generated by a2x.
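As one possible improvement, the extraction can be separated from the file handling so that it is easy to try on a small string. This is only a sketch: split_toc, TOC_PATTERN and the book_file parameter are names made up for illustration and are not part of git-scribe or a2x. It uses an explicit MatchData rather than Regexp.last_match, and it leaves the HTML untouched when no TOC is found.

```ruby
# Hypothetical helper (not part of git-scribe): split the embedded TOC out of
# the book HTML without touching the filesystem.
TOC_PATTERN = %r|<div class="toc">.+?</dl></div>|m

# Returns [book_html_without_toc, toc_html_or_nil].
def split_toc(html, book_file = "book.html")
  match = TOC_PATTERN.match(html)
  return [html, nil] unless match # no TOC found: leave the book untouched

  # Anchors such as href="#ch1" must point into the book file once the TOC
  # lives in a file of its own.
  toc = match[0].gsub('href="#', %Q|href="#{book_file}#|)
  [match.pre_match + match.post_match, toc]
end

sample = <<~HTML
  <body><div class="toc"><dl><dt><a href="#ch1">1. Intro</a></dt></dl></div>
  <h2 id="ch1">1. Intro</h2></body>
HTML

book, toc = split_toc(sample)
puts toc
puts book
```

The caller would still read book.html, write the first element back, and wrap the second in the XHTML boilerplate before writing toc.html.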
Unfortunately, I am not done slurping the entire book into memory. There is a bit of clean-up necessary to remove whitespace in LI tags (to get bullet lists to align properly) and to more properly identify header tags:

    def clean_html(file)
      content = File.read(file)
      File.open(file, 'w') do |f|
        f.write content.
          gsub(%r"<li(.*?)>\s*(.+?)\s*</li>"m, '<li\1>\2</li>').
          # \d is the heading level, so that <h2 class="title"> matches
          gsub(%r'<h(\d class="title".*?)><a (id=".+?")></a>'m, '<h\1 \2>')
      end
    end

The first gsub does multi-line matches to remove whitespace inside LI tags:

    # Source:
    <li class="listitem">
      SPDY-ize your own sites—either by writing your own SPDY parser or using one of the frameworks discussed.
    </li>

    # Result:
    <li class="listitem">SPDY-ize your own sites—either by writing your own SPDY parser or using one of the frameworks discussed.</li>

The second gsub is just a workaround for a Kindle quirk:

    # Source:
    <h2 class="title">
      <a id="chapter_your_first_spdy_app"></a>
      Chapter 2. Your First SPDY App
    </h2>

    # Result:
    <h2 class="title" id="chapter_your_first_spdy_app">Chapter 2. Your First SPDY App</h2>

Inline links (e.g. <a href="book.html#chapter_your_first_spdy_app">) work with either format on the Kindle, but with the former version the formatting is messed up: the H3 text displays like normal text.
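To experiment with the two clean-up substitutions without rewriting the book file each time, they can be applied to a plain string. The following is a rough sketch, not the git-scribe code: clean_markup is a made-up name, and the heading pattern rests on two assumptions, namely that headings look like <h2 class="title"> (hence the \d for the level) and that there may be whitespace around the empty anchor.

```ruby
# Hypothetical, string-only version of the two clean-up substitutions.
def clean_markup(html)
  # Trim whitespace just inside <li>...</li>, keeping any attributes.
  trimmed = html.gsub(%r{<li(.*?)>\s*(.+?)\s*</li>}m, '<li\1>\2</li>')

  # Move the id from the empty anchor onto the heading itself.
  # Assumption: tolerate whitespace before and after the anchor.
  trimmed.gsub(%r{<h(\d class="title".*?)>\s*<a (id=".+?")></a>\s*}m, '<h\1 \2>')
end

sample = <<~HTML
  <h2 class="title">
    <a id="chapter_your_first_spdy_app"></a>
    Chapter 2. Your First SPDY App</h2>
  <li class="listitem">
    SPDY-ize your own sites.
  </li>
HTML

puts clean_markup(sample)
```

Keeping the substitutions in a function that takes and returns a string makes it simple to check them against a handful of snippets before running them over the whole book.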
With that, I am more or less satisfied with my mobi formatted version of SPDY Book. There are a few tweaks that I still might like to make, but I think tomorrow I will revisit some of the hacks I needed for the PDF version of the book. Armed with what I know now, I think I can come up with a better approach.