expLog

HTML to Org

I've been trying to maintain a book list for the past several years, in a variety of ways.

For 2015, I'd duct-taped together some scripts on a private server that watched a Dropbox folder for changes, processed the raw Markdown files in it and stitched them together. Adding a book was as simple as dropping a Markdown file into the right folder, and it would show up on my site.

With my recent move to org-mode and GitHub Pages, I basically copy-pasted the generated HTML into a #+BEGIN_HTML ... #+END_HTML block in the books.org document, and that worked reasonably well. However, I really wanted to normalize the contents to make them easier to parse and explore, so I ended up writing some CHICKEN Scheme to convert HTML to Org.


I was pleasantly surprised by how easy the html-parser API made it to handle HTML. I initially misunderstood what the seed was supposed to do (it is the accumulator that every callback receives and returns), but it was a breeze once I cleared that up.
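For anyone else tripped up by the seed: as far as I understand it, it is just an accumulator that each callback takes and hands back, fold style. The sketch below imitates that in plain Scheme without html-parser. `handle-event`, `run-events` and the event list are made up for illustration and are not part of the egg's API.

```scheme
;; Hypothetical stand-in for the parser callbacks: each event is a
;; (kind payload) list, and the handler returns the new seed.
(define (handle-event event seed)
  (case (car event)
    ((start) (if (eq? (cadr event) 'em) (cons "/" seed) seed))
    ((text)  (cons (cadr event) seed))
    ((end)   (if (eq? (cadr event) 'em) (cons "/" seed) seed))
    (else seed)))

;; Thread the seed through every event, left to right.
(define (run-events events seed)
  (if (null? events)
      seed
      (run-events (cdr events) (handle-event (car events) seed))))

;; Assumed event stream for <p>Read <em>SICP</em></p>.
(define events
  '((start p) (text "Read ") (start em) (text "SICP") (end em) (end p)))

;; Fragments were consed on newest-first, so reverse before joining.
(define result
  (apply string-append (reverse (run-events events '()))))
```

Here `result` ends up as "Read /SICP/". Consing each fragment onto the seed keeps every step cheap, at the cost of one reverse at the end.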

(use html-parser)
(use srfi-1)
(use srfi-13)
(use utils)

;; Read all of stdin into a single string (despite the name, this is the contents, not a file)
(define input-file
  (let loop ((line (read-line))
             (contents '()))
    (if (not (eof-object? line))
        (loop (read-line) (cons line contents))
        (string-intersperse (reverse contents) "\n"))))

;; Counter used to track how many stars the current list level needs
(define (make-counter initval)
  (let ((counter initval))
    (lambda (action)
      (case action
        ((get) counter)
        ((inc) (set! counter (+ counter 1)) counter)
        ((dec) (set! counter (- counter 1)) counter)))))

;; The actual parser to convert tags to org markup
(define parse
  (let ((list-counter (make-counter 2)))
    (make-html-parser
     'start: (lambda (tag attrs seed virtual?)
               (case tag
                 ((h3) (cons "** " seed))
                 ((ul) (list-counter 'inc) seed)
                 ((li) (cons " " (cons (string-concatenate (make-list (list-counter 'get) "*"))
                                       seed)))
                 ((a) (cons (string-concatenate `("[[" ,(cadr (assoc 'href attrs)) "]["))
                            seed))
                 ((em) (cons "/" seed))
                 ((hr) (cons "\n-----\n" seed))
                 ((sup) (cons "^" seed))
                 ((blockquote) (cons "\n#+BEGIN_QUOTE\n" seed))
                 ((strong) (cons "*" seed))
                 ((p) seed)
                 (else seed)))
     'text: (lambda (text seed)
              (cons text seed))
     'end: (lambda (tag attrs parent-seed seed virtual?)
             (case tag
               ((ul) (list-counter 'dec) seed)
               ((a) (cons "]]" seed))
               ((p) (cons "\n" seed))
               ((strong) (cons "*" seed))
               ((em) (cons "/" seed))
               ((blockquote) (cons "\n#+END_QUOTE\n" seed))
               (else seed))))))

;; Run the parser on the input, reverse because it cons's the results
(display
 (string-concatenate (reverse (parse '() input-file))))
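The counter is the only stateful part, so here is a minimal sketch of the mapping it ends up producing, assuming (as in the parser above) that book titles are level-2 headings and each enclosing ul adds one star. `li-prefix` is a hypothetical helper, not something the script defines.

```scheme
;; Hypothetical helper: Org prefix for a list item nested `depth`
;; levels of <ul> deep, given that h3 titles sit at level 2.
(define (li-prefix depth)
  (string-append (make-string (+ 2 depth) #\*) " "))

;; An item in a top-level list becomes a level-3 heading,
;; and an item in a nested list becomes a level-4 heading.
(define top-level-item (string-append (li-prefix 1) "Notes"))
(define nested-item    (string-append (li-prefix 2) "Quotes"))
```

Passing the depth in explicitly would avoid the mutable counter, but the parser callbacks only see the seed, so a closure is the path of least resistance in a throwaway script.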

At the other end of the spectrum, I also spent a non-trivial amount of time today typing up a handwritten book list I'd kept in a notebook from 2012 to 2013. Sadly, I didn’t have the energy to type out my notes as well and decided to record just the titles.
