HTML to Org

I've been attempting to maintain a books list for the past several years in various ways.

For 2015, I'd duct-taped together some scripts to run on a private server that would watch a Dropbox folder for changes, process the raw markdown files in there, and stitch them together. Adding a book was as simple as adding a markdown text file to the right folder, and it would show up on my site.

With my recent move to org-mode and GitHub Pages, I basically copy-pasted the generated HTML into a #+BEGIN_HTML ... #+END_HTML section in the books.org document, and that worked reasonably well. However, I really wanted to normalize the contents to make them easier to parse and explore, so I ended up writing some Chicken Scheme to convert HTML to Org.


I was pleasantly surprised by how easy the html-parser API made it to handle HTML. I initially misunderstood what the seed was supposed to do (it is the accumulator that every callback receives and returns), but it was a breeze after I cleared that up.

(use html-parser)
(use srfi-1)
(use srfi-13)
(use utils)

;; Read all of stdin into one string (despite the name, this holds the
;; contents rather than a file)
(define input-file
  (let loop ((line (read-line))
	     (contents '()))
    (if (not (eof-object? line))
	(loop (read-line) (cons line contents))
	(string-intersperse (reverse contents) "\n"))))

;; Counter closure used to track list nesting depth; responds to
;; 'get, 'inc and 'dec
(define (make-counter initval)
  (let ((counter initval))
    (lambda (action)
      (case action
	((get) counter)
	((inc) (set! counter (+ counter 1)) counter)
	((dec) (set! counter (- counter 1)) counter)))))

;; The actual parser to convert tags to org markup
(define parse
  (let ((list-counter (make-counter 2)))
    (make-html-parser
     'start: (lambda (tag attrs seed virtual?)
	       (case tag
		 ((h3) (cons "** " seed))
		 ((ul) (list-counter 'inc) seed)
		 ((li) (cons " " (cons (string-concatenate (make-list (list-counter 'get) "*"))
				       seed)))
		 ((a) (cons (string-concatenate `("[[" ,(cadr (assoc 'href attrs)) "]["))
			    seed))
		 ((em) (cons "/" seed))
		 ((hr) (cons "\n-----\n" seed))
		 ((sup) (cons "^" seed))
		 ((blockquote) (cons "\n#+BEGIN_QUOTE\n" seed))
		 ((strong) (cons "*" seed))
		 ((p) seed)
		 (else seed)))
     'text: (lambda (text seed)
	      (cons text seed))
     'end: (lambda (tag attrs parent-seed seed virtual?)
	     (case tag
	       ((ul) (list-counter 'dec) seed)
	       ((a) (cons "]]" seed))
	       ((p) (cons "\n" seed))
	       ((strong) (cons "*" seed))
	       ((em) (cons "/" seed))
	       ((blockquote) (cons "\n#+END_QUOTE\n" seed))
	       (else seed))))))

;; Run the parser on the input; reverse the result because each callback
;; conses its output onto the front of the seed
(display
 (string-concatenate (reverse (parse '() input-file))))
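The seed handling in the listing above can be sketched without html-parser at all. This is only an illustration of the idea, not how the library works internally: the event list, `handle`, and `run` are hypothetical names made up for this example, and the events are hand-written stand-ins for what a parser might report for `<p>Hello <em>world</em></p>`.

```scheme
;; Hypothetical, hand-written event stream (an assumption for this sketch),
;; standing in for: <p>Hello <em>world</em></p>
(define events
  '((start p) (text "Hello ") (start em) (text "world") (end em) (end p)))

;; Each handler takes the current seed and returns the next one.
;; Here the seed is a list of output strings, newest first.
(define (handle event seed)
  (let ((kind (car event))
        (value (cadr event)))
    (case kind
      ((start) (if (eq? value 'em) (cons "/" seed) seed))
      ((text)  (cons value seed))
      ((end)   (case value
                 ((em) (cons "/" seed))
                 ((p)  (cons "\n" seed))
                 (else seed)))
      (else seed))))

;; Thread the seed through every event, starting from whatever seed is given.
(define (run events seed)
  (if (null? events)
      seed
      (run (cdr events) (handle (car events) seed))))

;; The seed was built newest-first, so reverse it before joining.
(display (apply string-append (reverse (run events '()))))
```

Starting from the empty list, every handler only ever adds to the front of the seed, which is why the final step reverses it, just as the real converter does.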

At the other end of the spectrum, I also spent a non-trivial amount of time today typing up a book list I'd kept by hand in a notebook from 2012 to 2013. Sadly, I didn't have the energy to type out the notes I had and decided to just record the titles.