In July and August I attended the Open Knowledge Festival and Wikimania. At both events I had many interesting discussions around open source tools for open access scholarly publishing, and I was part of a panel on that topic at Wikimania last Sunday. Some of my thoughts were summarized in a blog post a few weeks ago (Build Roads not Stagecoaches). Today I am happy to announce the first public release of a tool that hopefully contributes to making publishing of open content a bit easier.

LEGO Researchers are excited that they don’t have to use Microsoft Word for manuscript writing anymore.

Rakali is a Ruby gem that acts as a wrapper for the Pandoc universal document converter. Pandoc is a wonderful tool for converting documents between file formats, and it supports many formats and features important for scholarly publishing. Pandoc 1.13 was released last Friday, and one of the most exciting new features is a reader for Microsoft Word (docx) documents. Pandoc has supported conversion to docx for a while, but now you can take the most popular file format for writing scholarly documents and turn your docx files into HTML, PDF, LaTeX, markdown, or a number of other formats. This makes it much easier to collaborate and to use docx in Pandoc-based scholarly publishing workflows. A good example is arXiv, which doesn’t support docx for text submissions. Instead of being turned into PDF, the manuscript can now be converted to LaTeX, the preferred file format at arXiv, before submission.
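To make the arXiv example concrete, here is a minimal sketch of calling Pandoc from Ruby to turn a docx manuscript into LaTeX. It is not how Rakali does it internally. The method names are made up for illustration, and the sketch assumes a `pandoc` executable (1.13 or later, for the docx reader) on the PATH.

```ruby
# Build the pandoc argument list for converting a Word manuscript to LaTeX.
# Hypothetical helper; -f/-t select the reader and writer, -s asks for a
# standalone document, -o names the output file.
def docx_to_latex_command(input_path)
  output_path = input_path.sub(/\.docx\z/, ".tex")
  ["pandoc", input_path, "-f", "docx", "-t", "latex", "-s", "-o", output_path]
end

# Run the conversion. Returns true when pandoc exits successfully,
# false on a non-zero exit, and nil if pandoc could not be started.
def convert_docx_to_latex(input_path)
  system(*docx_to_latex_command(input_path))
end
```

Calling `convert_docx_to_latex("manuscript.docx")` would then write `manuscript.tex` next to the original file.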

I built Rakali to make it easier to use Pandoc to convert large numbers of documents in an automated way:

  • bulk conversion of all files in a folder with a specific extension, e.g. md
  • input via a configuration file in YAML format instead of via the command line
  • validation of documents via JSON Schema, using the json-schema Ruby gem
  • logging via stdout and stderr
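The first two points can be sketched in a few lines of Ruby. This is an illustration of the idea, not Rakali's code. The configuration keys (`from`, `to`, `folder`, `format`) and the `bulk_commands` method are assumptions made up for this example, and the script only prints the commands it would run.

```ruby
require "yaml"

# Hypothetical configuration; the keys are illustrative and not
# necessarily the ones Rakali itself reads.
CONFIG_YAML = <<~YAML
  from:
    folder: manuscripts
    format: md
  to:
    format: html
YAML

# Return one pandoc argument list per file in the configured folder
# that has the configured extension. The input format is left for
# pandoc to infer from the file extension.
def bulk_commands(config)
  folder     = config.fetch("from").fetch("folder")
  source_ext = config.fetch("from").fetch("format")
  target     = config.fetch("to").fetch("format")

  Dir.glob(File.join(folder, "*.#{source_ext}")).sort.map do |path|
    output = path.sub(/\.#{Regexp.escape(source_ext)}\z/, ".#{target}")
    ["pandoc", path, "-t", target, "-o", output]
  end
end

config = YAML.safe_load(CONFIG_YAML)

# Dry run: report each command on stdout instead of executing it.
bulk_commands(config).each do |command|
  $stdout.puts "Would run: #{command.join(' ')}"
end
```

Replacing the dry run with `system(*command)` and writing failures to `$stderr` would give the stdout/stderr split described in the last point.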

One interesting way to use Rakali and Pandoc is as part of a continuous publishing workflow built on git and GitHub. When something is pushed to the repository, a continuous integration tool automatically converts all files in a folder, and the continuous integration run exits with an error when one of the files doesn’t validate. Look into the Rakali repo for an example.

The most interesting aspect of Rakali is probably validation via JSON Schema. File conversion with Pandoc is a two-step process: the input is first read into an internal representation of the document called the abstract syntax tree (AST), which is then written out in the target format. Pandoc makes the AST accessible in JSON format, which makes it straightforward to manipulate a document before conversion into the target format, using so-called JSON filters.
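To give a feel for what a JSON filter does, here is a small Ruby sketch that walks an AST and upper-cases every piece of text. The sample document is hand-written and assumes the JSON layout used by Pandoc around version 1.13 (a two-element array of metadata and blocks, with `t` and `c` keys on each node); other Pandoc versions may use a different layout. The `upcase_strings` method is a made-up name for this example.

```ruby
require "json"

# Hand-written sample in the assumed shape of `pandoc -t json` output:
# [metadata, blocks], where each node has a type "t" and contents "c".
SAMPLE_AST = <<~JSON
  [{"unMeta": {}},
   [{"t": "Para", "c": [{"t": "Str", "c": "hello"},
                        {"t": "Space", "c": []},
                        {"t": "Str", "c": "world"}]}]]
JSON

# Recursively walk the tree and return a copy in which the text of
# every Str node is upper-cased. All other nodes are copied unchanged.
def upcase_strings(node)
  case node
  when Array
    node.map { |child| upcase_strings(child) }
  when Hash
    if node["t"] == "Str"
      node.merge("c" => node["c"].upcase)
    else
      node.transform_values { |value| upcase_strings(value) }
    end
  else
    node
  end
end

ast = JSON.parse(SAMPLE_AST)
filtered = upcase_strings(ast)
puts JSON.generate(filtered)
```

A real filter would read the AST from STDIN instead of a constant and sit between two Pandoc calls, one writing JSON with `-t json` and one reading it with `-f json`.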

Validation of XML documents using DTDs, RELAX NG and other standards has of course been around for a long time, but validation of JSON documents is still relatively new. Since many Pandoc document conversion workflows don’t involve any XML, I thought it would make more sense to validate against the AST, and we can use JSON Schema for that. I have started a GitHub repository with schemata for the Pandoc AST, and hope to evolve them over time using Rakali as a tool. An example log output (from the Rakali test suite, stopping file conversion because title and layout metadata are missing) looks like this:

Validation Error: The property '#/0/unMeta' did not contain a required property of 'title' in schema 9b6d454d-e609-537b-b761-9599b6c01072# for file empty.md
Validation Error: The property '#/0/unMeta' did not contain a required property of 'layout' in schema 9b6d454d-e609-537b-b761-9599b6c01072# for file empty.md
Fatal: Conversion of file empty.md failed.
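The check behind these messages can be approximated in plain Ruby. The sketch below is a hand-rolled stand-in, not the json-schema gem that Rakali uses. It reads only the `required` list from a schema fragment, where a real validator would evaluate the whole schema. The schema, the method name and the message wording are all assumptions for illustration.

```ruby
require "json"

# Hypothetical schema fragment in JSON Schema draft 4 style: the first
# element of the AST array must have `title` and `layout` in `unMeta`.
SCHEMA = {
  "type" => "array",
  "items" => [
    { "type" => "object",
      "properties" => {
        "unMeta" => { "type" => "object", "required" => %w[title layout] }
      } }
  ]
}.freeze

# Return one error message per required metadata key that the AST lacks.
# Only the `required` keyword is honoured; everything else is ignored.
def missing_metadata_errors(ast, schema, filename)
  required = schema.dig("items", 0, "properties", "unMeta", "required") || []
  metadata = ast.fetch(0).fetch("unMeta", {})
  (required - metadata.keys).map do |key|
    "Validation Error: '#/0/unMeta' is missing required property '#{key}' for file #{filename}"
  end
end

# A document without any metadata, as in the empty.md test case.
ast = JSON.parse('[{"unMeta": {}}, []]')
errors = missing_metadata_errors(ast, SCHEMA, "empty.md")

# Report problems on stderr and flag the file as failed.
errors.each { |message| warn message }
warn "Fatal: Conversion of file empty.md failed." unless errors.empty?
```

In a conversion script, a non-empty `errors` list would be the point at which to skip the Pandoc call and exit with a non-zero status, so that a continuous integration run fails.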

As I have argued before, the challenge in building open source tools for science is to avoid duplicating the work of others, and to integrate well with existing tools by focussing on one aspect and doing it well. It also helps to think about infrastructure (the roads) instead of only focussing on the user-facing aspects. There are obviously many document conversion tools out there, but Pandoc is certainly one of the oldest and most established ones for scholarly content. Rakali therefore builds on top of Pandoc and tries to play well with other existing tools and services, e.g. by using the UNIX stdout and stderr streams for reporting, and by using a file-based approach that works well with version control systems such as git. And since Rakali is a Ruby gem, it can be used not only as a standalone command line tool, but can also be easily integrated into other Ruby applications.


Next: Using Microsoft Word with git

One of the major challenges of writing a journal article is keeping track of versions: both tracking the different versions you create as the document progresses, and merging in the changes made by your collaborators. For most academics Microsoft Word is the default writing tool, and it is both very good and very bad at this. Very good because the track changes feature makes it easy to see what has changed since the last version and who made the changes. Very bad because this feature is built around keeping everything in a single Word document, so that only one person can work on a manuscript at a time. This usually means sending manuscripts around by email and being very careful not to confuse different versions of the document, which requires creativity.
