One Ring to rule them all, One Ring to find them, One Ring to bring them all and in the darkness bind them.

Yesterday 60 years ago the first volume of the Lord of the Rings trilogy by J.R.R. Tolkien was published. The quote above obviously doesn’t quiet apply to scholarly publishing, but one recurring theme that I have often heard in the last few years is that of a need for a canonical digital document format for scholarly content that rules all other formats.

Document formats in scholarly Publishing
Document formats in scholarly Publishing

A few years ago almost everyone you would have said that xml is that format, with the NLM Archiving and Interchange Tag Suite - which has evolved into JATS - probably the most commonly used Document Type Definition (DTD). xml does many things really well, but also has important shortcomings, most importantly that it is probably not a good format for authors (and don’t tell me that docx and odt are XML-based). We therefore don’t really expect authors to submit manuscripts in JATS xml, but rather convert documents into this format after a manuscript has been accepted for publication. This conversion step is often time-consuming and labor-intensive.

More recently html has become the most interesting candidate for a canonical scholarly document format. The big advantage over xml is that html - or at least html5 which is most popular today - is an attractive format for online authoring tools (that is why html is listed both as input and output format) The downside of this flexibility is that it is much harder to embed structure and metadata into html5 compared to xml. There are initiatives such as schema.org and HTMLBook that hope to change that, but we aren’t quiet there yet.

Or maybe we should learn from Tolkien and give up on the idea of a canonical document format and rather spend our energy on building tools that make it easier to transition from one format to another. Pandoc is such as tool, but can’t do all the required conversions, e.g. it can’t yet use docx as input. The downside here is that every file conversion runs the risk of loosing important information. But the increase in flexibility hopefully outweights these shortcomings.


Next: Fragment Identifiers and DOIs

Before all our content turned digital, we already used page numbers to describe a specific section of a book or longer document, with older manuscripts using the folio before that. Page numbers have transitioned to electronic books with readers such as the Kindle supporting them eventually.

blog comments powered by Disqus