Lexum’s New Approach to PDF Publishing

Lexum is rolling out a new approach to PDF publishing across its products. We are introducing a PDF publishing format that meets sophisticated digital publishing requirements (read: “feature rich”) at an accessible price point.

Currently, two options are commonly used for publishing documents on the web. The first, and most widely used, is to publish documents in PDF format. PDF allows material to be published easily and at low cost, and it perfectly reproduces the original formatting of the documents. However, traditional PDF does not support most of the features required for sophisticated research and cross-referencing. Metadata are often scarce and search is far from optimal (if you are not already convinced of the downsides of PDF publishing, you should definitely read this post).

The second option is to convert PDF files into HTML, which involves extracting and processing the PDF content. This makes the document more usable online, in particular by improving search. However, the extraction loses most of the formatting of the original document and, of course, massaging the HTML to make it look similar to the original file causes delays and higher costs.

Our new approach merges the best of both worlds. When provided with a PDF file, Lexum now uses an image of the file for display and an XML version of the content for indexing, assembled together in one HTML file. Since the XML content is invisibly overlaid on top of the image, users can select text seamlessly. Custom metadata can be extracted and added to the HTML file, facilitating indexing by search engines. You thus get the perfect display of the PDF, combined with the improved digital capabilities afforded by the XML format, such as cross-referencing, indexing and full use of metadata.
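The overlay idea described above can be illustrated with a minimal sketch. It is not Lexum's implementation: the function name `render_page`, the pixel-based word boxes and the inline styles are all assumptions made for illustration. Each word is placed in a transparent, absolutely positioned span on top of the page image, so the reader sees the image but selects and searches real text.

```python
from html import escape


def render_page(image_url, width, height, words):
    """Build one HTML page: a page image with invisible, selectable text on top.

    `words` is a list of (text, left, top, box_width, box_height) tuples,
    in pixels relative to the top-left corner of the page image.
    This layout is a hypothetical simplification, not Lexum's actual format.
    """
    spans = []
    for text, left, top, box_width, box_height in words:
        style = (
            f"position:absolute;left:{left}px;top:{top}px;"
            f"width:{box_width}px;height:{box_height}px;color:transparent"
        )
        # The text is present in the markup (selectable, indexable) but not painted.
        spans.append(f'<span style="{style}">{escape(text)}</span>')
    return (
        f'<div class="page" style="position:relative;width:{width}px;height:{height}px">'
        f'<img src="{escape(image_url)}" width="{width}" height="{height}" alt="">'
        + "".join(spans)
        + "</div>"
    )


page_html = render_page("page-357.png", 800, 1000, [("sincerity", 120, 340, 90, 14)])
```

In a sketch like this, the word coordinates would come from the extraction step, and metadata could be added as ordinary `<meta>` elements in the surrounding HTML document.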

Here are a few illustrations of features that this new approach allows us to include in our publishing services:

Retention of the original formatting: See Faryna v. Chorny, 1951 CanLII 252 (BC CA), a case originally published in the Dominion Law Reports and now available on CanLII. It has retained the page numbering from the original publication, enabling quick navigation by page number.

Words split across two lines are properly indexed and searchable: A search for the word “sincerity” finds it even though it is split across two lines on page 357 of the decision. The same search on a corresponding PDF file would have missed this result.
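One way to make a split word searchable is to rejoin it before indexing. The sketch below assumes the split is marked by a trailing hyphen at the end of the line, and the function names `join_hyphenated_lines` and `build_index` are hypothetical. It only shows the general technique, not how Lexum processes its XML.

```python
import re


def join_hyphenated_lines(lines):
    """Merge a word broken by an end-of-line hyphen with its continuation."""
    text = "\n".join(line.rstrip() for line in lines)
    # "sin-\ncerity" becomes "sincerity". Requiring a lowercase continuation
    # is a simple heuristic to limit false joins; hyphens inside a line stay.
    text = re.sub(r"(\w)-\n([a-z])", r"\1\2", text)
    return text.replace("\n", " ")


def build_index(lines):
    """Return the set of lowercase words found after rejoining split words."""
    return set(re.findall(r"[a-z]+", join_hyphenated_lines(lines).lower()))


lines = ["a question of sin-", "cerity of the witness"]
print("sincerity" in build_index(lines))  # True
```

A real pipeline would need a dictionary check or similar to distinguish line-break hyphens from genuine compound words that happen to end a line.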

Links can be added to legal citations: These links can point to content on CanLII in the Canadian context, or on Fastcase in the US context. The XML overlay allows us to create links from citations that are already contained in the body of the text. This is especially useful when the original reference format is preserved from the PDF.
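Citation linking of this kind can be sketched with a regular expression. The pattern below only covers citations shaped like “1951 CanLII 252 (BC CA)”, and the resolver address is a placeholder on example.org, not a real CanLII or Fastcase URL. Both the pattern and the `link_citations` name are assumptions for illustration.

```python
import re
from html import escape
from urllib.parse import quote

# Matches e.g. "1951 CanLII 252 (BC CA)"; other citation formats are not handled.
CITATION = re.compile(r"\b\d{4} CanLII \d+ \([A-Z]{2,4}(?: [A-Z]{2,4})?\)")

# Hypothetical resolver endpoint, used only to show where a link target would go.
RESOLVER = "https://example.org/resolve?cite="


def link_citations(text):
    """Return HTML in which each recognized citation is wrapped in a link."""
    parts = []
    last = 0
    for match in CITATION.finditer(text):
        parts.append(escape(text[last:match.start()]))
        cite = match.group(0)
        parts.append(f'<a href="{RESOLVER}{quote(cite)}">{escape(cite)}</a>')
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


print(link_citations("See Faryna v. Chorny, 1951 CanLII 252 (BC CA)."))
```

Because the citation text itself is left unchanged inside the link, the original reference format from the PDF is preserved.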

“Lazy loading” allows for quicker and more efficient navigation: Only a limited number of page images load at a time, although the user can still search the entire document. See “The Preparation, Citation and Distribution of Canadian Decisions” on Qweri, which is navigable through a quickly loading table of contents and does not load the entire document at once.
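The core of lazy loading is deciding which page images to fetch for the reader's current position. The following sketch assumes a fixed window of pages around the current one. The function name `pages_to_load` and the default window size are hypothetical, and the post does not say how many pages Lexum actually loads.

```python
def pages_to_load(current_page, total_pages, window=2):
    """Return the 1-based page numbers whose images should be loaded now.

    Only pages within `window` of the current page are included; the text
    of every page can still be indexed separately for full-document search.
    """
    first = max(1, current_page - window)
    last = min(total_pages, current_page + window)
    return list(range(first, last + 1))


print(pages_to_load(50, 100))  # [48, 49, 50, 51, 52]
```

As the reader scrolls or jumps through the table of contents, the same function would be called again with the new current page.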

Lexum is currently adopting this new approach across all of its publishing services. It has been released as part of Qweri version 1.18 and will replace the current inline PDF approach in the upcoming version of Decisia. It is also available to clients interested in personalized editorial services, such as processing archives for publication in a third-party database. This mid-market approach now enables organizations of all sizes to undertake high-quality publishing of archives at a fraction of the previous cost.