Converting Word Files for the Web

This second blog post in a series of five is the natural sequel to “Making the Most Out of Word Templates and Styles”.

Generating cleaner, more structured Word files is one thing. Converting them into file formats suited to the web is another. Legal information websites require HTML and PDF versions that must somehow be converted from the original Word files. Of course, anyone can use Word's “Save as” feature for this purpose, but doing so creates two difficulties. First, converting files manually one by one may work at low volume, but it quickly becomes unmanageable as the number of documents grows. Second, the HTML code generated by Word is so cluttered with superfluous markup that it is almost useless “as is” for a professional website.

The obvious solution to the first problem is to batch process the conversion of files from Word to both HTML and PDF. For this purpose, Lexum developed its own file converter, Polyglotte, as early as the mid-1990s. Polyglotte has been updated with each new version of Microsoft Word released since then. In essence, Polyglotte manages multiple conversion requests and drives an installation of Word as a virtual server. It currently runs on a cluster of several physical servers capable of handling hundreds of conversions per minute. It is made available online as a secure web service that is continuously called by the various products and bespoke systems developed by Lexum.
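The batch idea can be sketched in a few lines of Python. This is only an illustration, not Polyglotte's actual code: the names `plan_jobs` and `run_batch` are hypothetical, and the `convert` argument stands in for whatever actually drives Word (it is an assumption, not a real API). The sketch expands each Word file into one job per target format and runs the jobs on a pool of workers, recording failures instead of stopping the whole batch.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

TARGET_FORMATS = ("html", "pdf")


def plan_jobs(sources, formats=TARGET_FORMATS):
    """Expand each Word file into one (source, target) job per output format."""
    jobs = []
    for src in sources:
        src = Path(src)
        for fmt in formats:
            jobs.append((src, src.with_suffix("." + fmt)))
    return jobs


def run_batch(sources, convert, workers=4):
    """Run every job through `convert`; return {target: None or the exception raised}."""

    def run_one(job):
        src, dst = job
        try:
            convert(src, dst)  # hypothetical callable that performs one conversion
            return dst, None
        except Exception as exc:  # one bad file should not abort the batch
            return dst, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_one, plan_jobs(sources)))
```

A caller would pass the list of Word files and its own conversion function, then inspect the returned mapping to see which targets failed and why.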

To address the second problem, namely the quality of the HTML code output by Word, Polyglotte has also been equipped with cleaning filters. These filters are applied to the HTML files produced by Word before they are sent back. First, they fix Word's erratic use of fonts, replacing characters set in symbol fonts with their Unicode equivalents and removing superfluous font faces. Second, they apply a series of required fixes to the HTML tags. These fixes include the removal of unneeded elements such as empty “Style” and “Span” tags, page breaks, erroneous Word fields and so on. They also improve components that are not well supported by Word in HTML but that are critical in the context of legal documents, such as footnotes. Ultimately, the files returned are clean enough for online display, especially if styles have been applied consistently in the original Word files. In such a case, the converted HTML styles can even be used by the legal information website’s Cascading Style Sheets (CSS) to specify the look and formatting of the online version of the documents.
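To give a feel for what such a filter does, here is a minimal Python sketch of three of the fixes mentioned above: Symbol-font characters, page breaks and empty “Span” tags. It is an assumption-laden illustration rather than Polyglotte's implementation: the function name `clean_word_html`, the regular expressions and the tiny `SYMBOL_TO_UNICODE` table are all hypothetical, and a production filter would more likely rely on a real HTML parser than on regular expressions.

```python
import re

# Hypothetical, deliberately tiny subset of a Symbol-font to Unicode table.
SYMBOL_TO_UNICODE = {"a": "α", "b": "β", "p": "π", "D": "Δ", "S": "Σ"}

# A span whose inline style selects the Symbol font and whose content is plain text.
SYMBOL_SPAN = re.compile(
    r"<span\s+style=(['\"])[^'\"]*font-family:\s*Symbol[^'\"]*\1\s*>([^<&]*)</span>",
    re.IGNORECASE,
)
# A span containing nothing but whitespace.
EMPTY_SPAN = re.compile(r"<span\b[^>]*>\s*</span>", re.IGNORECASE)
# A line break carrying a page-break style.
PAGE_BREAK = re.compile(r"<br\b[^>]*page-break-before[^>]*>", re.IGNORECASE)


def clean_word_html(html):
    """Apply three illustrative clean-up passes to Word-generated HTML."""

    def to_unicode(match):
        # Drop the Symbol-font wrapper and translate each character it contained.
        return "".join(SYMBOL_TO_UNICODE.get(ch, ch) for ch in match.group(2))

    html = SYMBOL_SPAN.sub(to_unicode, html)
    html = PAGE_BREAK.sub("", html)

    # Repeat until stable so that spans left empty by an earlier pass are removed too.
    previous = None
    while previous != html:
        previous = html
        html = EMPTY_SPAN.sub("", html)
    return html
```

Spans that carry actual content are left untouched, so any style information applied to real text survives the clean-up.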

The main benefit of the approach described above is that it turns what could be a complex publishing operation into a turnkey service anyone can operate. Polyglotte gives Lexum’s clients the capacity to turn any number of Word files into HTML or PDF with a single click. Moreover, relying on the Word format itself instead of a simpler text-based alternative avoids the loss of components that depend on more advanced features of Word, such as pictures and tables of contents.

The most compelling illustration of how a small team of non-technical staff can transform hundreds of complex legal documents from Word to HTML in no time is provided by the CIDCOM project initiated by the Confédération Générale des Entreprises de Côte d’Ivoire. Based in Abidjan, an editor using Lexum’s OyezOyez platform was able to single-handedly post online all available economic integration documents originating from several regional bodies (including the WAEMU and ECOWAS) as well as the corresponding implementing laws of Ivory Coast. At the time of writing, close to a thousand legislative documents have been posted on the CIDCOM website, many of them automatically converted to HTML from the original Word files.


Read part 3 of 5: Extracting MetaData from the Body of Documents.
