Extracting Metadata From the Body of Documents

This third blog post in a series of five is the sequel to “Converting Word files for the Web”.

Getting usable HTML and PDF files out of original Word documents is one of the first steps to take in the operation of a legal information website, but it is not enough in itself. In addition to the full text of documents, modern legal information systems also need to be fed more or less metadata. Dates, titles, file numbers and the like need to be stored in a database for search engines and scripts generating navigational pages to operate efficiently. Traditionally, this has been completed by manually copy/pasting the values from the full text of the files to a standardized form to be subsequently submitted to the database. Whether this form is a structured file with a predefined format or a web-based interface, the approach nevertheless requires substantial human labor, which can be avoided only by managing to acquire structured data from the sources.

This is where the introduction of labels and styles in a Word template starts to pay off. During the conversion process these structural elements are included in the HTML files: labels as raw text and styles as tags. At this stage of the publishing process, it becomes possible to parse the HTML files in order to detect corresponding values. This is achieved through pattern matching using regular expressions checking for the presence of specific sequences of tokens. For instance, whenever the sequence “Date: ” is followed immediately by a sequence of the type “####-##-##” or “####/##/##” in the first 25 lines of the file, it is fair to presume that this is the proper date to use for the document. This is even easier and more precise where styles have been applied since values will be wrapped in <div> and <span> tags, such as “<span style=”decision_date”>####-##-##</span>”.

Word Styles

Once detected, the values can easily be inserted automatically in the database. However, while extraction of values based on styles can be pretty accurate, pattern matching on labels necessarily involves the risk that some of the values detected may be erroneous. For this reason, some level of human validation is recommended. This can be undertaken systematically by requiring editors to view and approve individual metadata entries into the database. Going further it is also possible to bring potential errors to the attention of the editors using automated validation filters. Such filters can verify the integrity of the data format, but can also include pre-computed lists of values. For instance, detecting judges names in a decision can be a tricky business since it is very easy to confuse them with the names of the parties or counsels. However creating a list of judges with alias for each jurisdiction makes it possible to check detected values against that list and confirm validity. In the end, the human efforts required for validating automatically extracted metadata is substantially inferior to that otherwise required.

Style example

The potential of pattern matching for the dissemination of legal information can be illustrated by the Rwanda Law Reports (RLR) project completed by Lexum for the Judiciary of Rwanda in 2013. In this case, the judiciary was provided with all the templates, instructions and editorial policies required for the establishment of formal law reports of judicial decisions originating from all of Rwanda’s jurisdictions. The decision template was drafted in Word and, aside from small adjustments that had to be made to conform to local specificities, the RLR decisions look essentially like other reported decisions from any common law jurisdiction. The template relies heavily on styles, with over 30 distinct styles used, including specific ones for decisions keywords, summaries, etc. Thanks to this approach RLR decisions are produced from scratch in a structured format. Coupled with a dissemination system capable of leveraging this structure (such as Decisia by Lexum), online dissemination of the RLR can be fully automated. Dragging and dropping Word files into a web interface is all that is required to generate a state-of-the-art RLR website.

Signup to our newsletter to keep up to date with our latest blog posts!

Read the next blog post “Creating mobile friendly website“.

Norma

We’ve Been Named One of North America’s Top 100 Inspiring Workplaces for 2025

Articles

Help

Company

Work

Grow

Get In Touch

Extracting Metadata From the Body of Documents

Keep Reading

Apprenez-en plus

We’ve Been Named One of North America’s Top 100 Inspiring Workplaces for 2025

The ALPRT Adopts Norma!

Lexum reimagined

Ready to put Norma to work?

Our solutions

Insights

Who we are

Connect with us