Authors: Guy Huard
Subjects: Internet computer network (Other articles)
SGML computer program language
Issue: Volume 4, Number 3 (September 1997)
Category: Current Developments
- The Supreme Court of Canada (SCC) is the highest level of the Canadian judiciary and its decisions cannot be appealed, making them the mandatory object of constant reference and study for the whole of the Canadian legal community. Also, many of these decisions are of great interest to the citizens of Canada.
Indeed, since the adoption of the Canadian Charter of Rights and Freedoms, the Court often has to speak on political and social issues with which great parts of the population are concerned. For these reasons, the decisions of the Supreme Court have understandably been the first collection of Canadian judicial material to be published on the Internet; our team at the Centre de recherche en droit public (CRDP) started publishing them in the spring of 1994. The new service met with instant success, and the response has been increasing ever since; during 1996, the Supreme Court Web site received more than 300 000 requests and document downloads.
- Our goal at the CRDP is to make SCC decisions available to a much larger public than can afford to use the existing commercial providers of on-line legal information. This concern for democratising access to judicial information called for technologies that would be the most affordable for us and simple to use for the end-user. Also, with scarce financing for the project, we had to develop processes allowing for the free publishing of the information. We thus had to make the processing of the files provided by the SCC into a distribution format usable by lawyers and citizens as automatic as possible. While the service has remained unchanged in its three years so far, the technical means used to provide it have considerably evolved. The pragmatic solutions used at first gradually made way for a rigorous and automated process based on the use of SGML (Standard Generalized Markup Language).
- In the following, we will describe the more technical aspects of this project. We will first go over the corpus itself and the circumstances of its publishing (Section 2). We will then briefly consider the initial solution, which, although now obsolete, will help in bringing attention to problems others wanting to publish legal material on the Internet might encounter (Section 3). After that, the SGML solution, the description of which constitutes the bulk of this text, will be discussed in detail (Section 4). We will conclude with a brief outline of future developments.
2. THE SCC COLLECTION AND ITS TRADITIONAL PUBLICATION
- The SCC decisions collection published by the CRDP numbers more than 2 000 documents representing about 100 000 pages. It contains all decisions since 1989 and all decisions pertaining to the Canadian Charter of Rights and Freedoms. Although part of this corpus was batch-processed, Court decisions are generally published most weeks on Thursday morning, within an hour of their having been made available by the Supreme Court. Web publishing of SCC information is made up of three parts: the weekly Bulletin, recent decisions, and reported decisions, the final official versions as published in the Supreme Court Reports.
- The weekly Bulletin consists of about fifty pages of information on the Court’s activities and decisions. For example, it tells which appeals are granted or turned down, gives a follow-up on cases under consideration, and so on. It is thus a very useful publication, but not an easy document to deal with. It consists of about twenty different sections, all differently organised and typographically complex. The presentation on paper is excellent, but its reuse for Internet publishing is far from obvious. These files are thus published in their original format and in a simpler text format, which sometimes doesn’t come out quite right.
- Publishing the decisions is of course at the heart of the project. Each one is published twice. The first time is when it is made public. It must be done quickly, as in many cases the decisions are impatiently awaited. The second time comes several weeks later, when the final revised and official version is published in the Supreme Court Reports. In some cases, a rectification of this official version will come out later still. Each new publishing of a decision obviously requires that its earlier version be withdrawn. The publishing environment must hence be dynamic.
- The length of judgements varies from one page to several hundred. Yet, they are all identically structured. The order of elements of information is practically invariable. The decisions all start with a lengthy header assembled by Court personnel, followed in the same file by the motives of the judgement authored by one or more judges. The header is made up of information on the judgement, such as its exact reference, its docket number and the description of the bench. Motives written by judges of the Canadian Supreme Court are detailed. The reasoning of the Court is described explicitly, and dissenting opinions are usually the object of distinct motives. The text as a whole is carefully crafted: the use of typographical attributes is systematic.
3. THE INITIAL APPROACH
- The solution adopted in 1994 was made up of four main elements: the simultaneous use of different types of Internet servers; the use of several proprietary formats besides traditional ASCII; the implementation of a search function and, finally, the development of scripts to automate the processing of files for their publication.
The technologies retained and the site’s organisation
- In 1994, the Web was a new-born, Gopherspace was at its apex, and the FTP server the tool of choice of archival sites. So it only made sense to use all these protocols to make accessible the information we wished to publish. In fact, the same collection of files could be accessed through the FTP, Gopher and HTTP servers as long as our archives were carefully organised. The FTP server, where the very structure of directories guides the user, imposed a total transparency of the naming and structure of our directories. The structure must be logical and the identifiers meaningful. Although much care was given to these considerations, as the publishing service evolved into an archival site and a global electronic work, some shortcomings became apparent.
- With directory naming, certain details of no apparent initial relevance grew to become annoying. For example, directory names containing capital letters were a frequent cause of failed requests, as they made typing the right URL unduly complicated. The worst problems, though, were caused by file names. We had decided to use the same names the Court used; they are usually made up from the first letters of the name of one of the parties. But if parties must remain anonymous, or a name has already been attributed to a previous decision, or names of parties involved in a case are simply too long, another naming scheme is used by the Court’s staff. As long as they are only used by humans, as with the FTP protocol, this poses no problem. Quite to the contrary, their approximate relation with the real name is rather useful. But by the same token, the automatic generation of such URLs from the formal references of decisions cited within documents, and thus automatic hyperlinking, is made impossible (fig. 1).
Reference: MacMillan Bloedel Ltd. v. Simpson,  2 S.C.R. 1048
URL: http://CSC/arrets/1996/vol2/ascii/greenpea.en.txt
Fig. 1 : URL by initial approach
- In 1994 – as is still the case today – online commercial services offered to the Canadian legal community used 7-bit ASCII to deliver information to their users. What is mere blandness for a document in English becomes something more in French. Actually, the absence of diacritic signs in French can considerably alter the meaning of a word. Thus, the phrase “a lawyer stolen from” becomes “a flying lawyer”, or worse, “a lawyer in the act of stealing”, altogether very different propositions. Furthermore, reuse of such texts is gravely impaired; quoting parts of them requires tedious proof-reading to re-establish proper accentuation. For these reasons, we wished to make the corpus available in formats preserving the French text in full. We thus chose three formats: two binary formats, WordPerfect 5.1 for DOS and Microsoft Word 5 for Macintosh, and the mandatory ASCII (American Standard Code for Information Interchange), the latter assuming context would suffice to establish proper meaning for those with no choice but to use it for French versions. Such a choice wasn’t without consequences: all documents had to go through several conversions before they could be published. In fact, each decision had to be available in no less than eight versions: ASCII, WP, Word and Word-Binhex (the latter for the Gopher protocol) in both French and English.
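The loss of meaning described above can be illustrated with a short sketch. It uses Python's standard unicodedata module; the helper name to_ascii7 is hypothetical and this is not the conversion tool used in the project, only an approximation of what a 7-bit export does to accented text.

```python
import unicodedata

def to_ascii7(text: str) -> str:
    """Drop diacritics roughly the way a 7-bit ASCII export would (illustrative only)."""
    # Decompose accented letters into base letter + combining mark,
    # then discard everything that is not plain ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# The past participle "volé" and the present-tense "vole" are distinct words,
# but they become indistinguishable once the accent is lost.
words = ["volé", "vole"]
flattened = {to_ascii7(w) for w in words}
```

After flattening, both words map to the same string, so the reader must rely on context alone to tell them apart.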
- At the time, it made sense to retain WordPerfect 5.1 as a publishing format. WP 5.1 was then the de facto standard in the Canadian legal world, and it was also the format which we got from the Court. MS-Word 5 was added for Macintosh users. For all others not using micro-computers or these two formats, there was always ASCII. Alternatives such as ISO-Latin 1 and code pages supporting accented characters in the PC universe appeared too dubious to be used as lowest common denominators. HTML was never considered, as there was no converter good enough to produce HTML automatically when we began the project. Taken together, the three complementary formats we had chosen seemed adequate. That is, until problems came up.
- Less than a year later, Court staff informed us they were about to change their word processors. At first glance, this change seemed only to add an additional conversion step. Nothing to worry about: we could still use our three formats. But things turned bad: from then on the original format provided by the Court would no longer be available on our site, unless we added yet another format. But there was even worse to come. When we looked at it, sooner or later we would have to reprocess a sizeable proportion of our ever-growing collection, as most of our users would shift to newer versions of the most popular word processors, making our conversion to obsolete file formats counter-productive. Sooner or later we would have had to shift our publishing formats to something like ASCII, WordPerfect 6 and MS Word 6. But then, what about older decisions? Leave them untouched and end up with a patchwork of formats in our archives? Reconvert everything to the latest format and get ready to do the same with each new version of MS Word? We didn’t look forward to any of these options. We were entangled with proprietary formats.
The search function implemented
- When first designed, our Supreme Court of Canada Internet publishing service dealt with a limited number of decisions. That made the choice of the search function of lesser importance. We thus implemented a freely available search engine of the WAIS type [freeWAIS-sf]. WAIS readily integrated with our Gopher server and could also be used with the Web server. Indexes were built from the ASCII versions of the decisions. Search capabilities were minimal: no fielded search and no support for French diacritic signs. This solution quickly became inadequate. The search language was too limited: it wasn’t possible, for example, to search for a phrase such as “Charter of Rights”. Finally, many decisions being rather lengthy documents, users couldn’t be satisfied with query results consisting of lists of documents, each over a hundred pages long, supposedly containing the requested terms somewhere within.
Automating the processing
- From the start, as we said earlier, minimally financed publishing for free demanded a high level of automation. It seemed possible, as the corpus itself was standardised enough to lend itself to machine processing. The process implemented was almost entirely automatic. It can be described in seven steps, from the CSC offices in Ottawa to Internet publication on the CRDP servers in Montreal:
1) The files are first compressed into a single file, which is then sent by modem and FTP to a Unix server at the CRDP. The reception of this file triggers its decompression and the sending of a message to the Webmaster.
2) The Webmaster starts a script on a Macintosh which downloads the uncompressed files.
3) A second script is then started, which pilots the conversion of these files by the MS-Word, MacLinkPlus and Binhex programs.
4) The same script then uploads the converted files (MS-Word, ASCII and MS-Word/Binhex), as well as the original WordPerfect file, to the Unix server.
5) There, a Perl script proceeds to several checks. For example, if the version being processed is the official one published in the Supreme Court Reports, that program will erase the original, not yet reported version, and if the volume of the Reports is a new one, a new corresponding directory will be created.
6) Once all the formats of the decisions are inserted in their proper places, the same Perl script starts the indexing of the files by WAIS.
7) Finally, a message draft for the Jurinet-l mailing list is sent to the Webmaster, who makes sure everything is as it should be and then sends subscribers the message that new information has just been made available on the server.
This whole process is repeated about a hundred times a year.
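The checks of step 5 can be pictured with the following minimal sketch. It is written in Python rather than Perl, and it is not the original script: the function name publish, the "recent" directory and the overall directory layout are assumptions made for illustration only.

```python
from pathlib import Path
from typing import Optional

def publish(root: Path, name: str, text: str,
            year: Optional[int] = None, volume: Optional[int] = None) -> Path:
    """Store one decision file; every directory name here is hypothetical."""
    recent = root / "recent" / name
    if year is None or volume is None:
        # First publication: the decision is not yet reported.
        target = recent
    else:
        # Official version: file it under its Reports volume.
        target = root / str(year) / f"vol{volume}" / name
        # The earlier, not yet reported version must be withdrawn.
        if recent.exists():
            recent.unlink()
    # Creates the volume directory when the volume is a new one.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

Calling publish once without a volume and once with year and volume mimics the two publications of a decision: the second call leaves only the reported version on disk.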
- This automated processing system worked satisfactorily for the first two years. But when the Supreme Court changed its word processor, the processing had to be altered and a good part of the automation was lost. It then seemed wiser to redesign the system from the ground up than try to patch it. From the automation standpoint, three shortcomings had to be corrected for easier maintenance: the new system should operate on a single computer; the process itself should be more modular so each conversion could be independently triggered; and the index pages used to navigate on the Web site should be generated once as part of the site update rather than being dynamically generated at each user request. On this latter point, we had realised that any error or shortcoming in the scripts called for programming to correct it, when static pages could readily have been corrected by the site editor.
Evaluating the initial approach
- The initial approach worked for almost three years. As time passed, a number of inherent problems grew, to the point of calling for a redesign of the system, as we saw. Close examination of several of these problems brought us towards SGML, especially those related to file formats and to the root of the retrieval system shortcomings.
4. THE SGML APPROACH
- The reengineering of our Web site had three main goals: 1) formats used had to provide the possibility of publishing content-rich documents as well as be stable through time; more specifically we wanted to avoid replacing proprietary formats by other such formats; 2) search mechanisms had to be reinforced to better support the French language and allow for structured search, and 3) the user interface needed to be upgraded, notably with hypertext navigation from one decision to another. Our general goal remained unchanged, that is, to design a system simple enough to operate that it could support publishing legal information for free on the Internet.
- Toward these goals, the basic design choice was a more solid and durable foundation for our Web site, one that would be independent from software producers’ marketing strategies. SGML can be such a foundation, for one of its main characteristics is to dissociate form from content, style attributes from text [Goldfarb 90; ISO-8879; Sperberg-McQueen and Burnard 93]. Our evaluation was that crossing over to SGML would quickly pay for itself. In any publishing context, this design choice involves four main facets. First, the classes of documents to be processed must be defined and modelled. Second, since most of the documents haven’t been originally produced in SGML, they must be converted from whatever original format to SGML. Third, management of such resulting SGML data must be thought through. Finally, the appropriate system for the use of this SGML data, in our case an Internet publishing system, must be designed and implemented.
- The first thing needed to bring out the structure of a document with SGML is an appropriate tag set. Such a tag set will be one of the results of the modelling process based on the document analysis. It can then be used to express the structure of a class of documents in a document type definition, a DTD. The modelling process itself can be divided in three main steps.
- First, the reference corpus must be identified and sampled. In our case, must we consider decisions of the last five years or all decisions since the 1920s? Can we envision developing a DTD for the whole of Canada’s courts or must we rather foresee a specific DTD for each tribunal, and hence a DTD specific to the Supreme Court? Answers to such questions will help in deciding what the reference collection will be. Once this is done, a representative sample must be identified with help from domain experts.
- The constitution of this sample is important, since the correctness of the model derived depends on the sample’s conformity to the collection. Moreover, although some corpora look like they strictly follow a rigid model, surprises happen. Thus, one decision of the Supreme Court will contain a picture, in another a footnote will give details on the constitution of the bench, and so on. All these variations might not have to be included in the collection’s document model, but to allow judicious choices to be made, the analyst must have at his disposal a representative sample. In our case, we had the good fortune to have access to the Court’s style guidelines. That information, added to the chosen sample, gave us enough material to produce the document model.
- Second, the sample and other information on the document class must be analysed to produce the appropriate DTD. More precisely, the sample’s analysis will yield a grammar describing a general structure and its permitted variations as well as its constituent elements. Deciding what elements will be required calls for a decision on the needed level of granularity. For example, it must be decided if the bench is a single element or if the name of each judge is going to be tagged separately (fig. 2 a) and b)). Also, SGML allows a more or less rigid expression of the structure. Thus, two elements can have a fixed or arbitrary sequential relationship, one of them can be mandatory or optional, and so on (fig. 2 c) and d)). Such choices will be guided by the use foreseen for the DTD. If a DTD is to be used for document preparation, it could be less permissive. Another, designed primarily to support a documentary system, might be less detailed, and more flexible if it has to deal with legacy documents, as we will see in the next section.
a) <!ELEMENT BENCH - - (#PCDATA) >
<BENCH>Présents: Le juge en chef Lamer et les juges La Forest, L’Heureux-Dubé, Sopinka, Gonthier, Cory, McLachlin, Stevenson et Iacobucci.</BENCH>
b) <!ELEMENT BENCH - - (SEPARATOR | JUDGE)+ >
<!ELEMENT SEPARATOR o o (#PCDATA) >
<!ELEMENT JUDGE - - (#PCDATA) >
<BENCH>Présents: Le juge en chef <JUDGE>Lamer</JUDGE> et les juges
<JUDGE>La Forest</JUDGE>, <JUDGE>L’Heureux-Dubé</JUDGE>,
<JUDGE>Sopinka</JUDGE>, <JUDGE>Gonthier</JUDGE>, <JUDGE>Cory</JUDGE>,
<JUDGE>McLachlin</JUDGE>, <JUDGE>Stevenson</JUDGE> et
<JUDGE>Iacobucci</JUDGE>.</BENCH>
c) <!ELEMENT SUMMARY - - (HEADER,BRIEF) >
d) <!ELEMENT SUMMARY - - (HEADER|BRIEF)+ >
Fig.2: Granularity and flexibility in a DTD design. A coarse (a) description of the bench, a finer one (b). A rigid (c) or a more flexible (d) definition of the Summary element.
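One practical consequence of the finer tagging in (b) is that each judge's name becomes directly addressable by a program. The sketch below illustrates this in Python; it relies on a simple regular expression rather than a real SGML parser, and the shortened sample strings are abridged from fig. 2 for brevity.

```python
import re
from typing import List

# Abridged versions of the two taggings shown in fig. 2 a) and b).
coarse = "<BENCH>Présents: Le juge en chef Lamer et les juges La Forest, Sopinka.</BENCH>"
fine = ("<BENCH>Présents: Le juge en chef <JUDGE>Lamer</JUDGE> et les juges "
        "<JUDGE>La Forest</JUDGE>, <JUDGE>Sopinka</JUDGE>.</BENCH>")

def judges(bench: str) -> List[str]:
    """Return the content of every JUDGE element found in a BENCH element."""
    return re.findall(r"<JUDGE>(.*?)</JUDGE\s*>", bench)
```

With the coarse tagging the function finds nothing, since the names are buried in running text; with the fine tagging it returns the list of names, which is what a fielded search on judges would need.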
- Third, the model resulting from the analysis must be checked against a control sample, or better yet, against the whole corpus. In the course of our project this control step was accomplished simultaneously with the up-conversion of the entire collection. Some adjustments had to be made to our DTD as this conversion to SGML went along. The resulting DTD quite adequately expresses the structure of the Court’s header as well as the less structured briefs from the Justices, at least for our needs (fig. 3).
<!DOCTYPE CSC [
<!ELEMENT CSC - - (SUMMARY,THE.MOTIVES,FINAL.BRIEF) +(PAGE|EMPH)>
<!ELEMENT PAGE - O EMPTY >
<!ELEMENT SUMMARY - - (HEADER,BRIEF)>
<!ELEMENT HEADER - - (TITLE.C,HEADING,DOCKET.NO,(DATES|BENCH|ORIG)+, ABSTRACT+)>
<!ELEMENT TITLE.C - - (#PCDATA)>
<!ELEMENT HEADING - - (CASE,(SEPARATOR,CASE)*)>
<!ELEMENT CASE - - (PARTY,(SEPARATOR|PARTY)*)>
<!ELEMENT PARTY - - (PARTY.NAME,STATUS)>
<!ELEMENT PARTY.NAME - - (#PCDATA)>
<!ELEMENT STATUS - - (#PCDATA)>
<!ELEMENT SEPARATOR - - (#PCDATA)>
<!ELEMENT DOCKET.NO - - (#PCDATA)>
<!ELEMENT DATES - - (#PCDATA)>
<!ELEMENT BENCH - - (SEPARATOR|JUDGE)+>
<!ELEMENT (JUDGE|ORIG) - - (#PCDATA)>
<!ELEMENT ABSTRACT - - (TERM+)>
<!ELEMENT TERM - - (#PCDATA)>
<!ELEMENT BRIEF - - (BRIEF.DECISION,REFERENCES?,FINAL.PAR)>
<!ELEMENT THE.MOTIVES - - (P?,MOTIVE+)>
<!ELEMENT FINAL.BRIEF - - (OUTCOME,PROCURATOR+)>
<!ELEMENT OUTCOME - - (#PCDATA)>
<!ELEMENT PROCURATOR - - (#PCDATA)>
]>
Fig. 3: A simplified version of the CSC DTD
Up-conversion to SGML of legacy documents
- Some SGML implementation scenarios are based on the assumption that documents to be managed by the system will be originally produced with SGML-supporting tools. However, in most cases, as for our Internet publishing project, corpora already exist and must be converted to SGML. At first glance, judicial documents would seem to be easy to convert: jurists follow rigorous reference conventions, legislative texts look very structured and carefully laid out, and even court decisions have a systematic look. But this rigorous formalism is for the most part only skin deep, and our conversion experience has led us to temper our initial optimism and agree with Sklar when he says:
“Up conversion – the translation of a document from a proprietary word-processor (WP) format to an SGML document conforming to a useful DTD – is one of the thorniest problems an organization faces when it adopts SGML. The conversion application typically involves two phases: 1) extraction and interpretation of the formatting codes in the WP format, and 2) identification of content and structure.
The second phase is the most sophisticated one, for it involves creating something (structure and true identification) from “nothing” (WP formats which are typically flat and lacking in content identification).” [Sklar 94]
- For the first step identified by Sklar – extracting the information – all files to be up-converted to SGML must first be converted to a convenient format, so the up-conversion toward SGML as such can be accomplished from a single starting point. The intermediate language chosen must be powerful enough to express useful features of documents to be converted, but also it must easily lend itself to pattern recognition searches.
- The Microsoft RTF text description language [Born 95] is often mentioned as an interesting prospect in this regard. Our first efforts in up-converting documents explored that path. However, if several up-conversion projects are to be considered or if conversion activities must go on for an extended period, it might pay to adopt an intermediary format that will make the second phase of up-conversion easier. Furthermore, many versions of RTF coexist, and even inside a given version, the code produced by different word processors is not always structured the same way.
- So, we abandoned RTF to consider HTML itself as an intermediary format. In that particular scenario, we produced an HTML version of the source files with a custom RTF to HTML converter. However, HTML, even though able to express many typographical features, was not really the best formalism to preserve many typographical artefacts coming from word processor formats. For instance, its peculiar treatment of white space was not appropriate. White spaces and their repetitions often bring out evidence about the structure that is otherwise unavailable. Moreover, part of our interest in HTML came from the existence and the promise of a rich array of converters to that format. We soon discovered that each of these converters produced HTML its own way (or used its own flavour of HTML: see for instance the HTML produced by MS Word 97!). Another proposal, the Rainbow DTD, was brought forward by Electronic Book Technologies. That DTD was precisely designed to be a universal starting point for up-conversion [EBT 94]. Unfortunately, that suggestion was not supported by the word-processor industry and the awaited converters never materialised.
- We finally resolved to design our own pivotal format. Actually, we decided to use RTF as an exchange format, thus insulating our up-conversion programming activities from the variations in the word processor formats. But to also avoid having to consider all not so well documented variations of RTF in each conversion project, and by the same token have a simpler starting point for the up-conversion process proper, we designed a typographical SGML DTD able to express the meaningful information embedded in RTF files. This way, if we have to adapt our conversion process to changing formats of source-documents, only the program converting from RTF to this typographical SGML must be modified. That intermediary DTD, which we call the X DTD, is a bit peculiar by SGML standards: it is not designed to express the structure of a class of documents, but rather the appearance of any document [Lavoie 96].
- With access to the lexical and typographical information contained in the original file thus assured, the second step identified by Sklar – identifying the elements and structure of the document – can be dealt with. For this, a conversion specification must first be established that will link elements of the DTD to their identifiable features in the source files. Annex 1 shows some of the features available in a judgement of the CSC. As can be seen, the header comes first, then the name of a party in bold followed on the same line by the status of that party. In most cases such elements are easily identified, but many surprising variations may occur. As always, the legal text is formal only in appearance; despite the quality of the work done by the Court’s clerks, new structural features, apparently introduced for good reasons, reveal themselves in the conversion process. Nevertheless, in the more standard cases, which make up the greater part of the collection, the clear structure and the regular typographic traits made the up-conversion possible (see Annex 2 for a tagged SGML document).
- Thus, converting to SGML from the X DTD format becomes a matter of lexical and typographical pattern recognition aiming to replace X tags with structural SGML tags from the document’s class DTD. Although up-conversion can be accomplished with a variety of computer tools adept at pattern recognition, using a DTD-driven tool, a specialised tool capable of using a DTD to govern itself, will greatly facilitate such work. First, lexical constructs often signal the beginning of an element (see Annex 1). The end of the element may come from the recognition of the beginning of the next element or from another recognisable lexical construct. Secondly, typographical constructs may often help to identify the beginning of an element when not a single clue may be found at the lexical level. An example of this comes from the references to other cases, which always appear in italic inside the judges’ motives; when the program encounters italic in running text, it knows it’s time to look for a reference pattern. Finally, the tools built for SGML up-conversion may also refer to the DTD to know where they are in their analysis. Indeed, the context provided by the DTD often facilitates the recognition of patterns that may otherwise seem ambiguous.
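The italic cue just described can be sketched as follows. This is an illustration in Python, not the X2CSC program: the tag <I> is only a stand-in for whatever the X DTD uses to mark italic, the REF element is hypothetical, and the bracketed year in the sample is taken from the URL shown in fig. 1.

```python
import re

# Assumed markup: <I>...</I> stands in for the X DTD's italic tag (hypothetical).
ITALIC_THEN_CITATION = re.compile(
    r"<I>(?P<name>[^<]+)</I>,\s*\[(?P<year>\d{4})\]\s*(?P<vol>\d+)\s*"
    r"(?P<series>S\.C\.R\.|R\.C\.S\.)\s*(?P<page>\d+)"
)

def tag_references(text: str) -> str:
    """Wrap an italic case name followed by a citation in a hypothetical REF element."""
    return ITALIC_THEN_CITATION.sub(
        r"<REF>\g<name>, [\g<year>] \g<vol> \g<series> \g<page></REF>", text
    )

sample = ("as held in <I>MacMillan Bloedel Ltd. v. Simpson</I>, [1996] 2 S.C.R. 1048, "
          "the <I>mens rea</I> was")
tagged = tag_references(sample)
```

Italic text that is not followed by a citation pattern, such as a Latin phrase, is left untouched, which is why the typographical cue only triggers the search and the lexical pattern confirms it.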
- For the rest, the conversion process proceeds by successive refinements. The DTD might be relaxed, or the conversion script adjusted, until a significant percentage of the corpus is converted. An example of the latter is the relaxation of the strictness of the patterns with regard to white space. At the beginning, we planned to recognise “Present: ” strictly as the characters ‘P’, ‘r’, ‘e’, ‘s’, ‘e’, ‘n’, ‘t’, ‘:’ and ‘ ’ (yes, the white space itself). Then we noted that sometimes two white spaces exist after the colon. Then there was a tab, or three or four spaces. We finally decided that the spacing wasn’t important there: we had a very solid lexical landmark, so why bother with white space, even though elsewhere it may have been important? In other cases, as we went back in time with the up-conversion process, taking up older and older documents, we added alternative recognition rules to match structural variations. These were a little harder to cope with. It is in working on the longitudinal dimension, going back in time, that we encountered these variations. We had to adapt and relax the DTD to make room for the past forms of the Supreme Court decisions. We proceeded year by year, adapting the programme to particular new features as we went along and testing the new version against the entire collection treated so far.
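The relaxation of the “Present:” pattern can be expressed with two regular expressions. The sketch below is in Python and only illustrates the idea; the names STRICT, RELAXED and bench_text are hypothetical, and the sample lines are made up for the purpose.

```python
import re
from typing import Optional, Pattern

# Strict rule: exactly one space after the colon.
STRICT = re.compile(r"Present: (?P<bench>\S.*)")
# Relaxed rule: any run of spaces or tabs (including none) after the colon.
RELAXED = re.compile(r"Present:[ \t]*(?P<bench>\S.*)")

def bench_text(line: str, pattern: Pattern[str]) -> Optional[str]:
    """Return the text following the landmark, or None if the line is not recognised."""
    m = pattern.match(line)
    return m.group("bench") if m else None
```

The strict rule rejects a line with two spaces or a tab after the colon, while the relaxed rule accepts all of them and still anchors on the same lexical landmark.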
- Proceeding in such an iterative way, after some relaxation of the pattern, and some smoothing of the DTD, the corpus was 100% converted. Again, it must be said that it was of a rather exceptional quality. The Supreme Court of Canada uses great care in the preparation of its documents.
                                                  /-(4)-> Text Iso-Latin1
                                /-(3)-> CSC DTD -<
Original -\                    /                  \-(5)-> HTML (enriched)
WP -->(1)-> RTF -(2)-> X DTD -<
Ms-Word --/                    \-(6)-> Text Iso-Latin1
                                \-(7)-> HTML (standard)
Fig. 4: Overview of the conversion process: (1) word processors’ built-in Save As function; (2) RTF2X program; (3) X2CSC up-conversion program; (4) CSC2TXT down-conversion program; (5) CSC2HTML cross-conversion program; (6) X2TXT down-conversion program. The formats in bold face are those presented to the end-users.
- As we can easily see in Fig. 4, the (3), (4) and (5) programs are DTD specific. Apart from these programs, all the machinery can be used as a standard HTML publishing system for any kind of document. Furthermore, the main difference between (4) and (5) on the one hand and (6) and (7) on the other is that the former, starting from an SGML instance, may produce better formatted text-only or HTML documents; the latter, when combined with the (2) RTF2X program, become a very powerful HTML converter.
SGML data management
- Tagged documents can be stored as flat files, the simplest way of doing it. In that respect, the current project and previous Internet publishing experience have shown us the importance of standardising directory structures, file naming schemes, and hypertext anchors inside documents. Great care was thus used in choosing identifiers at these three levels.
- By their names and organisation, directories reflect the traditional way of referring to Supreme Court decisions. So did the directories of the previous system, but in the new system no capital letters are used and the encoding of the language version information is also standardised (see bold in fig. 5 a)). The file naming scheme has been revised as well: file names now reflect the information contained in traditional references (fig. 5 b)). Finally, in the same way, the naming of link anchors within files has been standardised. Paragraph numbers, used by the Supreme Court since 1995, and page numbers, for decisions belonging to the Charter Collection, are respectively called n.number and p.number. Taken together, good choices at these three levels make automatic hyperlinking possible. With such a naming scheme in our new CSC decisions publishing system, the electronic address of any page can be inferred from the standard judicial reference (fig. 5 c)).
a) Original system: /CSC/arrets/1993/vol3/ascii/creighto.fr.txt
   New system: /csc-scs/fr/pub/1993/vol3/html/1993rcs3_0003.html
b) R. c. Jackson  4 R.C.S. 573 becomes: 1993rcs4_0573.html
c) R.v. Forster,  1 R.C.S., page 345 becomes:
Fig. 5: Comparison of old and new naming schemes
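- The naming scheme of Fig. 5 can be sketched in a few lines. This is a minimal illustration rather than the project's own code: the function names are hypothetical, the four-digit zero padding of the page number is inferred from the two file names shown in the figure, and the exact anchor spelling (`p.` or `n.` followed by the number) is an assumption based on the description of n.number and p.number.

```python
def decision_filename(year, volume, page):
    """Build a file name such as 1993rcs4_0573.html from a reference."""
    # Four-digit padding is inferred from the examples in Fig. 5.
    return f"{year}rcs{volume}_{page:04d}.html"


def decision_path(lang, year, volume, page):
    """Build the server path of a decision in the new directory layout."""
    directory = f"/csc-scs/{lang}/pub/{year}/vol{volume}/html/"
    return directory + decision_filename(year, volume, page)


def decision_address(lang, year, volume, page, target_page=None, paragraph=None):
    """Add an optional page or paragraph anchor (assumed spelling) to the path."""
    path = decision_path(lang, year, volume, page)
    if paragraph is not None:
        return f"{path}#n.{paragraph}"
    if target_page is not None:
        return f"{path}#p.{target_page}"
    return path
```

- For instance, `decision_path("fr", 1993, 3, 3)` reproduces the new-system path of Fig. 5 a), and `decision_filename(1993, 4, 573)` reproduces the file name of Fig. 5 b).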
- It thus becomes possible to automatically weave the said web of links within the CSC decisions corpus. The same method allows for linking to adequately organised statutory materials, such as those of the Department of Justice of Canada, also published on the Internet by the CRDP [Canada-Justice]. At a more general level, such rigorous standardisation of addresses is a sine qua non for a public collaborative space for judicial documentation to evolve. As we will see, future projects may provide even more benefits from the user's perspective as other possibilities of our SGML database come online.
- Tagged documents can also be entrusted to a database management system (DBMS) that can handle SGML, in cases where direct access to elements is necessary, for example to build different views of a document. However, SGML data management with a DBMS is not an effortless proposition. The most natural solution would probably be an object-oriented DBMS or a relational system tailored to SGML document management. Besides being expensive, which is no trivial matter for anyone who wants to give free access to legal information, these systems have so far not delivered the level of performance required for Internet publishing. Indeed, most of these specialised systems are designed as complex document management environments and have not been optimised for transaction speed. Current relational DBMSs deliver the required speed, but using them with SGML requires developing software layers that can manage the hierarchical structures of SGML documents in relational tables.
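- To make the last point concrete, here is one common way such a software layer can map a document hierarchy onto a relational table: each element becomes a row that records its parent and its position among its siblings. This is only a sketch under stated assumptions, not the approach taken in the project. The table layout, the function names, the tag names and the sample values are all hypothetical, and SQLite is used merely because it ships with Python.

```python
import sqlite3


def store_tree(conn, node, doc_id, parent_id=None, position=0):
    """Recursively insert a (tag, text, children) tuple as table rows."""
    tag, text, children = node
    cursor = conn.execute(
        "INSERT INTO element (doc_id, parent_id, position, tag, content) "
        "VALUES (?, ?, ?, ?, ?)",
        (doc_id, parent_id, position, tag, text),
    )
    element_id = cursor.lastrowid
    for index, child in enumerate(children):
        store_tree(conn, child, doc_id, element_id, index)
    return element_id


def elements_by_tag(conn, doc_id, tag):
    """Direct access to every element of one type in one document."""
    rows = conn.execute(
        "SELECT content FROM element WHERE doc_id = ? AND tag = ? ORDER BY id",
        (doc_id, tag),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE element (
           id INTEGER PRIMARY KEY,
           doc_id TEXT NOT NULL,
           parent_id INTEGER REFERENCES element(id),
           position INTEGER NOT NULL,
           tag TEXT NOT NULL,
           content TEXT)"""
)

# Hypothetical sample document; tag names and values are illustrative only.
sample = ("decision", None, [
    ("date", "sample date", []),
    ("bench", "sample judge", []),
    ("abstract", "sample summary", []),
])
root_id = store_tree(conn, sample, "sample-doc")
```

- A call such as `elements_by_tag(conn, "sample-doc", "abstract")` then retrieves one element without reading the whole document, which is the kind of direct access that motivates a DBMS in the first place.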
Internet publishing system
- Once the SGML document collection has been constituted, the specific goals of the information system can be re-examined. The system must yield rich and stable document publishing formats, and offer the end-user a powerful retrieval tool and an interface taking full advantage of the Web’s possibilities.
- Our new SGML-based system promises to be more stable than the previous one, for it is essentially independent of proprietary formats. To fulfil this goal, we retained, along with the ever-evolving proprietary format of the original documents, three formats that are independent of any word processing software: RTF, ISO-Latin 1 and HTML 2.0. Of the three, only RTF isn’t an open standard; however, its widespread use as an exchange format seems to offer a sufficient guarantee of stability. We also wished to offer richer documents, although we didn’t see fit to publish SGML versions directly, given the current capabilities of Web browsers. The new environment is better in many ways. Whereas the only version previously available besides word processor formats was non-accented ASCII text, our new approach enables us to publish HTML files that look very similar to the original version. Other enrichments to the published files have given us a better retrieval system and a more advanced user interface.
- Indeed, using SGML makes the structure of documents explicit and thus yields a more powerful retrieval mechanism [Poulin 97]. Our solution relies in part on an SGML-enabled textual search engine called NaturelNet, capable of indexing fields delimited by SGML tags [Skinner 96; Ardilog]. For the rest, SGML documents tagged according to the CSC DTD may be cross-converted to another DTD that substitutes the HTML tag set for the CSC DTD one, but leaves the field-delimiting SGML tags unchanged. This way, the converted files display adequately in Web browsers, as these conveniently ignore the non-HTML CSC tags. These remaining SGML tags are used to define search fields in the database the corpus has been turned into. Such fields represent elements from the SGML structure of the documents. Thus, fielded search can complement full-text search. For example, one can search for “1990” in the date field, along with “Lamer” in the bench field and “contempt” in the abstract field, a short summary written by CSC staff. This query returns only the four decisions that match the request. Two other improvements to the search function have been made possible by the chosen approach. The new retrieval mechanism indexes the HTML files instead of the ASCII files used in the previous system, so diacritics are now fully supported. Finally, the search terms are highlighted in the retrieved documents and navigational links are provided to enable users to go easily from one hit to another, no small benefit when one considers that some of these documents run to hundreds of pages.
- Lastly, our third objective was to upgrade the user interface. Many new features have made significant improvements possible in that respect also. The new approach first allows hypertext links to be inserted inside published decisions. Indeed, any reference within a decision to another decision in the collection is enriched with a link to that other document. In some cases, as with files belonging to the Canadian Charter collection, it’s even possible to establish a link to the very page referred to. These links are added when converting from SGML to HTML by taking advantage of both the formal references contained in documents and the clarity built into the naming scheme of directories and files.
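- The principle of fielded search over files that mix HTML with field-delimiting tags can be illustrated with a small sketch. It does not reproduce NaturelNet, whose interface is not described here: the field tag names (`date`, `bench`, `abstract`), the function names and the two sample documents are hypothetical, and a real engine would build an index rather than scan each file.

```python
import re


def extract_fields(document, field_tags):
    """Map each field tag to the text found between its start and end tags."""
    fields = {}
    for tag in field_tags:
        name = re.escape(tag)
        match = re.search(rf"<{name}>(.*?)</{name}>", document,
                          re.DOTALL | re.IGNORECASE)
        fields[tag] = match.group(1).strip() if match else ""
    return fields


def fielded_search(documents, criteria):
    """Return ids of documents where every term occurs in its own field."""
    hits = []
    for doc_id, document in documents.items():
        fields = extract_fields(document, list(criteria))
        if all(term.lower() in fields[tag].lower()
               for tag, term in criteria.items()):
            hits.append(doc_id)
    return hits


# Hypothetical cross-converted files: HTML tags plus surviving field tags.
documents = {
    "doc-a": "<h1>Sample A</h1><date>1990</date><bench>Lamer</bench>"
             "<abstract><p>A sample summary about contempt.</p></abstract>",
    "doc-b": "<h1>Sample B</h1><date>1991</date><bench>Lamer</bench>"
             "<abstract><p>Mentions 1990 and contempt in passing.</p></abstract>",
}
hits = fielded_search(documents,
                      {"date": "1990", "bench": "Lamer", "abstract": "contempt"})
```

- The second sample document contains all three terms but has "1990" outside its date field, so only the first one is returned; this is the distinction that a full-text search alone cannot make.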
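- As an illustration of how such links can be derived mechanically, the sketch below turns references found in running text into anchors. It is not the converter used in the project, which runs during the SGML to HTML conversion. It assumes that references appear in the form "[year] volume R.C.S. page" (or S.C.R. in English), that file names use "rcs" in both language versions, and that page numbers are padded to four digits; the names `CITATION` and `link_citations` are hypothetical.

```python
import re

# Assumed reference form: "[1993] 4 R.C.S. 573" (or "S.C.R." in English).
CITATION = re.compile(r"\[(\d{4})\]\s+(\d+)\s+(?:R\.C\.S\.|S\.C\.R\.)\s+(\d+)")


def link_citations(text, lang="fr"):
    """Wrap every recognised reference in a link to the cited decision."""
    def to_link(match):
        year, volume = match.group(1), match.group(2)
        page = int(match.group(3))
        href = (f"/csc-scs/{lang}/pub/{year}/vol{volume}/html/"
                f"{year}rcs{volume}_{page:04d}.html")
        return f'<a href="{href}">{match.group(0)}</a>'
    return CITATION.sub(to_link, text)


linked = link_citations("Voir R. c. Jackson, [1993] 4 R.C.S. 573.")
```

- Because the target address is computed from the reference alone, no table of existing files is consulted; a production converter would presumably also check that the cited decision is actually part of the collection.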
- Returning to our general goal of publishing legal information for free on the Internet, we now assess the progress our SGML system has made towards processing information as automatically as possible. To this end, let’s look at the steps that take a Supreme Court file to its availability on the Web.
- The forwarding of files from the Court to the CRDP was only slightly modified, through a small change to the transfer procedure. A program was developed to support the preparation of files before they are sent by FTP to the CRDP. This program enables Court staff to provide information on the files being sent in a standardised way. We believe that sending this data along with the files themselves will considerably reduce the number of human interventions that have been necessary in the past, as its use by the file processing program at the receiving end will make the process much more robust. The remaining three steps leading to the publication of the files on the Web are: 1) receiving the files, 2) converting them, 3) putting them on the site and indexing them. The receiving program uncompresses the files, reads the data sent by the Court staff and performs a number of checks. Once this validation is done, the program renames the files and puts them in the entry directories for the conversion process. The conversion program is then launched and converts the files, step by step, to the X, SGML, HTML and text formats. If all files are successfully converted, a third program takes them from the output directories of the conversion phase to the appropriate publishing directories, creating these as needed. The same program then updates the site’s navigation pages and launches the indexing of the new files so they can be accessed with the retrieval system. Finally, a message for the Jurinet mailing list is generated and the webmaster is informed that the above operations have been successfully completed. The different components of this processing environment, especially its conversion phase, were broken in during the upgrade of the site in the first months of 1997. In the course of this operation, the corpus was converted and the site built up in a totally automated way.
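- The three steps described above (receiving, converting, publishing and indexing) can be summarised as a small pipeline. The sketch below works on in-memory data and is only meant to show the control flow, in particular that nothing is published unless every conversion succeeds. It is not the project's Perl and Delphi code: every function name, the metadata keys and the stand-in converters are hypothetical.

```python
def receive(batch):
    """Check the metadata sent with each file, then rename the file."""
    received = []
    for item in batch:
        for key in ("year", "volume", "page", "content"):
            if key not in item:
                raise ValueError(f"missing {key} in submitted metadata")
        name = f"{item['year']}rcs{item['volume']}_{item['page']:04d}"
        received.append((name, item["content"]))
    return received


def convert(received, converters):
    """Run every converter on every file; any failure aborts the batch."""
    converted = {}
    for name, content in received:
        for extension, converter in converters.items():
            converted[f"{name}.{extension}"] = converter(content)
    return converted


def publish(converted, site, index):
    """Copy converted files to the site, index them, and report."""
    for filename, content in converted.items():
        site[filename] = content
        index.append(filename)
    return f"{len(converted)} files published"


def process_batch(batch, converters, site, index):
    """Chain the three steps; an exception leaves the site untouched."""
    received = receive(batch)
    converted = convert(received, converters)
    return publish(converted, site, index)


# Stand-in converters; the real ones are the programs of the conversion phase.
converters = {"html": lambda text: f"<p>{text}</p>", "txt": str.upper}
site, index = {}, []
batch = [{"year": 1993, "volume": 4, "page": 573, "content": "sample text"}]
report = process_batch(batch, converters, site, index)
```

- Keeping publication as a separate final step is what allows a failed conversion to be investigated without leaving a half-updated site, and the returned report plays the role of the message sent to the webmaster.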
5. PERSPECTIVE AND CONCLUSION
- In the coming months, a number of projects should enable us to benefit even more from our SGML database. For example, we wish to explore solutions to an old problem linked to the size of decisions. Currently, users cannot confirm the relevance of a decision without downloading its complete file, which is sometimes as large as 400 KB. We would like to enable them to look first at the decision’s summary or indexing terms. We plan to use our SGML base to let the system segment documents arbitrarily, so that users can be shown their most significant portions and quickly evaluate the documents returned by the search engine. Furthermore, the announced arrival of XML leads us to believe it will soon be possible to take full advantage of SGML on the Internet [Bosak 97]. We plan on exploring this path as soon as it opens up. Whatever happens with these future projects, it is our estimate that the investment made in SGML so far is already paying off handsomely.
- Problems due to the instability of publishing formats are now a thing of the past for us, and none of the richness of the published documents has been lost; quite the contrary. Retrieval has also been upgraded in a significant way. Finally, the user interface has been enriched with hypertext links that considerably extend the possibilities for exploring the published database of decisions.
- The authors wish to emphasise the excellent work of their student collaborators: Ernst PERPIGNAND, who programmed the RTF to X converter; Yanick GRIGNON, who adapted the search functions; Benoît ALLAIRE, who programmed the converter for the Supreme Court decisions; and Marc-André MORISSETTE, who wrote the programs for the automated processing of decisions.
- We also wish to thank the FCAR (#96-ER-1557) and FAIQ (94-035) funds for their financial support, as well as the Supreme Court of Canada for giving us access to its files in the interest of public information and the furtherance of research. We also benefited from the support of Omnimark Technologies.
- Conversion: Omnimark Technologies’ Omnimark converter; HTTP Server: Netscape Enterprise Server; Retrieval: Ardilog’s NaturelNet; Automation: Perl 5.0 and Borland Delphi.
Links to described system
http://www.droit.umontreal.ca/doc/csc-scs/en/index.html - Main access
http://www.droit.umontreal.ca/doc/csc-scs/en/reperage.html - Search functions
Original source on murdoch.edu.au.