Sharing Information on the Semantic Web: The Reminiscence of an Old Legal Issue

01.07.2008

“Letting your data connect to other people’s data is […] not about giving to people data which they don’t have a right to. It is about letting it be connected to data from peer sites. It is about letting it be joined to data from other applications. It is about getting excited about connections, rather than nervous.”

Tim Berners-Lee

The success story of open source software (OSS) shows very clearly that in a networked world, centralized production of information is not the only viable model. It is now largely understood that distributed production can often equal or even surpass it, both in quality and in quantity. This has led people in all disciplines to rethink their relationship with information, giving birth to a plethora of initiatives that generate value by promoting the mass collaboration of individuals over shared sets of information. Based on rich Internet applications, wikis, social tagging, or social networking technologies, these initiatives sparked a revolution that has been dubbed Web 2.0. Whether they originate in non-profit or business ventures, they all add to the ever-increasing mass of accessible and reusable information.

It is anticipated that the next step in the evolution of the web will make possible the seamless integration of information hubs that, until now, have developed through independent channels. This development should create tremendous opportunities for those capable of building innovative services and knowledge products on top of this shared knowledge base. In fact, along with the technological foundations of this web of ideas, practical commercial implementations are already starting to appear. However, these early experiments highlight the fact that the most important challenge to overcome might not reside in the technology itself. Instead, the management of rights may, more than anything else, hinder the efficient aggregation of distributed information.

Accessibility, Reusability, and Interoperability

Software developers realized a long time ago that while access to information is one thing, the ability to reuse it is another. For them, binary code and restrictive software licences stood as a solid barrier between the two concepts. In response, some chose to adopt an alternative development model promoting the sharing of source code under permissive software licences. While some of these licences, like the Berkeley Software Distribution (BSD) licence, impose only minimal conditions on reuse, others, such as the GNU General Public License (GPL), go further and secure the openness of the code they cover.

More recently, the open source approach to licensing has been expanded to cover a wider range of contexts. This has resulted in the emergence of hybrid development models offering both the possibility for users to adapt software to their respective needs and the preservation of some restrictions on its circulation. Altogether, these experiences have shown that software is a much more valuable asset when it is reusable. In addition, they have demonstrated that diverse reuse conditions can fit diverse needs and expectations.

To some extent, a similar evolution occurred for other kinds of information circulating over the Internet. Following in the footsteps of OSS developers, innovative entrepreneurs have learned to adapt and expand collaborative development models to produce a large array of information products. This resulted in the creation of impressive information commons such as Wikipedia, media repositories like Flickr and YouTube, social bookmarking systems like del.icio.us and Digg, as well as social networking websites such as Facebook or LinkedIn.

Once again, the development of dedicated licensing schemes has been crucial to this outcome. The Creative Commons (CC) movement, in particular, has been extremely helpful in clarifying the spectrum of rights and reuse conditions that can be attached to shared information. This, in turn, has led to domain-specific licences, such as the Australian Free for Education licence, which allows the free circulation of information for educational purposes while imposing conditions on other types of reuse. The experiences of the last few years have shown how businesses can thrive on accessible information by promoting different forms of reusability.

As a consequence, the volume of information accessible on the Internet under technical and legal conditions that make reuse possible is growing at an incredible pace. Until very recently, the collaborative initiatives driving this transformation have evolved independently from each other. While pictures uploaded by Flickr users are distributed under permissive terms, their reuse is still mostly limited to other users of the same service. The same can be said of all the flagship initiatives of the Web 2.0 revolution. Because the information flow of these web services has been limited to the vertical direction, most of the accessible data is now compartmentalized into separate information silos. While silo construction might have been inevitable, reusing the information they contain calls for greater reciprocity between sources of information. For this goal to be achieved, a horizontal information flow must complement the current one. Data from one source must become mixable with data from other sources through various layers of services. Interoperability, it is said, is the key to this puzzle.

The World Wide Web Consortium (W3C) has been tackling this issue for several years. Thanks to its efforts, technical solutions making interoperability possible are widely available and documented. They include data modeling languages like XML and RDF, syndication technologies such as RSS, and ontology standards like OWL. The W3C hopes to encourage web developers to annotate the information they disseminate, creating a computer-readable web parallel to the current human-readable one. Doing so would give a comprehensible meaning to data, allowing its dynamic discovery, rearrangement, and execution. This is what has come to be known as the web of ideas, or Semantic Web, which was recently renamed the Giant Global Graph by Tim Berners-Lee. Unfortunately, while the idea of a fully semantic web is appealing in theory, its practical implementation has mostly been limited to the academic field.
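To give a concrete flavour of what such annotation looks like, the short sketch below (an illustrative addition, not part of the original article) uses the Python rdflib library to describe a single, hypothetical resource in RDF, attach a licence to it, and serialize the result as Turtle so that other applications could discover and reuse it. The resource URL and descriptive values are placeholders.

```python
# Illustrative sketch: describing one hypothetical web resource in RDF with rdflib.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DCTERMS, FOAF, RDF

g = Graph()

doc = URIRef("http://example.org/photos/1234")                    # placeholder resource
licence = URIRef("http://creativecommons.org/licenses/by/3.0/")   # licence it is offered under

g.add((doc, RDF.type, FOAF.Image))                                # what kind of thing it is
g.add((doc, DCTERMS.title, Literal("Sunset over the harbour")))   # human-readable title
g.add((doc, DCTERMS.creator, Literal("A. Photographer")))         # who made it
g.add((doc, DCTERMS.license, licence))                            # machine-readable licence link

# Serialize as Turtle so crawlers and other applications can read the annotations.
print(g.serialize(format="turtle"))
```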

The approach taken by industry has been slightly different. Using heuristic or text-recognition technologies, businesses aiming to leverage semantics to gather accessible information from around the web have started to appear. Focusing on partial sets of data from specific fields, they manage to recognize limited ranges of concepts and to associate data from distinct sources accordingly. Over the last couple of years, this form of limited semantics has given birth to practical applications. The first wave revolved around specialized search engines, such as Spock. The second wave seems to be oriented toward shortcuts, or the analysis of content to quickly deliver additional information. Yahoo! Shortcuts and Lingospot stand out as promising initiatives in this category. By using artificial intelligence to automatically create links between distributed data, all of these services partially circumvent the requirement for technical interoperability.

Whether they ultimately take one form or the other, the role and scope of semantic technologies are bound to increase over the next few years. Under their influence, it can be envisioned that the aggregation and integration of the large volume of information that is already accessible and reusable will soon become technically possible. Undoubtedly, this outcome could generate completely new markets for information products, taking advantage of everything that flows over the networks.

Entrepreneurs currently involved in the development of semantic technologies are realizing that interoperability has more than one side. In addition to understanding the meaning of the data, they are increasingly confronted with the necessity of understanding the legal conditions attached to it. Indeed, the diversity of the restrictions that the gigantic number of existing copyright notices and licences imposes on the reuse of accessible information is the most important obstacle to its aggregation. Because of this challenge, automated reuse of information originating on the web needs to be limited to preselected sources that can be trusted. Otherwise, reproduction of the content must be avoided. As odd as it may seem, the old issue of licence proliferation is coming back to haunt this next generation of technologies.

The Fragmentation of Rights

Web content licences, just like software licences, are found in an ever-increasing number of forms, for the simple reason that copyright holders are free to control the reproduction of their works as they see fit. The wide range of diverging motivations, commercial interests, and business strategies has made the fragmentation of rights into a virtually infinite set of reuse conditions inevitable. The specifics of the various formats under which online content can be distributed, as well as the existence of distinct domains of application, have also contributed to the phenomenon.

Moreover, the need to adapt licences to the context of various jurisdictions that often have conflicting legal requirements has created an additional layer of complexity over the licensing landscape. In the end, the difficulty of managing the resulting diversity of possible terms and conditions is amplified by the fragmentation of rights down to the smallest elements of information. Reuse restrictions are not necessarily attached to entire websites, or even to specific web pages: distinct licences can potentially govern every bit of data they disseminate.

The difficulties generated by this situation are not fundamental as long as humans are fully in charge of the reuse of information. However, the efficient aggregation and integration of distributed information and the successful implementation of a semantic web require computers to manage this process, at least partially. To achieve this, they first need the capacity to retrieve the applicable licences for the available information. Second, they require a mechanism to resolve the actual meaning of these licences. Third, they must be capable of selecting only the information disseminated under adequate conditions for the anticipated reuse. If any of these three operations proves impossible to automate, it is probable that the recent innovations in the field of semantic technologies will never reach their full potential.
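To make these three operations more concrete, the sketch below is a purely hypothetical illustration added to this text: none of the functions, licence terms, or data structures come from an existing system. Licence links retrieved alongside the data are resolved against a small table of terms, and only the items whose terms permit the intended reuse are kept.

```python
# Hypothetical sketch of the three automated steps described above:
# (1) retrieve the licence attached to each item, (2) resolve its terms,
# (3) select only the items whose terms allow the intended reuse.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:
    url: str
    licence_url: Optional[str]   # licence discovered alongside the data, if any (step 1)

# Step 2: a deliberately tiny table resolving licence URLs into comparable terms.
LICENCE_TERMS = {
    "http://creativecommons.org/licenses/by/3.0/":       {"commercial": True,  "derivatives": True},
    "http://creativecommons.org/licenses/by-nc-nd/3.0/": {"commercial": False, "derivatives": False},
}

def select_reusable(items, need_commercial=False, need_derivatives=False):
    """Step 3: keep only the items whose resolved terms fit the anticipated reuse."""
    selected = []
    for item in items:
        terms = LICENCE_TERMS.get(item.licence_url)   # step 2: resolve the licence
        if terms is None:
            continue                                  # unknown or missing licence: skip to stay safe
        if need_commercial and not terms["commercial"]:
            continue
        if need_derivatives and not terms["derivatives"]:
            continue
        selected.append(item)
    return selected

# Step 1 would normally be performed by a crawler; here the items are faked.
items = [
    Item("http://example.org/a", "http://creativecommons.org/licenses/by/3.0/"),
    Item("http://example.org/b", "http://creativecommons.org/licenses/by-nc-nd/3.0/"),
    Item("http://example.org/c", None),
]
print([i.url for i in select_reusable(items, need_derivatives=True)])   # ['http://example.org/a']
```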

While the possibility of aggregating reusable information from the web once again puts forward the problem of the fragmentation of rights, that problem has been addressed by several organizations in the past. The Free Software Foundation has repeatedly warned OSS developers against the threat that the proliferation of licences poses to the compatibility of source code. Specialized products such as those developed by Black Duck Software are specifically designed to address this issue.

CC has promoted the most effective measure against proliferation through its set of standardized licences. By encouraging web developers to embed licensing information into their content, CC eases its retrieval by crawlers and other web robots. By providing a computer-readable version of each licence, it makes the resolution of their terms possible. By standardizing terminology and keeping the number of licences low, it also facilitates any subsequent selection made by third parties. For all of these reasons, content distributed under CC licences has been central to the aggregation efforts undertaken up to now.

Although the numerous merits of the CC approach cannot be challenged, it does not entirely solve the issue generated by the fragmentation of rights. While a growth in web content covered by its licences certainly increases the volume of information available for aggregation, it does little to deal with the mass of reusable data that is not (and often cannot be) distributed under a CC licence. The fact that copyright holders have the right to attach alternative restrictions to the circulation of their works, coupled with the understandable policy of CC not to automatically accept every new licence proposal, accounts for the need to develop a more encompassing solution. The best illustration of the limitations resulting from this situation is probably the Google search engine feature entitled usage rights: because it relies entirely on CC tagging of web pages, it completely ignores all of the text Wikipedia has made available under the terms of the GNU Free Documentation License. It is precisely to fix this problem that a higher-level resolution mechanism is required.
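The retrieval step mentioned above typically relies on licence links embedded directly in the page markup, such as the rel="license" convention promoted by CC. The sketch below is an illustrative addition showing how a crawler might collect such links using only the Python standard library; the sample page is fabricated.

```python
# Minimal sketch: collecting rel="license" links from an HTML page.
from html.parser import HTMLParser

class LicenceLinkExtractor(HTMLParser):
    """Collects href values of <a> and <link> tags carrying rel="license"."""
    def __init__(self):
        super().__init__()
        self.licences = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        if tag in ("a", "link") and "license" in rels and attrs.get("href"):
            self.licences.append(attrs["href"])

# Fabricated sample page carrying an embedded licence notice.
sample_page = """
<html><body>
  <p>Photo by A. Photographer,
     <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a>.
  </p>
</body></html>
"""

parser = LicenceLinkExtractor()
parser.feed(sample_page)
print(parser.licences)   # ['http://creativecommons.org/licenses/by/3.0/']
```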

A Global Licence Repository?

Can the CC vision of a lawyer-readable, human-readable, and computer-readable version of copyright-related information be expanded to all licences covering content circulating over the Internet? With the legal code of the relevant licences already accessible online, and their standardization out of the question, the only workable solution might lie in the conception and implementation of a database of licences and their respective conditions. Organizing the multitude of licences under a single template would allow for the streamlining of their resolution and selection. It would also allow for the development of a web service that could be queried indifferently by users and computers.

Obviously, a large number of obstacles may prevent the completion of such a repository. The large number of reuse conditions, as well as the numerous format- and domain-specific restrictions, are certainly barriers. Issues related to the internationalization and versioning of licences are another. In addition, managing compatibilities between licences in order to make relicensing possible can prove to be a daunting task. Nevertheless, paths can be imagined to circumvent each of these obstacles. Conditions and restrictions could be organized into groups or categories. Licences could be managed at the lowest possible level and related ones associated together. The designation of compatibility could be limited to the most common licences.
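What such a single template might look like is sketched below. This is an assumption added for illustration: the field names, the condition and permission categories, and the sample record are invented, and a real repository would need a far richer model.

```python
# Purely illustrative sketch of one record in a hypothetical licence repository,
# organized under a single template as suggested above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class LicenceRecord:
    licence_id: str                                            # short canonical identifier
    name: str                                                  # human-readable name
    url: str                                                   # location of the legal code
    version: str                                               # explicit versioning
    jurisdiction: str                                          # "unported" or a country code
    conditions: List[str] = field(default_factory=list)        # e.g. "attribution", "share-alike"
    permits: List[str] = field(default_factory=list)           # e.g. "reproduction", "derivatives"
    compatible_with: List[str] = field(default_factory=list)   # relicensing targets, kept to common licences

REPOSITORY = {
    "cc-by-3.0": LicenceRecord(
        licence_id="cc-by-3.0",
        name="Creative Commons Attribution 3.0",
        url="http://creativecommons.org/licenses/by/3.0/",
        version="3.0",
        jurisdiction="unported",
        conditions=["attribution"],
        permits=["reproduction", "distribution", "derivatives", "commercial-use"],
    ),
}

def can_reuse(licence_id: str, intended_uses: List[str]) -> bool:
    """Answer the kind of query that users and computers could submit to the repository."""
    record = REPOSITORY.get(licence_id)
    return record is not None and all(use in record.permits for use in intended_uses)

print(can_reuse("cc-by-3.0", ["derivatives", "commercial-use"]))   # True
```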

Notwithstanding its design, the successful implementation of a global licence repository would also depend on the proper interaction of several key elements. The large-scale adoption of a tagging model allowing the effective detection of licensing information by content aggregators is one of the most important. The involvement of a community of users in feeding and updating the database is another, as central management would be impossible to achieve.

The nature of the data involved also makes it necessary to generate trust in the system by ensuring transparency and adequate quality control procedures. Finally, widespread use of the repository will only occur if its outputs are provided in a wide range of standardized formats matching the requirements of extremely diverse users, in conjunction with simple communication tools facilitating interaction with the system.

Conclusion

Although this proposal would have sounded like an extremely ambitious undertaking only a few years ago, OSS and other collaborative initiatives have demonstrated that such efforts can be successfully distributed and managed. Automating the management of licensing information will require substantial investments of knowledge and energy by a broad range of players. Ultimately, it will need to be done for the web to reach its next phase of evolution. Otherwise, the fragmentation of rights will continue to prevent technologies that allow the dynamic discovery of data from ever fulfilling their promise of opening up the large-scale reuse of distributed information.

Source: Timreview