Chapter 4

From The SDI Cookbook

Chapter 4: Geospatial Data Catalogue: Making data discoverable

Jeroen Ticheler, GeoCat and Open Source Geospatial Foundation Jeroen {[at]} Ticheler.net

Introduction

An increasing volume of information is now considered critical to everyday decision making in modern society -- a large portion of this information is essentially related to "place" in the context of position on the Earth. As more online information includes some geographic context, the ability to describe, organise, and access it has become increasingly difficult. The ability to discover and access geographic data resources for use in visualisation, planning, and decision support is a requirement to support societies at the local, regional, national, and international levels. Common solutions have been developed and will be described in this Chapter by evaluating organisational approaches, comparing definitions of community, identifying common architectural solutions, and sharing a base of techniques that are implemented in available noncommercial and commercial standards-based software.

This Chapter presents the concepts, current practices, and designs for geospatial data discovery. It is intended as a guide to those interested in the management, development, and implementation of compatible discovery services in environments where the cross-domain publication of geographic information is desired. Organisational issues and roles are presented that are critical to the understanding and maintenance of the services within a larger spatial data infrastructure. The principles described herein can be interpreted and applied in a range of information management conditions from non-digital collections of map information, through small digital catalogues, to integrated repositories of data and metadata. Relevant standards and software are identified for evaluation and application.

Context and Rationale

Although the Internet is becoming the world’s largest repository of knowledge, its navigation is hindered by the lack of a surrogate and comprehensive catalogue. As a result, one is delivered tens of thousands of candidate documents in response to a reasonable query from today’s search engines. Fortunately, geographic information frequently has signatures of location in the form of coordinates or place names and even may have a reference date or time associated with the data. These metadata provide a key to a solution that can and does operate in an international context.

The library has long formed the primary metaphor for the accumulation and management of knowledge about people, places, and things. From the ancient library in Alexandria, Egypt, to its modern-day equivalents, libraries have applied classification systems, specialisation, and discipline to information in all forms. A central feature of this virtual library, and a critical part of its navigation and use, is the catalogue. In the context of geospatial information management, we use the descriptions of geospatial data, or metadata, as described in Chapter 2 as the common vocabulary to frame the structured fields of information that we seek to manage and to use in search and retrieval. These metadata elements are stored and served through a user-accessible catalogue of geospatial information.

Support of a discovery and access service for geospatial information is known variously within the geospatial community as "catalogue services" (OpenGIS Consortium), "Spatial Data Directory" (Australian Spatial Data Infrastructure), and "Clearinghouse" and the "Geospatial One-Stop Portal" (U.S. FGDC). Although they have different names, their goal is the same: discovering geospatial data through the metadata properties they report. For the purpose of consistency within this document, these services will be referred to as "catalogue services." Further integration of these services with web mapping, live access to spatial data, and additional services can lead to exciting user environments in which data can be discovered, evaluated, fused, and used in problem-solving. Whereas this chapter will focus on finding spatial data and services, combining the practices described here with those in other chapters can expand the capabilities of your spatial data infrastructure.

Distributed Catalogue Concepts

The Catalogue Gateway and its user interface allow a user to query distributed collections of geospatial information through their metadata descriptions. This geospatial information may take the form of "data" or of services available to interact with geospatial data, described with complementary forms of metadata. Figure 4.1 shows the basic interactions of various individuals or organisations involved in the advertising and discovery of spatial data. The boxes are identifiable components of the distributed catalogue service; the lines that connect the boxes illustrate a specific set of interactions described by the words next to the line.

A user interested in locating geospatial information uses a search user interface and fills out a search form, specifying queries for data with certain properties. The search request is passed to the Catalogue Gateway, which poses the query to one or more registered catalogue servers. Each catalogue server manages a collection of metadata entries. Within the metadata entries there are instructions on how to access the spatial data being described. A variety of user interfaces for this type of Catalogue search are available in various national and regional SDIs around the world. Interoperable search across international Catalogues can be achieved through use of a common descriptive vocabulary (metadata), a common search and retrieval protocol, and a registration system for servers of metadata collections.

Figure 4.1 - Interaction diagram showing basic usage of distributed catalog services and related SDI elements from a user point of view.

The Distributed Catalogue environment is more than just a catalogue of locator records. The Distributed Catalogue includes reference and/or access to data, ordering mechanisms, map graphics for data browsing, and other detailed use information that is provided through the metadata entries. This metadata acts in three roles: 1) documenting the location of the information, 2) documenting the content and structures of the information, and 3) providing the end-user with detailed information on its appropriate use. A traditional catalogue, as found in the modern library, provides only locational information. In the era of digital data, the edges between the data or services and the catalogue can become blurred and permit the management of extended information called metadata that can be exploited by computer software and human eyes for many uses.

Organisational Approach

Who are the individuals or actors involved in the publication and discovery of geospatial information? By defining the roles and responsibilities that these actors play, one can understand the essential functions that human or computer-assisted services should be able to conduct in the interest of resource discovery for the GSDI.

Terminology:

Data Set - a specific packaging of geospatial information provided by a data producer or software, also known as a feature collection, image, or coverage.

Metadata - a formalised set of descriptive properties that is shared by a community to include guidance on expected structures, definitions, repeatability, and conditionality of elements.

Metadata Entry - a set of metadata that pertains specifically to a Data Set.

Catalogue - a single collection of Metadata Entries that is managed together.

Catalogue Service - a service that responds to requests for metadata in a Catalogue that comply with certain browse or search criteria.

Catalogue Entry - a single Metadata Entry made accessible through a Catalogue Service or stored in a Catalogue.

Service Entry - the metadata for an invokable service or operation, also known as operation or service metadata.
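
The terms above can be pictured as a small data model. The following is a minimal sketch in Python, assuming illustrative class and field names (MetadataEntry, Catalogue, CatalogueService, dataset_id, title, keywords) that are not taken from any standard or from this Cookbook.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MetadataEntry:
    """Metadata that pertains to one Data Set (field names are illustrative)."""
    dataset_id: str
    title: str
    keywords: list[str] = field(default_factory=list)


@dataclass
class Catalogue:
    """A single collection of Metadata Entries that is managed together."""
    entries: dict[str, MetadataEntry] = field(default_factory=dict)

    def add(self, entry: MetadataEntry) -> None:
        # One Catalogue Entry per data set; adding again replaces the old entry.
        self.entries[entry.dataset_id] = entry


class CatalogueService:
    """Responds to requests for entries that match a search criterion."""

    def __init__(self, catalogue: Catalogue) -> None:
        self.catalogue = catalogue

    def search(self, keyword: str) -> list[MetadataEntry]:
        # Case-insensitive exact match against each entry's keywords.
        wanted = keyword.lower()
        return [
            entry for entry in self.catalogue.entries.values()
            if wanted in (k.lower() for k in entry.keywords)
        ]


catalogue = Catalogue()
catalogue.add(MetadataEntry("roads-2001", "National road network", ["transport", "roads"]))
service = CatalogueService(catalogue)
hits = service.search("Roads")
```

A Service Entry could be modelled the same way, with fields describing an invokable service rather than a data set.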

Roles

Figure 4.2 shows interactions between the Actors, the functions they perform, and the objects they interact with. The illustration uses Unified Modeling Language (UML) notation to picture processes from a functional point of view.

Originator of the Metadata Entry -- The responsibility of this Actor is to generate conformant metadata elements packaged so they accurately reflect the contents of the information being described. The role and credentials of the person responsible for the creation of this metadata may vary among organisations. In some situations the originator may be the scientist involved in building the data set being described. In others, the originator may be a contractor or second party who was directed to create the data or the metadata based on some project requirements, or it may be a generic description created by a production-oriented organisation without mention of the names of individuals involved in its creation. Because formal metadata are still rare, it is also a common practice for a third party to interpret or derive a metadata entry from available information where formal metadata has not yet been created.

Contributor to the Catalogue -- The responsibility of this Actor is to provide one or more conformant metadata entries to a Catalogue. Metadata entries may be delivered in proper format, derived from other formats, or developed from information stored in data and software systems. S/he interacts with the management functions of the Catalogue Service that permit a metadata entry to be entered, updated, deleted, or to assign levels of access or viewing privilege.

Catalogue Administrator -- The responsibility of the catalogue administrator, or custodian, is to manage the metadata for access by the Users. The administrator may be the same as the contributor, may be a collecting unit acting on the authority of an entire organisation (e.g. librarian or web site content manager), or may be a different party who has acquired metadata in some form and is providing public access to it. The administrator authorises access to the Catalogue Service for management functions including entry, update, or deletion, manages authorisation details, and may perform some quality assurance evaluation on entries. The administrator may also manage external (client) access to the Catalogue if it is not publicly accessible.

Catalogue User -- The responsibility of this user is to define criteria by which geographically related information can be located and used, either by navigating Browse categories or by posing a fielded or full-text query. This user may or may not be GIS-literate; on the Internet, the user is likely to be unfamiliar with, or not to possess, GIS or image processing software. This User may have a weak understanding of geography. Another common method of catalogue access is through a program that discovers and works with Catalogue information. The interaction then occurs at the software level and assumes a documented interface (e.g. application programming interface) for submitting requests to and receiving responses from a Catalogue.

Gateway Manager -- The responsibility of the manager is to develop, host, and maintain the distributed search capabilities within the user community. This may also include management of or contribution to a directory of servers (registry) that participate in the national or regional SDI.

Figure 4.2 - Interaction diagram showing basic usage of catalog services and related SDI elements.

Using the actors from Figure 4.2 as described in the text, the following sections will describe the organisational or operational management requirements for distributed catalogue services compatible with the GSDI based on the following areas of interest:

  • Catalogue Service development
  • Catalogue gateway and access interfaces
  • Registering participants

Each section will include a Use Case to focus on the roles and actions that should be considered in creating a discovery component of your SDI.

Catalogue Server/Service Development

The Distributed Catalogue services assume some degree of distributed ownership and participation. Similar activities on the Internet have taken a fully centralised approach to metadata management by placing all metadata in an index on one server, or in several replicated servers. In an increasingly dynamic data management environment, the synchronisation between detailed metadata and such an index becomes increasingly difficult. This problem is experienced on a daily basis when conducting searches on Web search engines and getting a “404: File not found” error when a document has been moved or changed. In addition we are seeing a migration toward treating metadata and data as interrelated and even being managed together within a single database. To duplicate this metadata in an external index can be costly and invites problems with synchronisation of the data, its metadata, and the externally indexed metadata. Organisations who already manage spatial data and are interested in publishing it are often the most capable candidates for publishing and maintaining the metadata. Metadata co-located with data on a server tend to be more current and detailed than metadata published to an external index (harvested and indexed off-site).

The construction of a catalogue service capability for geospatial information is built upon the commitment to collect and manage some level of geospatial metadata within an organisation. The following Use Case scenario describes the publishing of a metadata entry.

A contributor of metadata receives the description of a new spatial data set developed by other professional staff. This metadata is generated in a transferable encoding format to allow exchange of the metadata without loss of context or information content. This metadata entry is passed to a catalogue administrator for consideration and loading to the catalogue. The catalogue administrator applies any acceptance criteria on the quality of the metadata as required by the organisation. If the metadata are acceptable, they are inserted into the catalogue. The catalogue administrator then updates the catalogue to reflect the new entry as available for public access. This data set is now considered advertised because its metadata provide a searchable and browseable record of its background, its temporal and spatial extent, and many other searchable characteristics.
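
The publishing Use Case above can be sketched as a small acceptance step in front of the catalogue. The required fields and the function names below are assumptions made for illustration; an organisation would substitute its own acceptance criteria.

```python
# Hypothetical acceptance criteria: fields every entry must carry.
REQUIRED_FIELDS = ("title", "abstract", "bounding_box", "date")


def missing_fields(entry: dict) -> list:
    """Return the required fields that are absent or empty in a metadata entry."""
    return [name for name in REQUIRED_FIELDS if not entry.get(name)]


def publish(catalogue: dict, entry_id: str, entry: dict) -> bool:
    """Insert the entry and mark it public only if it passes the acceptance check."""
    if missing_fields(entry):
        return False  # handed back to the contributor for correction
    catalogue[entry_id] = {**entry, "public": True}
    return True


catalogue = {}
accepted = publish(catalogue, "soils-01", {
    "title": "Soil map",
    "abstract": "Soil classes at 1:250 000",
    "bounding_box": (33.9, -4.7, 41.9, 5.5),
    "date": "2001-06-30",
})
rejected = publish(catalogue, "draft-02", {"title": "Untitled draft"})
```

Keeping the check separate from the insert lets the administrator report exactly which elements a contributor still has to supply.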

There are several models for where Catalogue services might be installed within or among organisations. Generally speaking, a catalogue server is usually installed at the level of organisation appropriate to the nature of the data or metadata, the organisational context or mandates, and the level at which a catalogue can be operationally supported.

Consortium Approach -- The consortium model is one where a single metadata catalogue is built and operated at one location and is shared by multiple organisations with a common discipline or geographic context. Metadata are exported from contributors and are forwarded to the common site where they may be evaluated, loaded, and made publicly accessible. This model may work well where there are personnel and computer access constraints and a shared service provides or extends outreach. The consortium approach also encourages collaboration between participants in building a collective data and metadata resource base across the organisations. The liabilities of this approach may include managing complexity and contributions from many sources and being sure that metadata provided stay synchronised with the data being described. Data might not be co-located with the catalogue service but may be referred to at contributor locations.

Corporate Approach -- The corporate model assumes that all metadata are forwarded within an organisation to a single service at which time corporate issues of quality, publication, style, and content may be evaluated. This model allows personnel and networking resources to be focused on developing and managing a single service and computer within an organisation. Some degree of policy must be established within the organisation for the collection and propagation of the metadata to the corporate host. This model is well-suited to organisations who may be restricted to providing a single public access computer for security reasons. The liabilities of this approach may include managing contributions from many sources within the organisation and being sure that metadata provided stay synchronised with the data being described. Data may be co-located with the catalogue service or may be referred to at contributor locations.

Workgroup Approach -- The workgroup model assumes that a service would be established at each place within an organisation where data are collected, documented, managed, and served. This follows the trend on the Internet in which virtually anyone on a connected network can be considered a "publisher" of information. The workgroup model also assumes that the individuals and groups most closely associated with the collection and revision of the information are also involved in its catalogue and service. This can lead to a high degree of synchronisation between the data and their metadata -- in some cases, data and metadata warehouses could be completely integrated. The liabilities of this approach may include technical expertise in catalogues at the local level and coordination issues across a given organisation.

Because of the nature of the distributed catalogue and its ability to search many servers, all of the suggested models listed are equally viable. In fact, close reading of the model descriptions will show that they represent a continuum of organisational choices that vary in complexity, governance, and the degree of integration with the data being described.

Alternative Approaches

The operational design of a distributed catalogue as described above depends in large part on the ability of clients to use the proposed services. Globally, access to computers and communications networks supporting Web applications is still available to only a small minority of the population. While this is changing in almost all regions through community public access points and the building and subsidising of network construction and interconnection, the distributed catalogue may not be well suited to conditions in many developed and developing countries where the Internet is not yet common or bandwidth is lacking. Several alternative solutions have been prototyped and are suitable for public information access in such environments.

For organisations and clientele who have limited access to computers or networks, metadata can be reprocessed and printed and distributed as paper catalogues. Printing and distribution costs may be significant but a wide audience can be reached through public libraries and organisations interested in using spatial data in decision making. Synchronisation with current data content and holdings in such paper catalogues may also be an issue. Paper distribution of catalogues can always be considered a supplement to digital information service methods.

If Internet services are present and available to the public but network bandwidth within the region of interest is limited, individual catalogues may wish to support harvesting of metadata from remote sites into "mirror" catalogues, or "metadata caches". A good example would be supporting regional data discovery across multiple servers in different locations whose connections are low-speed. If each catalogue posted its metadata in a Web-accessible directory, a crawler or harvester program could retrieve and index metadata from other sites into a regional or replicate index. This methodology is being demonstrated in the United States to provide a single synchronised point of access to metadata that are fetched from a small to moderate number of sites. Note that the combined collection still sits behind a server with a common interface, but potentially fewer standing servers are required in this architecture. At the extreme end of this design one could envision a few large metadata repositories with common search interfaces. Primary concerns about the scalability of this approach include supporting extremely large searchable metadata indexes and the synchronisation of the indexes with remotely held metadata and data. It is not likely that this approach would scale to support a single global collection of metadata using current technologies; Web search engines are capable of searches at that scale but lack geographic awareness.
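
The harvesting idea can be sketched as a synchronisation loop over the listings that each site publishes. This is a minimal sketch under stated assumptions: the listing format (record id mapped to a last-modified stamp and a dictionary of fields) and the function name harvest are hypothetical, and the network fetch is replaced by a caller-supplied function so the logic can be followed without a live server.

```python
from typing import Callable, Dict, Tuple

Record = Tuple[str, dict]    # (last-modified stamp, metadata fields)
Listing = Dict[str, Record]  # record id -> record


def harvest(sites: Dict[str, Callable[[], Listing]],
            mirror: Dict[str, Record]) -> Dict[str, int]:
    """Synchronise a mirror index with the listings published by remote sites."""
    stats = {"added": 0, "updated": 0, "removed": 0}
    seen = set()
    for site, fetch_listing in sites.items():
        for record_id, (stamp, fields) in fetch_listing().items():
            key = f"{site}/{record_id}"
            seen.add(key)
            if key not in mirror:
                stats["added"] += 1
            elif mirror[key][0] != stamp:
                stats["updated"] += 1
            else:
                continue  # unchanged since the last harvest
            mirror[key] = (stamp, fields)
    for key in set(mirror) - seen:
        del mirror[key]  # no longer published at its source
        stats["removed"] += 1
    return stats


listing = {"r1": ("2001-01-01", {"title": "Roads"})}
mirror: Dict[str, Record] = {}
first_run = harvest({"siteA": lambda: listing}, mirror)
```

Comparing modification stamps keeps repeat harvests cheap over low-speed links. A production harvester would also need to tell a failed fetch apart from a genuine deletion, which this sketch does not attempt.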

In environments where both data providers and clients have access to computers but not reliable networks, the creation of CD-ROM or DVD media with searchable metadata (and perhaps even data) is another outreach mechanism. Creation of digital media with metadata and data will be of greatest benefit where standard metadata and data approaches are followed, and a catalogue (software and data) could be placed on the media to minimise the cost of deployment where a catalogue already exists.

These alternatives should be viewed as approaches that supplement the catalogue services recommendations described in this Chapter until such time as the information is accessible to the majority of intended clients via the Internet. Use of the catalogue services will immediately enable international academic, commercial, and governmental use of such information for regional analysis issues.

Catalogue Gateway and Access Interface Development

Within a given geographic or discipline-based community, the need will exist to build relevant search capabilities that facilitate intuitive search across many servers. This problem can be divided into two parts that must work together -- a user interface (Search/Browse Interface, Figure 4.2) and a query distributor (Catalogue/Gateway Portal, Figure 4.2). When performed across the Internet, these functions may be logically deployed in different locations, although they tend to be coupled together in server-based or client-based search solutions.

Figure 4.3 - Configuration options for Gateway and User Interfaces to Distributed Catalog

Figure 4.3 shows the possible configurations of a catalogue gateway and the user interface. Client A accesses a user interface that is downloaded (as forms or an applet) from a host on the Internet that is also managing multiple connections to servers. Client B is accessing a user interface from a location that is different from that of the Gateway, an arrangement that supports the construction of customised user interfaces for a community. Client C is a client-side "desktop" application that is fully self-contained and includes the user interface and distributed query capabilities for direct connection to remote servers. What is not shown in this diagram is the dependence on or reference to a registry or Directory of Servers, as shown in Figure 4.2, which is further explained in the next section. All three styles of interaction are known to exist in various SDIs. Because they all depend upon distributed catalogue servers, the three approaches are fully compatible.

Two styles of interaction are known to exist in Web search interfaces that are equally well applied to distributed catalogue access. The first style is query, in which the user specifies search criteria using simple to advanced interfaces. The second style is a browse interface in which the user is presented with categories of information and selects paths or groupings, often in hierarchical form, to traverse.

The search approach to interaction with distributed catalogues can provide extra precision for advanced users in selecting spatial data of interest. It is often used iteratively to discover what effect individual parts of a query have on the pattern of results returned. The browse approach has great appeal to novice users who may wish to navigate by reference without knowing proper search words or fields a priori. The challenge of constructing and supporting a browse mechanism across a global collection of servers is the work required to build and support a universal vocabulary for classification and its hierarchy or word space, known as an ontology. As this service lies at the intersection of many disciplines of interest, the construction of a single classification system is an extremely daunting and improbable task. Intelligent classification systems that are run externally on collections using neural networks, Bayesian probabilities, and other estimates of "context" may be available in the coming years to help users navigate through heterogeneous geospatial information.

A Use Case scenario for a query user is as follows:

  1. A User uses client software to discover that a distributed catalogue search service exists.
  2. User opens the user interface and assembles the query elements required to narrow down a search of available information.
  3. The search request is passed to one or more servers based on user requirements through a gateway function. The search may be iterative, repeating or refining queries based on new interactions with the user.
  4. Results are returned from each server and are collated and presented to the User. Types of response styles may include: a list of "hits" in title and link format, a brief formatting of information, or a full presentation of metadata. Visualisation of multiple results may also be available through display of data set locations on a map, thematic groupings, or temporal extent.
  5. User selects the relevant metadata entry by name or reference and selects the presentation content (brief, full, other) and the format (HTML, XML, Text, other) for further review.
  6. User decides whether to acquire the data set through linkages in the metadata. By clicking on embedded Uniform Resource Locators (URLs) the user can directly access online ordering or downloadable resources, whereas distribution information lists alternate forms of access.

A Use Case scenario for a browse user is as follows:

  1. A User uses client software to discover that a distributed catalogue search service exists. This may be done through a search of Web resources, a saved bookmark, reference from a referring page, or word-of-mouth referral.
  2. User opens the user interface and selects the parameters required to narrow down a search of available information based on topics/subjects, organisations, geographic location, or other criteria. Parameters are usually grouped into hierarchies for the user to navigate.
  3. Requests are made to each server through a distributed request mechanism.
  4. Results from each server are collated and presented to the User. Form of organisation of results is controlled by the user interface and gateway collaboration to present a uniform result space.
  5. User selects the relevant metadata entry by name or reference and selects the presentation content (brief, full, other) and the format (HTML, XML, Text, other) for further review.
  6. User decides whether to acquire the data set through linkages in the metadata. By clicking on embedded Uniform Resource Locators (URLs) the user can directly access online ordering or downloadable resources, whereas distribution information lists alternate forms of access.
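
Steps 3 and 4 of both scenarios, in which the gateway distributes a request and collates the responses, can be sketched as a parallel fan-out. This is a minimal sketch, not a description of any particular gateway product: each remote catalogue server is represented by a plain Python function, and the names distributed_search, server, and title are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Stand-in for one remote catalogue server: takes a query, returns hit records.
SearchFn = Callable[[str], List[dict]]


def distributed_search(servers: Dict[str, SearchFn], query: str) -> List[dict]:
    """Send one query to every server in parallel and collate the hits."""

    def ask(item):
        name, search = item
        try:
            # Tag each hit with the server that returned it.
            return [{**hit, "server": name} for hit in search(query)]
        except Exception:
            return []  # one unreachable server should not fail the whole search

    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        per_server = list(pool.map(ask, servers.items()))
    hits = [hit for group in per_server for hit in group]
    # Present a uniform result space: here, simply ordered by title.
    return sorted(hits, key=lambda hit: hit["title"])


def survey_server(query: str) -> List[dict]:
    return [{"title": "Rivers of the region"}] if "river" in query else []


def hydrology_server(query: str) -> List[dict]:
    return [{"title": "Lakes and rivers"}]


def offline_server(query: str) -> List[dict]:
    raise ConnectionError("server not reachable")


results = distributed_search(
    {"survey": survey_server, "hydrology": hydrology_server, "offline": offline_server},
    "river",
)
```

Querying in parallel keeps the total response time close to that of the slowest server rather than the sum of all of them, which matters when a gateway addresses many catalogues.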

Registering Catalogue Servers

The nature of distributed catalogues requires that the knowledge of the existence and properties of any given catalogue participating in a community be known to the community. In support of GSDI concepts, the need for a dynamic and comprehensive directory of services including catalogue servers is ever more important. The directory of servers concept allows an individual catalogue operator to construct and register service metadata with a central authority. This registry is then a searchable catalogue in its own right so that software may discover suitable catalogue targets based on their predominant geographic extent, descriptive words or classification, country of operation, or organisational affiliation, among other properties. Already national listings of compatible catalogue servers have been built, but the operation of a global network of catalogue servers within GSDI will require that a common directory of servers be built and managed to assure current content, distributed ownership, and authoritative reference to servers.

The features of the directory of servers may include:

  • One descriptive entry per service collection (server metadata)
  • Ability for a donor to contribute or update a record in the directory
  • Ability to validate access to a server, as advertised
  • User browse access of online server metadata
  • Software search access of server metadata
  • Management of active/inactive records, accessibility statistics
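
The features listed above can be sketched as a small in-memory registry. This is a minimal sketch, assuming hypothetical names (ServerEntry, DirectoryOfServers, extent) and a bounding box as the description of a server's predominant geographic extent; it does not reflect the design of any operational directory.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # west, south, east, north in degrees


@dataclass
class ServerEntry:
    """One descriptive entry per service collection (server metadata)."""
    name: str
    country: str
    extent: BBox
    active: bool = True


def overlaps(a: BBox, b: BBox) -> bool:
    """True when two boxes share any area or edge (no antimeridian handling)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class DirectoryOfServers:
    def __init__(self) -> None:
        self._entries: Dict[str, ServerEntry] = {}

    def register(self, entry: ServerEntry) -> None:
        """Contribute a new record or update an existing one."""
        self._entries[entry.name] = entry

    def validate(self, is_reachable: Callable[[ServerEntry], bool]) -> None:
        """Mark each record active or inactive using a caller-supplied probe."""
        for entry in self._entries.values():
            entry.active = is_reachable(entry)

    def find(self, area: BBox) -> List[ServerEntry]:
        """Return active servers whose extent intersects the area of interest."""
        return [e for e in self._entries.values()
                if e.active and overlaps(e.extent, area)]


directory = DirectoryOfServers()
directory.register(ServerEntry("au-node", "Australia", (112.0, -44.0, 154.0, -10.0)))
directory.register(ServerEntry("uk-node", "United Kingdom", (-8.0, 49.0, 2.0, 61.0)))
targets = directory.find((140.0, -40.0, 150.0, -30.0))
```

A gateway could call find before distributing a query so that only servers likely to hold relevant data are contacted.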

Several national distributed catalogue activities support management services for server-level metadata and contain references to servers predominantly in their country. The GSDI now sponsors a global directory of catalogue servers for all countries to utilise, with delegation of authority made to participating countries to manage and validate host information for their servers (http://registry.gsdi.org/registry) but it does not provide for the cataloguing of all service types at this time. The UDDI (http://www.uddi.org) offers the potential of a public, replicated “universal business registry” hosted by IBM, Microsoft, and SAP, that could be used by SDI publishers to advertise the existence of their services. Research into the use of the UDDI as a service directory for the GSDI is underway.

Relevant Standards

The GSDI distributed catalogue has been designed with maximum reliance on existing technologies and standards. Because of this, existing software can be re-utilised or adapted to support geospatial information without requiring special investment in new technologies. Key standardisation efforts in access to catalogues are found in the ISO 23950 Search and Retrieve Protocol, the OpenGIS Consortium Catalogue Services Specification Version 1.0, and relevant standards or "recommendations" of the World Wide Web Consortium (W3C).

ISO 23950, also known as ANSI Z39.50, is a search and retrieval protocol developed initially in the library community for access to virtual catalogues. Key features of the ISO 23950 protocol include:

  • Support of registered public "field" attributes for query across multiple servers where they may be mapped to private attributes
  • Platform-independent implementation over TCP/IP using ASN.1 encoded protocol data units
  • Ability to request both content (known as Element Sets or groups of ‘fields’ such as Brief or Full) and presentation format (Preferred Syntax, e.g. XML, HTML, text)
  • GEO (Geospatial Metadata) Profile with registered implementation guidance for current FGDC and ANZLIC metadata and soon to include ISO 19115 metadata elements

The use of a generalised query protocol based on ISO 23950 permits a migration from national forms of metadata to future forms being developed through international consensus under ISO Technical Committee 211 and their draft metadata standard 19115. Even though the metadata standard will change, the GEO Profile specifies the meaning of search fields in a way that allows them to be mapped to multiple metadata schemas where compatible elements exist. Under the GEO Profile, search of international metadata can be achieved today across collections in the United Kingdom, the United States, Africa, Canada, Latin America, and Australia in a single search, even though different underlying local metadata models exist.
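
The mapping of public search fields to private attributes can be sketched with a lookup table per metadata schema. All schema and field names below are simplified stand-ins chosen for illustration; they are not the registered GEO Profile attributes, which should be taken from the profile itself.

```python
# Public search field -> private field name in each local metadata schema.
# Every name here is a hypothetical stand-in, not a registered attribute.
FIELD_MAP = {
    "schema_a": {"title": "Title", "keyword": "Theme_Keyword"},
    "schema_b": {"title": "citationTitle", "keyword": "descriptiveKeyword"},
}


def translate(schema: str, public_field: str) -> str:
    """Map a public field to the private field used by one schema."""
    return FIELD_MAP[schema][public_field]


def search(collection: list, schema: str, public_field: str, term: str) -> list:
    """Match a term against whichever private field carries the public meaning."""
    private = translate(schema, public_field)
    return [record for record in collection
            if term.lower() in str(record.get(private, "")).lower()]


collection_a = [{"Title": "Soil map of the northern district"}]
collection_b = [{"citationTitle": "Soil survey, coastal plain"},
                {"citationTitle": "Road network"}]

# The same public query runs against both collections despite different schemas.
hits_a = search(collection_a, "schema_a", "title", "soil")
hits_b = search(collection_b, "schema_b", "title", "soil")
```

Because the client only ever names the public field, a server can change its underlying metadata model by updating its own mapping without affecting existing search clients.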

The OpenGIS Consortium published a Catalogue Services Specification in 1999 that provides a general model for geospatial data discovery through a catalogue that includes management, discovery, and data access services. These general services are described for implementation in the OLEDB, CORBA, and ANSI Z39.50 (ISO 23950) environments. The management functions include the ability to specify interfaces for creation, entry, update, and deletion of metadata entries to a catalogue. The discovery functions include the ability to search for and retrieve metadata entries from a catalogue with embedded references within the formal metadata to on-line data access, where available. The access functions support extended access to or ordering of spatial data based on references established in the metadata. Only the discovery functions are deemed mandatory in the Catalogue Services implementations; guidance is provided for implementation of optional management and access (really ordering) in interoperable ways.

At the OGC meeting in Southampton, U.K., a common catalogue services approach was presented and demonstrated that built upon the essential search and retrieval model of ISO 23950. Initial implementation specifications in Version 1.0 of the Catalogue Services Specification were submitted for CORBA, OLEDB, and ISO 23950. Distributed parallel search across these different protocols was demonstrated through an extension of commercially available gateway software.

A Web-based HTTP Protocol Binding for Catalogue search is being published in Version 2.0 of the OGC Catalogue Service Specification. OGC Testbed activities have shown the popularity of the HTTP-based approach to catalogue services, which still applies the basic tenets of ISO 23950. Known variously as the “Stateless Catalog” and the “Web Registry Service”, this protocol binding will be known as the “Catalogue Service – Web (CS-W)” and will complement the CORBA and ISO 23950 bindings defined in Version 1.1.1.

The International Organization for Standardization (ISO) has a Technical Committee, TC 211, dedicated to the standardisation of abstract concepts relating to geospatial data, services, and the geomatics field in general. The International Standard for metadata (ISO 19115) provides a comprehensive vocabulary and structure of metadata that should be used to characterise geographic data. The companion Technical Specification ISO 19139 defines the encoding of this metadata. The development of national and discipline-oriented profiles of ISO 19139 will facilitate exchange of information using common semantics and syntax.
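
To make the idea of an ISO 19139 encoding concrete, the following sketch reads the data set title from a heavily abbreviated record using the Python standard library. The sample record is illustrative only and omits many elements that a schema-valid ISO 19139 document would require; the `dataset_title` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the ISO 19139 XML encoding.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Abbreviated, illustrative record (not schema-valid on its own).
SAMPLE = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:identificationInfo>
    <gmd:MD_DataIdentification>
      <gmd:citation>
        <gmd:CI_Citation>
          <gmd:title>
            <gco:CharacterString>Example road network</gco:CharacterString>
          </gmd:title>
        </gmd:CI_Citation>
      </gmd:citation>
    </gmd:MD_DataIdentification>
  </gmd:identificationInfo>
</gmd:MD_Metadata>"""


def dataset_title(xml_text):
    """Return the citation title of a metadata record, or None if absent."""
    root = ET.fromstring(xml_text)
    node = root.find(
        "gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/"
        "gmd:CI_Citation/gmd:title/gco:CharacterString",
        NS,
    )
    return node.text.strip() if node is not None and node.text else None
```

Because every conforming record places the title at the same path, a catalogue can index titles from many providers with a single rule of this kind.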

The World Wide Web Consortium (W3C) is a group of implementing organisations interested in developing common specifications, known as "recommendations", for wide support on the Web. One key set of recommendations and work items focuses on the Extensible Markup Language (XML), a markup language specifically geared to encoding structured content of information. Companion topics include the XML-Schema activity, working on defining the schema and data types for XML documents, and XML-Query -- at present only a design activity for a request syntax for XML-structured documents. The XML 1.0 Recommendation is in general use now, and is seeing wider application in the geographic software field as an increasingly richer means to encode and transfer structured information of all types. XML-Schema has recently been approved by the W3C and supports more rigorous validation of XML files.

Implementation Approach

The development of operational distributed catalogue services has been taking place in a number of countries including the United States, Canada, Mexico, Australia, and South Africa as primary examples. The software systems used to implement the ISO 23950 and Web-based services have been developed largely through governmental support, resulting in both open source and commercial software solutions. The evolution of protocols and industry practices is difficult to predict, but this section provides a review of available solutions.

Let's review a technical use case scenario for access to a distributed catalogue:

  1. A User uses client software to discover that a distributed catalogue search service exists. This may be done through a search of Web resources, a saved bookmark, reference from a referring page, or word-of-mouth referral.
  2. User opens the user interface and assembles the parameters required to narrow down a search of available information.
  3. The search request is passed to one or more servers based on user requirements through a gateway service. The search may be iterative, repeating or refining queries based on new interactions with the user.
  4. Results are returned from each server and are collated and presented to the User. Types of response styles may include: a list of "hits" in title and link format, a brief formatting of information, or a full presentation of metadata. Visualisation of multiple results may also be available through display of data set locations on a map, thematic groupings, or temporal extent.
  5. User selects the relevant metadata entry by name or reference and selects the presentation content (brief, full, other) and the format (HTML, XML, Text, other) for further review.
  6. User decides whether to acquire the data set through linkages in the metadata. By clicking on embedded Uniform Resource Locators (URLs) the user can directly access online ordering or downloadable resources, whereas distribution information lists alternate forms of access.

The Distributed Catalogue is implemented using a multi-tier software architecture that includes a Client tier, a middleware or “Gateway” tier, and a server tier, as is illustrated in Figure 4.4.

Figure 4.4 - Implementation view of distributed catalog services

The client tier is realised by a traditional Web browser or a native search client application. The Web browser uses conventional Hypertext Transfer Protocol (HTTP) communications, whereas the native search client uses the ISO 23950 protocol directly against a set of servers. It is also possible to collapse this multi-tier architecture into two tiers where middle-tier functionality is present in the client.

The middle tier in the architecture includes a World Wide Web to catalogue services protocol gateway. A Gateway effectively converts an HTTP POST or GET request into multiple catalogue service client requests that run either in series or in parallel. Gateway solutions provide parallel distributed search of multiple catalogue servers from a single client Web session. At present, Gateways have been installed in the U.S., Canada, Mexico, South Africa, and Australia to provide regional points of access. The forms and interfaces installed at each are identical, and each hosts parallel search of all servers. In order to track a large number of Distributed Catalogue servers, a list of known, compatible servers called a Directory of Servers or Registry must also be managed. This service contains server or collection-level metadata that can itself be searched as a special catalogue. In this way, an intelligent one-pass search of eligible servers can be performed instead of requiring the user to select servers from a list, or to have all queries passed to all servers.
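
The fan-out behaviour of a Gateway can be sketched in a few lines. The example below is not taken from any of the gateway products mentioned in this Chapter: each catalogue server is represented by a plain Python callable that stands in for a real protocol client, and the name `gateway_search` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def gateway_search(query, servers, max_workers=8):
    """Send one query to every server in parallel and collate the hits.

    `servers` maps a server name to a callable that accepts the query and
    returns a list of record dicts. Each callable is a stand-in for a real
    catalogue client.
    """
    def run(item):
        name, search = item
        try:
            return name, search(query), None
        except Exception as exc:  # a failing server must not sink the others
            return name, [], str(exc)

    hits, errors = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs the searches concurrently but yields in input order.
        for name, records, error in pool.map(run, servers.items()):
            if error:
                errors[name] = error
            # Tag each record with its origin so results can be grouped.
            hits.extend({"server": name, **record} for record in records)
    return hits, errors
```

Returning the errors alongside the hits lets the user interface report which servers did not answer, rather than silently presenting an incomplete result.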

At the bottom tier of the service architecture are the catalogue servers. These servers can be accessed using the GEO Profile of the ISO 23950 protocol, although CORBA implementations also exist. The GEO Profile of ISO 23950 is available to implementors in the geospatial community as an extended set of the traditional bibliographic fields that can be searched. GEO includes geospatial coordinates (latitude and longitude) and temporal fields in addition to free-text search (e.g. a search for a word anywhere in the metadata entry). ISO 23950 servers may be implemented on top of XML document databases, object-relational, or relational database systems in which structured metadata are stored for search and presentation.
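
The three kinds of search criteria described above (free text, geospatial coordinates, and time) can be combined as in the following sketch. The record layout and function names are assumptions for illustration and do not reproduce the GEO Profile's attribute definitions; bounding boxes are assumed not to cross the 180-degree meridian.

```python
from datetime import date


def bbox_intersects(a, b):
    """True if two boxes overlap. Boxes are (west, south, east, north)
    in decimal degrees and must not cross the 180-degree meridian."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def matches(record, text=None, bbox=None, begin=None, end=None):
    """Apply optional free-text, spatial, and temporal criteria (ANDed)."""
    if text:
        # Free text: look for the term in any string-valued field.
        haystack = " ".join(v for v in record.values() if isinstance(v, str))
        if text.lower() not in haystack.lower():
            return False
    if bbox and not bbox_intersects(record["bbox"], bbox):
        return False
    # Temporal: the record's period must overlap the requested period.
    if begin and record["end"] < begin:
        return False
    if end and record["begin"] > end:
        return False
    return True
```

A server built on a relational or XML database would normally push these tests down into indexed queries rather than scanning records one by one.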

The ISO 23950 protocol was selected for use in the Distributed Catalogue for several reasons. First, the library catalogue service community existed with relevant software and specifications that could be enhanced for geospatial search. By adopting compatible terms, library catalogues can be searched with GEO catalogues. Second, the ISO 23950 protocol specifies only client and search behavior and does not specify the native data structures or query language used to manage the metadata behind the server. Abstraction of query allows for a public query on “well known” fields that can be translated at each server into local equivalents. This lets one keep current database structures and names but supports alternative access through this geospatial public "view," expressed in XML or HTML reporting forms. This common search functionality across hundreds of servers is a prerequisite to distributed search. It allows for local database management autonomy yet supports federated search. Third, the protocol is independent of computer platform. ISO 23950 search clients and servers exist for many types of UNIX and Windows platforms, and Java libraries are available for additional client and server programming.

This separation between local and public metadata search fields has allowed for the ISO 23950 search of many different types of metadata collections that support the GEO Profile, even though they may not support the same metadata model. For example, the Australia and New Zealand Land Information Council (ANZLIC) metadata contains different tag names than FGDC metadata in the US. Through standard translation tables in the server, search against ANZLIC's "Data Set Name" field is associated with "Title" (the query labels this as attribute number 4) in the registered public fields. As a result, Australian catalogue servers can be searched through the FGDC Clearinghouse Gateways but return metadata records of a different structure. The same approach could be applied to other community metadata services, such as those employed by the Directory Interchange Format (DIF) files used in the space and global change disciplines or other metadata standards with similar content. Ideally, metadata formats should be delivered in such a structure that they could be converted or translated for consistent presentation, even if they come from different communities. The Extensible Markup Language (XML) and translator software are starting to enable the transformation of different XML documents in different schemas.
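
A translation table of this kind can be as simple as a lookup from public attribute number to local field name. The sketch below uses only the one mapping stated in the text (attribute 4, "Title", corresponding to ANZLIC's "Data Set Name"); the table layout and the `translate` function are assumptions for illustration.

```python
# Public attribute number -> local field name, per metadata model.
# Only attribute 4 ("Title") is taken from the text; a real server would
# carry the full set of registered public fields.
TRANSLATION = {
    "FGDC": {4: "Title"},
    "ANZLIC": {4: "Data Set Name"},
}


def translate(attribute, term, model):
    """Rewrite a public-field query as a query on the local field."""
    try:
        local_field = TRANSLATION[model][attribute]
    except KeyError:
        raise LookupError(
            f"attribute {attribute} is not mapped for {model}"
        ) from None
    return {local_field: term}
```

The same public query, attribute 4 with the term "roads", thus becomes a search on "Title" at an FGDC server and on "Data Set Name" at an ANZLIC server, with no change on the client side.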

Catalogue Server/Service Development

To encourage widespread participation in the Clearinghouse, catalogue service software has been developed under direction of the FGDC and other coordination organisations around the world. Reference implementations of software exist to provide a free or low-cost example of metadata management and Distributed Catalogue service that can be quickly implemented. The software can also be used as reference by commercial developers to test anticipated functionality and interoperability and to develop value-added products.

A catalogue service that participates in a distributed catalogue should fulfill the following requirements:

  • Support of a standard protocol (ISO 23950 preferred) for search and retrieval on an Internet-accessible server. When conformance testing for OGC Catalogue Services profiles is available, servers should be certified as OpenGIS-compliant (no conformance test methodology exists as of February 2000).
  • Linkage to an indexed metadata management system that supports multi-field queries on text, numeric, and extended (e.g. "bounding box") data types, supports AND and OR constructs, and can return entries in a structured form that are or can be converted into a requested report in HTML, XML, and text. This may be a relational database, an object-relational database, or an XML database, or even a request to a remote catalogue to perform cascading catalogue services.
  • Ability to translate public fields/attribute structures into names and structures used in the metadata management system using a national or international vocabulary (ISO 19115, when available)
  • Ability to add, update, or delete metadata entries in the metadata management system
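
The management and query requirements listed above can be illustrated with a minimal in-memory sketch. It is a toy stand-in for the indexed metadata management system, not a description of any existing catalogue software; the class and helper names are hypothetical.

```python
class Catalogue:
    """Toy metadata store with add, update, delete, and search."""

    def __init__(self):
        self._records = {}

    def add(self, record_id, record):
        if record_id in self._records:
            raise KeyError(f"{record_id} already exists")
        self._records[record_id] = dict(record)

    def update(self, record_id, **changes):
        self._records[record_id].update(changes)

    def delete(self, record_id):
        del self._records[record_id]

    def search(self, predicate):
        """Return the sorted ids of all records the predicate accepts."""
        return sorted(
            rid for rid, rec in self._records.items() if predicate(rec)
        )


def field_contains(field, term):
    """Single-field test: case-insensitive substring match."""
    return lambda rec: term.lower() in str(rec.get(field, "")).lower()


def all_of(*predicates):
    """AND construct over several field tests."""
    return lambda rec: all(p(rec) for p in predicates)


def any_of(*predicates):
    """OR construct over several field tests."""
    return lambda rec: any(p(rec) for p in predicates)
```

For example, `cat.search(any_of(field_contains("title", "road"), field_contains("theme", "hydro")))` returns every entry that satisfies either condition, given a `Catalogue` instance named `cat`.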

Available Software Implementations

The Isite software suite is a reference implementation of the Catalogue server that includes an XML document database and an ISO 23950 server supporting the GEO Profile for use on Windows and UNIX platforms. The U.S. Federal Geographic Data Committee is one of several sponsors that continue to support the development of this open-source software code. Isite supports document types conforming to the ANZLIC (Australia/New Zealand), Directory Interchange Format (DIF), Federal Geographic Data Committee's (FGDC) Content Standard for Digital Geospatial Metadata, and the draft ISO 19115/19139 interpretations, and is used in a number of countries that support these content standards.

Several commercial catalogue services supporting the OpenGIS Consortium Catalogue Services Specification Version 1.0 Web Profile via ISO 23950 are available on the market today. Links to known commercial solutions are posted on the Federal Geographic Data Committee web site (http://www.fgdc.gov/clearinghouse). When Version 2.0 of the OGC Catalogue Services specification is released and conformance testing methodologies are available, validated OGC-compliant software will also be listed from the OpenGIS web site (http://www.opengis.org).

Catalogue Gateway and Access Interface Development

As depicted in Figures 4.3 and 4.4, there is often a need for an intermediary to provide application integration for an end user. Known as "application servers" or middleware, these hosts allow for the storage, construction, and download of user interfaces to end users and communicate with multiple catalogue servers simultaneously -- a feat not supported by many web browsers due to security settings.

Software systems, such as application servers, that integrate catalogue search and other GIS and mapping functions benefit from the community development of software development kits (SDKs) based on standards. SDKs can provide client and server libraries for catalogue search and other services based on standard interfaces. Through component architecture, these SDKs expedite development of advanced software by combining appropriate pieces of software together as needed, reducing the need for a programmer to learn the intricacies of a given service.

A UNIX-based reference implementation gateway from the World Wide Web to multiple ISO 23950 targets, known as ZAP, is available for non-commercial use from IndexData in Denmark (http://www.indexdata.dk). A Perl-based programming client library to ISO 23950 is also available from the Joint Research Centre in Italy (http://perlz.jrc.it/download). A Java-based distributed search module to multiple ISO 23950 targets from common web servers is also being commissioned as open source software by the US FGDC, as is a client-side Java library.

Registering Catalogue Servers

The operation of a growing network of distributed catalogue servers requires the management of server-level information in a central location. This registry server, shown in Figure 4.4, essentially houses server or collection-level metadata for search and retrieval and use in distributed query. In this way a search may be first made of the registry of servers to identify candidate servers to target the query, and as a broker, the registry returns the list of likely targets based on criteria such as geographic and temporal extent and other search limits. A registry facility greatly improves the scalability of a national, regional or global network of catalogues.
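
The brokering role of the registry can be sketched as a filter over collection-level metadata. The registry entries, their field names, and the `candidate_servers` function below are illustrative assumptions; extents are (west, south, east, north) boxes that do not cross the 180-degree meridian.

```python
def boxes_overlap(a, b):
    """True if two (west, south, east, north) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_servers(registry, bbox, year=None):
    """Return the names of servers whose holdings may satisfy the query.

    Each registry entry is a dict with a 'name', a collection-level 'bbox',
    and optionally 'years' as a (first, last) pair.
    """
    names = []
    for entry in registry:
        if not boxes_overlap(entry["bbox"], bbox):
            continue  # collection lies entirely outside the area of interest
        years = entry.get("years")
        if year is not None and years and not years[0] <= year <= years[1]:
            continue  # collection does not cover the requested year
        names.append(entry["name"])
    return names
```

Only the servers returned by such a filter need to receive the full query, which is what allows the network to grow without every search reaching every catalogue.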

In the context of the GSDI, a coordinated registry of catalogue (and other) services is needed. If all catalogues were registered into a common and distributed registry akin to the way the Domain Name System (DNS) works, resolution of appropriate hosts of geospatial information globally would be enabled.

The GSDI hosts a global, searchable registry of catalogue servers using Isite fed by XML generated from an Access database. All geospatial catalogues conforming to FGDC, ISO, or ANZLIC metadata profiles should be registered here. This will be replaced with a conformant OpenGIS Catalogue solution supporting ISO metadata in the coming year (http://registry.gsdi.org/registry). A coordinated registry between the U.S. and Canada is proposed through an interagency agreement between the FGDC/GSDI Secretariat and Geomatics Canada as a model for other countries to follow in managing and coordinating their own national catalogue entries with the global system.

Recommendations

  • The Cookbook authors recommend that organisations publish their metadata using OpenGIS Consortium Catalogue Services Specification, Version 2.0.2.

The baseline standard, the HTTP Protocol Binding, better known as Catalogue Service for the Web (CSW), provides nominal interoperability for search and presentation of records among all implementations. Profiles exist for the explicit search and retrieval of ISO 19115/19139 records (ISO Metadata Application Profile), and a more general-purpose profile called ebRIM (sometimes listed as Web Registry Service). In addition, the ISO 23950/ANSI Z39.50 "Search and Retrieve" Protocol binding is still in common use, as referenced by this standard. Existing reference implementation software for catalogue services allows organisations to participate at a very low cost; commercial implementations allow organisations to scale their collections and applications.
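
As an illustration of the HTTP binding, the sketch below assembles a CSW 2.0.2 GetRecords request in key-value-pair form. The endpoint in the usage note is a placeholder, the function name is hypothetical, and the parameter names reflect one reading of the 2.0.2 KVP encoding; they should be checked against the specification and the target server's capabilities document before use.

```python
from urllib.parse import urlencode


def getrecords_url(endpoint, cql=None, element_set="brief", start=1, limit=10):
    """Build a GetRecords request URL for a CSW 2.0.2 endpoint."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "resultType": "results",
        "elementSetName": element_set,  # brief, summary, or full
        "startPosition": start,
        "maxRecords": limit,
    }
    if cql:
        # Optional filter expressed as CQL text.
        params["constraintLanguage"] = "CQL_TEXT"
        params["constraint_language_version"] = "1.1.0"
        params["constraint"] = cql
    return f"{endpoint}?{urlencode(params)}"
```

Calling `getrecords_url("https://catalogue.example.org/csw", cql="AnyText LIKE '%roads%'")` yields a URL that any HTTP client can fetch; the response is an XML document containing the matching records in the requested element set.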

  • The Cookbook authors recommend that participants register their catalogue servers at the GEOSS Component and Service Registry (CSR).

The Group on Earth Observations (GEO) hosts a global service registry that acts as a directory of all known web services in the Earth Observation and SDI communities. By listing catalogue services in such a system, publishers can ensure that they can be discovered in a trans-national context.

References and Linkages

OpenGIS Catalogue Service Implementation Specification, Version 2.0.2, OpenGIS Consortium (http://www.opengeospatial.org/standards/cat)

Z39.50 International Standard Agency Home Page, (http://lcweb.loc.gov/z3950/agency/)
