The Isaac Network: LDAP and Distributed Metadata for Resource Discovery

Christopher Lukas
University of Wisconsin-Madison
1210 W. Dayton St.
Madison, WI, USA 53706
phone: (608) 265-4678
fax: (608) 265-9296
lukas@cs.wisc.edu

Michael Roszkowski
University of Wisconsin-Madison
1210 W. Dayton St.
Madison, WI, USA 53706
phone: (608) 265-9234
fax: (608) 265-9296
mfr@cs.wisc.edu

ABSTRACT

Locating information on the World Wide Web is decidedly difficult. The problem has spawned a growing number of small, standalone collections of metadata describing high-quality Internet resources. The Isaac Network provides a means to link such geographically distributed metadata collections into an entity searchable as a single collection. It uses standard Internet protocols to make the metadata records available, process user queries, and transfer index information between servers. The goal of the network is to bring together high-quality collections of human-cataloged Internet resources to provide a resource discovery system that allows searchers to query the information by field, such as title, subject, author, or description. It uses the Dublin Core as its standard metadata description format. The Isaac Network currently consists of three nodes with a combined collection of approximately 3900 metadata records and continues to grow.

1.0 INTRODUCTION

The relative ease of use of the World Wide Web has resulted in an explosion of users on the Internet and an explosion of information available on the network. The issue for users today is locating resources that they deem relevant and of high quality. Ultimately, each individual has to decide which resources are relevant and credible in a given situation. However, the task of discovering these resources would be much simpler if the pool of possibilities were narrowed to a manually pre-selected subset of resources chosen by information specialists to be of high quality.

Today there are numerous collections of "quality resources" available on the Web; however, they are generally individual, autonomous sites, not connected in any way to other quality collections. Internet users, especially those in academia, need the ability to send a single search command that will reach all the quality collections and, just as importantly, only the quality collections. The Isaac Network is being built to provide them with this service.

2.0 THE PROBLEM

The general problem addressed by most resource discovery and metadata research is that of providing the user an effective means of finding relevant, high-quality Internet resources. A subset of existing solutions to this problem consists of numerous, relatively small, hand-created collections of third-party metadata describing Internet resources. Typically centered on a particular subject, and often called "subject gateways," these collections can be very useful in finding a resource of interest because the searchable metadata is created and filtered by people. However, the decentralized nature and lack of common administration of these projects have resulted in a proliferation of wildly varied methodologies and search interfaces. The net result is that the thousands of person-hours invested in the collection and creation of these metadata records are underutilized because most users are unaware of the existence and location of all subject gateways. Ideally, the user should be able to use a single interface to send a single query, which will reach all subject gateways and return relevant results from each back to the user.

The goal of allowing a single search to contact multiple search services is not, in itself, novel. Both the Harvest [1] system and the Stanford Proposal for Internet Meta-Searching [2] are examples of how metadata can be extracted, combined, and used for searching multiple information sources. These systems are, however, primarily meant for automated metadata extraction and searching via a centralized "metasearcher." While the automated aspect is highly desirable, the results presented to the user may not be of high relevance to their needs or of high quality.

Closer to the architecture of the Isaac System is the CHIC-Pilot system [3], which uses components and structure similar to Harvest, but is not as specific as Isaac and does not currently provide index-based query routing. The closest predecessors of the Isaac System are the Resource Organisation And Discovery in Subject-based Services (ROADS) [4] system and the Arts and Humanities Data Service (AHDS) [5]. These services use the Whois++ directory services protocol [6] and the Z39.50 protocol [7], respectively. While the Z39.50 protocol is powerful, it is also quite complex and does not provide for query routing, which we feel is a necessity for scalability. The Isaac System is conceptually similar to the ROADS system and provides like functionality, but emphasizes the ability to make multiple collections searchable as a single entity.

3.0 THE GOAL

We have created one possible solution to this problem: the Isaac Network. The Isaac Network is a virtual network that allows distributed administration and control of metadata repositories while simultaneously enabling each site to participate in a collaborative network that creates a single, searchable collection of metadata. The Isaac architecture is built on the shared index capabilities of the Common Indexing Protocol (CIP) and the query-routing capabilities of the Lightweight Directory Access Protocol (LDAP). The goal of the Isaac Network is that each participating Isaac node (consisting of LDAP server, CIP client and server, and metadata repository) provide a common search interface along with knowledge about the other participating nodes; the user will then be able to search all nodes in the network via a single query from a common interface.

Beyond this primary goal, the Isaac System was created with a number of subgoals in mind:

  • The metadata in each repository should be of high quality; specifically, it should be manually applied by information specialists using a minimum of descriptive fielded data. The Scout Report Signpost [8] database of metadata is an example of such a repository.
  • The administration, creation, and maintenance of the metadata should be distributed.
  • The network of distributed metadata repositories must be scalable in terms of the number of nodes that are queried and indexed.
  • There should be little or no required central administration of the network.
  • The system should be fast and robust, but allow the investigation of novel algorithms for aspects such as indexing, query routing, rank merging, and so on.

These goals led the Isaac team to develop a system that should be useful both as a powerful, wide-ranging resource discovery system and as a tool with which to investigate future metadata problems.

4.0 METHODOLOGY

We designed and implemented the Isaac System by combining and extending existing protocols and technologies in a novel way. To our knowledge, the LDAP protocol has been used only for white pages type directories. The Isaac project is the first to use an LDAP directory for metadata records about resources, and to combine LDAP with CIP in a distributed index-sharing and query-routing architecture.

Isaac consists of three main software components: the metadata repository, the search service, and the index service.

4.1 THE METADATA REPOSITORY

The metadata repository is a database of metadata that is accessible via the Lightweight Directory Access Protocol (LDAP) [9]. It is, essentially, a directory containing metadata about Internet resources instead of white pages information. The repository requires that participants utilize a subset of the Dublin Core [10] metadata attributes, but provides an object-oriented schema model whereby the core attribute class may be extended through inheritance.

We require all collections that become part of the Isaac Network to contain at least the following information for each resource record: Title, Author or Publisher, Description, Identifier (URI), and Subject/Keywords. 
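The minimum-record requirement can be sketched as a simple check. This is an illustration only: the dictionary keys below are assumed names, since the actual LDAP attribute names of the Isaac schema are not given here.

```python
# Elements every record must carry, per the requirement stated above.
# Key names are illustrative assumptions, not the Isaac schema's attribute names.
REQUIRED = ("title", "description", "identifier", "subject")


def missing_fields(record):
    """Return the required elements that are absent or empty in a record."""
    missing = [field for field in REQUIRED if not record.get(field)]
    # "Author or Publisher" is satisfied by either element.
    if not (record.get("author") or record.get("publisher")):
        missing.append("author or publisher")
    return missing
```

A collection import script could call such a check on each record and reject or flag those for which the returned list is not empty.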

Figure 1. The Metadata Repository.

 

Using the LDAP protocol provides two important advantages to the system. First, the LDAP protocol is becoming widely adopted and embedded in numerous applications such as browsers. While there are currently no tools that provide an automatic awareness of third party LDAP schemas, we believe that the use of this protocol will enhance the accessibility of the metadata in the Isaac Network.

More importantly, the LDAP protocol provides the capability to route or refer the searching client to other repositories within the Isaac Network that are likely to contain records that answer the query. This aspect is a key component of Isaac because, when combined with the index service, it enables efficient searching of the entire network without broadcasting the search to every repository.

4.2 THE INDEX SERVICE

The index service operates in concert with the repositories and is the key component behind the linking or networking of the distributed repositories. At a high level, the index service extracts a representation of the information contained in each repository. This representation (or index) is used during query processing to determine which participating repositories are likely to contain records that match the query. Using this information, the query is routed only to repositories that are likely to contain matching records and not to repositories that are not.
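The routing decision described above can be illustrated with a minimal sketch. It assumes that each repository's index is reduced to a plain set of lowercased terms and that a query matches only if all of its terms are present; the function name and the repository names used to call it are hypothetical, and the actual Isaac referral logic may differ.

```python
def route_query(terms, indexes):
    """Return the repositories whose index contains every query term.

    indexes maps a repository name to the set of lowercased terms in its index.
    """
    wanted = {term.lower() for term in terms}
    # A repository is a candidate only if no query term is missing from its index.
    return sorted(name for name, vocab in indexes.items() if wanted <= vocab)
```

Under these assumptions a repository is skipped as soon as one query term is missing from its index, while a repository that holds all terms, but in different records, is still contacted.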

The indexing service is distributed among the participants and consists of two main components: the index server and the index client. Each participant in the network operates not only a metadata repository but also these indexing services. The index server periodically creates an index of the local metadata and supplies this index to any requesters via the Common Indexing Protocol (CIP) [11]. The index client periodically contacts the index servers of other nodes in the Isaac Network to request index information. The client then uses this information to generate query routing information (referrals).

Figure 2. The Index Service.

 

By default, the index service is set up to generate index information for the five Dublin Core fields that we have designated as mandatory for all collections included in the Isaac Network: Title, Author, Subject, Description, and Identifier (URI). The index consists of a vector of terms from each field, which is often called a centroid. The indexes currently use the tagged index object format [12] but may easily be modified for experimentation.
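A centroid of this kind can be sketched as a mapping from each indexed field to the set of distinct terms found in it. The tokenization rule (lowercased runs of word characters) and the field key names are assumptions made for illustration; the sketch does not produce the tagged index object format [12].

```python
import re
from collections import defaultdict

# The five default fields named above; key names are illustrative.
INDEXED_FIELDS = ("title", "author", "subject", "description", "identifier")


def build_centroid(records, fields=INDEXED_FIELDS):
    """Map each indexed field to the set of distinct terms occurring in it."""
    centroid = defaultdict(set)
    for record in records:
        for field in fields:
            value = record.get(field, "")
            # Assumed tokenization: lowercase, split on non-word characters.
            centroid[field].update(re.findall(r"\w+", value.lower()))
    return dict(centroid)
```

Because only distinct terms are kept, the centroid records that a term occurs somewhere in a field of the collection, not which record contains it.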

4.3 THE SEARCH SERVICE

The search service allows the user or any LDAP-enabled application to search every repository in the network in an efficient manner. Using the referral (query routing) information maintained by each repository, queries are routed to appropriate repositories and processed in parallel. Beyond the standard, low-level LDAP searching applications, the Isaac System also provides a configurable web-based interface that is aware of repository-specific schema information. The user-friendly interface allows searching on any of the required Dublin Core attributes when performing a distributed search, but also allows extended searching when performing a search on a particular repository.
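A fielded search of this kind is expressed to an LDAP server as a search filter. The sketch below builds such a filter string using the standard LDAP string representation (RFC 2254), combining substring matches with AND. The attribute names passed in are hypothetical, since the Isaac schema's attribute names are not listed here, and the choice of substring matching is an assumption.

```python
# Characters that must be escaped inside LDAP filter values (RFC 2254).
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}


def escape_filter_value(value):
    """Escape characters that are special in an LDAP search filter value."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def fielded_filter(criteria):
    """Build an AND filter of substring matches from {attribute: text}."""
    if not criteria:
        raise ValueError("at least one search criterion is required")
    parts = [
        f"({attr}=*{escape_filter_value(text)}*)"
        for attr, text in sorted(criteria.items())
    ]
    # A single criterion needs no enclosing AND.
    return parts[0] if len(parts) == 1 else "(&" + "".join(parts) + ")"
```

For example, `fielded_filter({"title": "law"})` yields `(title=*law*)`; the resulting string could be handed to any LDAP client library as the search filter.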

Figure 3. The Search Service.

 

5.0 RESULTS AND FUTURE DIRECTIONS

The Isaac Network currently consists of three nodes and collections: our own node (Signpost), one at Cornell University in the US (InSite), and one at the Lower Saxony State and University Library, Goettingen, in Germany (SSG-FI MathGuide). We have received interest from more than a dozen other institutions, but we are adding nodes only slowly.

The combined collections currently contain about 3,900 metadata records and are available for searching, but we have not widely advertised the availability of the search service. We expect to add a fourth node and collection that will increase the size of the combined collection to more than 10,000 records in February 1999.

Search and referral results

Because one of the goals of the Isaac Network is to utilize and investigate the effectiveness of using indexes to refer (route) queries to appropriate servers, we have collected data which allows us to analyze the effectiveness of query referrals.

One interesting metric is the percentage reduction in processed queries across all servers in our system compared to a similar system using a broadcast search policy. In other words, in a broadcast search system all servers must process all queries, but in the Isaac Network some servers are not contacted because the index information indicates that those servers do not have a possible answer. We have found that the average number of queries processed (i.e., servers contacted) in the Isaac Network is 26% lower than it would be in a broadcast search system.
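The metric can be written out as a short calculation. This is a sketch of one plausible way to compute it from a log of how many servers each query contacted; the function name and input shape are assumptions, not a description of our measurement scripts.

```python
def reduction_over_broadcast(servers_contacted_per_query, total_servers):
    """Fraction of server contacts saved relative to a broadcast search.

    servers_contacted_per_query holds, for each logged query, the number of
    servers that were actually contacted.
    """
    if not servers_contacted_per_query or total_servers <= 0:
        raise ValueError("need at least one query and one server")
    # Under broadcast, every query would reach every server.
    broadcast = total_servers * len(servers_contacted_per_query)
    actual = sum(servers_contacted_per_query)
    return 1 - actual / broadcast
```

As an illustrative case with made-up numbers, four queries that contact 3, 2, 1, and 3 of three servers make 9 contacts instead of 12, a reduction of 25%.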

It is intuitive, yet interesting to note that as the complexity of the query increases, the number of servers contacted decreases. Figure 4, shown below, is a table that shows query complexity as determined by the number of search terms in the query and the reduction in servers contacted relative to a broadcast search system.

Figure 4. Query complexity and reduction in searches.

Number of search terms    Reduction in searches over broadcast
1                         23.4%
2                         26.9%
3                         38.9%
4+                        48.5%

Another interesting metric is what we term a "bad referral." This metric is closely related to the reduction in processed queries. A bad referral is, in general, a referral that sends the searcher to a server that contains no answers to that search. More specifically, we constrain this term here to describe those referrals that, for searches that do return some results overall, refer the searcher to a server that does not return results for that search. We constrain it here because, in our opinion, searches that return no results from any server are less interesting in terms of referral behavior.

Our data show that only 21% of all referrals are bad referrals. This means that a large majority of the referrals do, in fact, send the searcher to an appropriate server, but some do not. In an ideal system, this number would be zero, meaning that the searcher is never referred to a server that contains no results that match the query. We believe that this metric is a good measure of the effectiveness of the index system. It is apparent that in a system where each server has full knowledge of the contents of every other server, the bad referral rate would be zero. The tagged index object format [12], in use by our system, contains compressed knowledge of the contents of the other servers and results in the aforementioned 21% bad referral rate. It should be noted that the Isaac System currently discards the record identifier information contained in the tagged index object [12]. An interesting future experiment will be to incorporate this identifier information in referral generation with the expectation of a reduced bad referral rate.
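The constrained definition of a bad referral can be made concrete with a small sketch. The input shape (one mapping per search, from each referred server to the number of hits it returned) is an assumption chosen for illustration.

```python
def bad_referral_rate(searches):
    """Share of referrals that led to a server returning no results.

    searches is a list of dicts, one per search, mapping each referred
    server to the number of hits it returned. Searches that return no
    results from any server are excluded, as in the definition above.
    """
    bad = total = 0
    for hits_by_server in searches:
        if not any(hits_by_server.values()):
            continue  # no results anywhere: not counted
        total += len(hits_by_server)
        bad += sum(1 for hits in hits_by_server.values() if hits == 0)
    return bad / total if total else 0.0
```

With made-up data, a search referred to two servers where one returns nothing, plus a search referred to three servers where one returns nothing, gives 2 bad referrals out of 5, or 40%.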

Beyond the search and referral results, several other issues have arisen already through the establishment of the Isaac Network. They include:

  • metadata issues, such as the use of different subject schemes by different collections,
  • network issues, such as performance of international links, and
  • porting issues, such as support for threading libraries under different UNIX variants.

Metadata issues

All three collections use different schemes and controlled vocabularies for subject information. Our own collection, Signpost, uses Library of Congress subject headings and Library of Congress Classification codes. SSG-FI uses the Mathematics Subject Classification (MSC) and MSC Verbal schemes, and InSite uses keywords from a 61-term controlled vocabulary as well as free-text keywords. So, while we allow searching on the DC Subject field, users are not likely to receive the results they expect. For example, Signpost contains a record for the Code of Hammurabi with DC Subject field containing "(scheme=LCC) KL" (KL is the classification code for History of Law). InSite contains a record for the same resource, but with DC Subject field containing "Legal History." Therefore, a single search of the Isaac Network using DC Subject would not return both of these records. It would be of high value to users of the Isaac Network if we could devise a method to rationalize the varying subject classification schemes used by the various collections. This would be a major undertaking, however.
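One conceivable form of such rationalization is a crosswalk table that maps scheme-specific subject values to a shared concept label. The sketch below is purely hypothetical: the Isaac software does not include it, and the table, the scheme label "keyword", and the concept label are invented for illustration from the Code of Hammurabi example.

```python
# Hypothetical crosswalk from (scheme, value) to a shared concept label.
CROSSWALK = {
    ("LCC", "KL"): "history of law",
    ("keyword", "Legal History"): "history of law",
}


def shared_concepts(subjects, crosswalk=CROSSWALK):
    """Return the shared concept labels for a record's (scheme, value) subjects."""
    found = set()
    for scheme, value in subjects:
        concept = crosswalk.get((scheme, value))
        if concept is not None:
            found.add(concept)
    return found
```

With this table, the Signpost and InSite records for the same resource map to the same concept even though their raw subject values share no terms. Building and maintaining such a table across real schemes is the major undertaking noted above.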

Performance issues

We have had some performance problems when accessing the SSG-FI collection in Germany. Network congestion is probably to blame, as the servers are fast machines that seem to quickly process the queries, but the packet loss rate is often upwards of 30%. To alleviate this problem, we may look at using LDAP replication to replicate the contents of the collection to the US side of the international links.

Due to the performance problems, we have experimented with providing user feedback during the search process. We are using simple methods such as server-push to keep the user informed during the progress of a multi-node search. Unfortunately, support for these methods is browser-dependent.

Porting issues

Currently the nodes in the Isaac Network run Sun Solaris/Intel, Sun SunOS/SPARC, and SUSE Linux/Intel. The major porting issue has been supporting different threads packages (Pthreads, LinuxThreads, and Sun LWP) in the multithreaded LDAP server. We have also tested the software in-house on Solaris/SPARC, FreeBSD, and Debian Linux.

Future directions

Our future plans include enhancements to the search service to support ranking of results and more powerful search capabilities such as proximity searching.

While LDAP supports add, modify, and delete operations, as well as searching, we currently do not support a "cataloger's interface" to the Isaac System. We are primarily working with organizations that have established metadata collections and so we have concentrated on providing tools (Perl scripts, etc.) to take metadata records from collaborators' existing collections and import the records into the metadata repository. In the future, we plan to provide a Web-based interface that will allow collection maintainers to add, modify, and delete metadata records within the repository. As part of this interface, we are looking at providing automated extraction of embedded Dublin Core metadata from Web resources.

Every Isaac node currently indexes every other Isaac node. As more collections are added to the network, the indexing information will likely become much larger than the local metadata collections. As we add nodes, this may become a problem. We made this design decision for simplicity -- in the current model, a query to any LDAP server should produce the same set of referrals to other servers. We may have to revisit this decision if it fails to scale well. We have a number of options that could be considered. For example, it would be possible to have nodes that contain only partial knowledge of the network, but the problem then becomes how to ensure that all relevant resources are returned for a given query. This model would be similar to that used by Harvest, with "brokers" that collect and supply index information to other servers. This introduces more centralization in the Isaac Network than we prefer, however.
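The growth implied by having every node index every other node can be made explicit with a small calculation. This is a back-of-the-envelope sketch; the function name is hypothetical, and it counts indexes and index transfers only, not their sizes.

```python
def full_mesh_index_load(nodes):
    """Count remote indexes held and transfers needed when all nodes index all others."""
    if nodes < 1:
        raise ValueError("the network needs at least one node")
    return {
        # Each node holds one index for every other node.
        "indexes_per_node": nodes - 1,
        # Each node fetches an index from every other node per refresh cycle.
        "transfers_per_cycle": nodes * (nodes - 1),
    }
```

For the current three nodes this gives 2 remote indexes per node and 6 transfers per refresh cycle; a fourth node raises these to 3 and 12, so the number of transfers grows quadratically with the number of nodes.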

6.0 CONCLUSIONS

The Isaac Network demonstrates the use of query routing and distributed indexing as a means of combining multiple metadata collections into a single, searchable entity. The use of LDAP for query routing seems to have been an appropriate choice, as it provides good flexibility in distributing the data, allows local maintenance of collections while making records available to other servers, and provides fairly good query capabilities. The protocol supports access controls and other features that may be useful in the future. While we currently supply an LDAP server as part of the Isaac software package, it should be possible for organizations to use commercially available LDAP V3 servers, such as Netscape's Directory Server, to host Isaac nodes.

The ability to search a combined collection of high-quality Internet resources by field should be valuable to the research and academic community. The ability to select resources by subject or author rather than simply by full-text search should increase query precision. The fact that the collections contain only resources deemed authoritative by subject specialists should also provide fewer irrelevant query results. We hope that the usefulness of the Isaac Network for targeted resource discovery will grow along with the number of collections in the network.

7.0 REFERENCES

[1] Harvest: A Scalable, Customizable Discovery and Access System, URL http://harvest.transarc.com/

[2] STARTS: Stanford Proposal for Internet Meta-Searching, URL http://www-db.stanford.edu/~gravano/starts.html

[3] Standards in the CHIC-Pilot Distributed Indexing Architecture, URL http://www.terena.nl/projects/chic-pilot/tnc/paper.html

[4] ROADS: Resource Organisation and Discovery in Subject-based Services, URL http://www.ilrt.bris.ac.uk/roads/

[5] Arts and Humanities Data Service (AHDS), URL http://www.ahds.ac.uk/

[6] Architecture of the Whois++ service, URL http://www.ietf.org/internet-drafts/draft-ietf-asid-whoispp-02.txt

[7] Z39.50 standards are maintained by NISO, URL http://www.niso.org/

[8] Internet Scout Project Signpost, URL http://www.signpost.org/

[9] RFC 2251: Lightweight Directory Access Protocol (v3), URL ftp://ftp.isi.edu/in-notes/rfc2251.txt

[10] The Dublin Core Home Page, URL http://purl.oclc.org/metadata/dublin_core

[11] The Architecture of the Common Indexing Protocol (CIP), URL http://www.ietf.cnri.reston.va.us/internet-drafts/draft-ietf-find-cip-arch-01.txt

CIP Transport Protocols, URL http://www.ietf.cnri.reston.va.us/internet-drafts/draft-ietf-find-cip-trans-00.txt

MIME Object Definitions for the Common Indexing Protocol (CIP), URL http://www.ietf.cnri.reston.va.us/internet-drafts/draft-ietf-find-cip-mime-02.txt

[12] A Tagged Index Object for use in the Common Indexing Protocol, URL http://www.ietf.org/internet-drafts/draft-ietf-find-cip-tagged-06.txt