A Scout Report Signpost Look at One Aspect of Metadata - Resource Type

Amy Tracy Wells, Coordinator of the Scout Report Signpost

November 1997

This month we turn the End User's Corner over to Amy Tracy Wells, Internet Scout Project's Signpost Coordinator, for her view of one aspect of metadata (descriptive information about information). As will be pointed out, many types of information resources are accessible via the Internet. Signpost catalogs these resource types, as well as many other types of metadata related to each Scout Report annotation. Amy's tasks with Internet Scout include the design, creation, and maintenance of our Signpost Library of Congress Catalog of Scout Report annotations, as well as researching many aspects of metadata tagging for Internet resources and distributed information retrieval in general. [Jack Solock]

The Internet is an enormous information repository which continues to grow very rapidly. While we don't know the number of Web pages (or FTP and Gopher documents for that matter) on the Internet, in July 1997, Network Wizards published their latest "Internet Domain Survey" (http://www.nw.com/zone/WWW/report.html) which estimated the number of hosts (machines connected to the Internet) to be approximately 19,540,000. In the article, Network Wizards stated that this number is a "fairly good estimate of the minimum (emphasis added) size of the Internet."

Much of the discussion surrounding this growth focuses on whether the content of the Internet is legitimate. Those who believe, as we do at the Internet Scout Project (http://scout.cs.wisc.edu/scout/index.html), that much of the content has worth and utility are focused on developing public information gateways to the Internet's content. To develop these gateways, we must first decide what the common characteristics of this information are and then develop metadata based on that knowledge. This analysis would then allow us to design and build gateways that are, ideally, end-user friendly and able to interoperate with other gateways. The purpose of this effort is to reduce the effort required of the end-user.

Metadata is simply descriptive information about a resource. For example, President Clinton's 1997 State of the Union address is available on the Web. Metadata, or descriptive information, for this particular resource might include the author's name (William J. Clinton), title (1997 State of the Union Address: William J. Clinton), its location (http://www.whitehouse.gov/WH/SOU97/), publisher (U.S. White House), subject (Presidents -- United States -- History -- 20th Century), language (English), etc. The purpose of metadata, from an end-user's perspective, is that this information can allow someone to make a qualitative decision regarding the resource prior to selecting and/or reading the whole item. Knowing some of the descriptive information for a resource may be enough to decide whether it is worth a time investment.
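The descriptive record above can be pictured as a small data structure. The following is only a minimal sketch in Python: the class name `ResourceRecord`, its field names, and the `summary` helper are illustrative assumptions, not the format Signpost actually uses. The field values are the ones given in the example above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ResourceRecord:
    """Descriptive information (metadata) about one Internet resource.

    Hypothetical structure for illustration; not the Signpost schema.
    """
    author: str
    title: str
    location: str
    publisher: str
    subject: str
    language: str


# Values taken from the State of the Union example in the text.
sotu = ResourceRecord(
    author="William J. Clinton",
    title="1997 State of the Union Address: William J. Clinton",
    location="http://www.whitehouse.gov/WH/SOU97/",
    publisher="U.S. White House",
    subject="Presidents -- United States -- History -- 20th Century",
    language="English",
)


def summary(record: ResourceRecord) -> str:
    """Render the record as 'field: value' lines, one per metadata element."""
    return "\n".join(f"{name}: {value}" for name, value in asdict(record).items())


print(summary(sotu))
```

A reader who sees only this short summary can already judge whether the full resource is worth retrieving, which is the end-user benefit described above.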

The Scout Report Signpost (http://www.signpost.org/signpost/index.html), developed by the Internet Scout Project, was designed as an information gateway to the contents of the Scout Report (http://scout.cs.wisc.edu/scout/report/index.html). The Scout Report and the subject-specific scout reports are publications designed to guide the research and education community to quality Internet resources. The resources in Signpost have been cataloged and organized using metadata about each individual resource. Specifically, information is collected about the title, author, contributor (e.g. illustrator, editor, translator), subject headings, language, publisher, primary URL (location where the resource is available), etc.

An informal working group known as the Dublin Core (http://purl.oclc.org/metadata/dublin_core/) has developed a proposed schema or format for describing resources. The Dublin Core is composed of people in and out of academia, computer scientists as well as librarians. Their goal is to develop consensus on which metadata elements are important when describing information. The idea is that, if gateway designers can agree on a standard schema for describing any given resource, that resource information can then be shared between gateways. What this could mean for the end-user is that a single query could simultaneously search multiple gateways and return a single "hit" list. Therefore, the user would not have to run multiple searches against multiple servers. Less time, less effort.
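To make the "single query, single hit list" idea concrete, here is a hedged sketch in Python. Everything in it is assumed for illustration: the gateways are in-memory lists, the shared schema is reduced to two fields (`title` and `location`), and the `example.org` entries are invented. It is not how any actual gateway or the Dublin Core group implements searching; it only shows why agreeing on field names lets results be merged.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]                  # one metadata record in the shared schema
Gateway = Callable[[str], List[Record]]  # a gateway answers a query with records


def make_gateway(records: List[Record]) -> Gateway:
    """Build a toy gateway that matches the query against record titles."""
    def search(query: str) -> List[Record]:
        needle = query.lower()
        return [r for r in records if needle in r["title"].lower()]
    return search


def federated_search(query: str, gateways: List[Gateway]) -> List[Record]:
    """Send one query to every gateway and merge the answers into one hit list.

    Records are de-duplicated by location, which works only because every
    gateway describes resources with the same field names.
    """
    seen = set()
    hits: List[Record] = []
    for gateway in gateways:
        for record in gateway(query):
            if record["location"] not in seen:
                seen.add(record["location"])
                hits.append(record)
    return hits


# Two hypothetical gateways that both hold the State of the Union address.
gateway_a = make_gateway([
    {"title": "1997 State of the Union Address: William J. Clinton",
     "location": "http://www.whitehouse.gov/WH/SOU97/"},
    {"title": "Hypothetical Union Catalog of Maps",
     "location": "http://example.org/maps"},
])
gateway_b = make_gateway([
    {"title": "1997 State of the Union Address: William J. Clinton",
     "location": "http://www.whitehouse.gov/WH/SOU97/"},
])

hits = federated_search("union", [gateway_a, gateway_b])
for hit in hits:
    print(hit["title"], "->", hit["location"])
```

The user issues one search; the duplicate held by both gateways appears once in the merged list.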

However, just as we don't know the number of Web pages on the Internet, we also don't know much about the types of resources that are available on the Web. Just as it is important to know the format of an item when using materials in a traditional library (e.g., film, book, puzzle, microfiche, etc.), knowing the format for an electronic resource (audio file, document, graphic, etc.) is equally important. Various descriptors have been proposed, and they range from the very general Dublin Core Resource Types Minimalist Draft (http://sunsite.berkeley.edu/Metadata/minimalist.html), dated July 17, 1997, which includes "text" and "image" to other more specific proposals which include descriptors such as "thesis" (with qualifiers such as: Honors, Masters, and Ph.D.), "unrefereed article," and "spectral data." Just as a film about a president would be very different from a puzzle depicting a president, so too would an unrefereed article about a president be different from a dissertation about a president.

There is an additional dimension to resource types. Given that the Internet distributes multimedia, knowing whether a resource requires a very fast Internet connection or speakers, as is the case for audiovisual materials, may determine whether the resource can even be used. For these reasons, knowledge of what types of resources exist must precede schema definition. How can we define a common schema for the exchange of resources when we don't know what's "out there"?

An examination of Signpost reveals seventeen general resource types. These can be divided into two general categories: form (reports, databases, etc.) and intellectual content (educational materials, bibliographies, etc.). As of October 4, 1997, Signpost contained resources of the following types. The number of resources for each type is also listed.

Resource Type (Description) Number in Signpost

Animation/Video (May include RealPlayer, QuickTimeVR, .mpeg, Shockwave, etc.) 53
Audio (May include RealAudio, .wav, etc.) 46
Bibliography (Citation list of print and electronic sources) 81
Chart/Table (Predominantly includes statistics and/or maps) 120
Conference/Solicitation (Proceedings, requests for proposals, etc.) 39
Database (Searchable collection of information) 177
Dictionary/Encyclopedia (List of words or terms followed by definitions or explanatory articles) 31
Directory (List of contact information for individuals, organizations, etc.) 55
Document (Reports, articles, reviews, etc.) 680
Educational Materials (Curriculum development, course materials, etc.) 87
FAQ (Frequently Asked Questions) 9
Graphics (Drawings, photos, etc.) 247
Journal/Newspaper (Tables of contents, articles, and/or whole works) 169
Library Catalog (Interactive electronic catalog of library materials) 17
Mailing List/Newsgroup (Email-based discussion groups, electronic bulletin boards, etc.) 167
Meta-site (Guides to Internet-based resources) 135
Software (Applications, program files, etc.) 58
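The counts in the table can be tallied with a few lines of Python. This is a sketch for the reader's convenience, not part of the Signpost software. The numbers are copied from the table; the variable names are assumptions. Because a resource may carry more than one type, the sum counts type assignments, not distinct resources.

```python
# Resource type counts in Signpost as of October 4, 1997 (from the table above).
type_counts = {
    "Animation/Video": 53,
    "Audio": 46,
    "Bibliography": 81,
    "Chart/Table": 120,
    "Conference/Solicitation": 39,
    "Database": 177,
    "Dictionary/Encyclopedia": 31,
    "Directory": 55,
    "Document": 680,
    "Educational Materials": 87,
    "FAQ": 9,
    "Graphics": 247,
    "Journal/Newspaper": 169,
    "Library Catalog": 17,
    "Mailing List/Newsgroup": 167,
    "Meta-site": 135,
    "Software": 58,
}

# Total number of type assignments (not distinct resources).
total_assignments = sum(type_counts.values())

# List the types from most to least common, with each type's share of assignments.
ranked = sorted(type_counts.items(), key=lambda item: item[1], reverse=True)
for name, count in ranked:
    print(f"{name:<26}{count:>5}{count / total_assignments:>8.1%}")

print(f"{'Total assignments':<26}{total_assignments:>5}")
```

Ranking the types this way shows at a glance that "Document" dominates the collection, which matters when drawing conclusions from a site-specific survey such as this one.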

Each cataloged resource contains one or more of these resource types to guide users. For example, the "1997 State of the Union Address: William J. Clinton" is a document, but it is also available in audio format, so it is assigned both of these types.
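Assigning several types to one resource means the type field is multi-valued. A minimal Python sketch of that idea follows. The `catalog` mapping and the `titles_of_type` helper are illustrative assumptions, and the two "Hypothetical" entries are invented. Only the State of the Union entry and its two types come from the text.

```python
from typing import Dict, FrozenSet, List

# Each title maps to the set of resource types assigned to it.
catalog: Dict[str, FrozenSet[str]] = {
    "1997 State of the Union Address: William J. Clinton":
        frozenset({"Document", "Audio"}),
    "Hypothetical Statistics Archive":      # invented entry for illustration
        frozenset({"Database", "Chart/Table"}),
    "Hypothetical Course Notes":            # invented entry for illustration
        frozenset({"Educational Materials"}),
}


def titles_of_type(resource_type: str) -> List[str]:
    """Return the titles, in alphabetical order, that carry the given type."""
    return sorted(
        title for title, types in catalog.items() if resource_type in types
    )


print(titles_of_type("Audio"))
print(titles_of_type("Document"))
```

The same address is returned whether the user asks for documents or for audio, so neither group of users misses it.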

Defining which resource types exist requires research. This research can be either site-specific, such as what has been detailed above, or can be a random sampling of multiple sites or of the Internet as a whole. What is detailed or what would be sampled needs to be defined and reported. Signpost is composed of resources taken from four Scout publications which have a defined audience: the research and education community with an emphasis on the U.S. scientific and engineering community. Any conclusions which might be drawn from the resource types in Signpost need to be understood within this context. Similarly a random sampling would also need to have defined parameters so that the reported results could be understood. Examples of parameters could be: U.S. or international resources; English-speaking or multilingual; research-level materials or all audiences; higher education or K-12; academic- and/or governmental- and/or commercial- and/or military- and/or organizational- and/or network-located resources. Parameters such as these must be known (or clearly defined) since any assumptions would also influence any conclusions.

While there is the temptation to granularize resource types, to be as specific as possible, this should be avoided. One of the resource types which has been discussed, though not a resource type used in Signpost, is "Home Page." But a Nobel prize winner would have a radically different Home Page than a fifth grader. Can both be successfully labeled simply as "Home Page?" Does the presence of such an ambiguous description enrich the end-user's understanding of the resource's content or form? Adding more detailed resource types should be optional.

The possibilities inherent in building large-scale gateways to electronic resources are tremendous. The technology for doing so is now becoming available, and defining metadata effectively is a critical step. However, if we are to realize widespread adoption of common resource types, they will need to be simple and useful for developers to implement, helpful in describing the resource, and easy enough for the end-user to understand.

Author's note: In March of 1998 another resource type, "Data Set," was defined in Signpost. It is used to describe partially processed or raw numerical data.
This article originally appeared as part of the End User's Corner, a featured column of InterNIC News, which was published monthly by Network Solutions, Inc. and InterNIC from May 1996 through March 1998. As of April 1998, End User's Corner will be published by the Internet Scout Project.

Copyright Susan Calcari and the University of Wisconsin Board of Regents, 1994-1998. Permission is granted to make and distribute verbatim copies of the End User's Corner provided the copyright notice and this paragraph is preserved on all copies. The Internet Scout Project provides information about the Internet to the US research and education community under a grant from the National Science Foundation, number NCR-9712140. The Government has certain rights in this material.

Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the University of Wisconsin - Madison or the National Science Foundation.

