Making Global Digital Libraries Work: Collection Services, Connectivity Regions, and Collection Views

Carl Lagoze, David Fielding, Sandra Payette

Department of Computer Science

Cornell University

E-mail: {lagoze, fielding, payette}@cs.cornell.edu

ABSTRACT

There are many technical challenges in designing the architecture of globally-distributed, federated digital libraries. This paper focuses on the problem of global resource discovery and describes a service architecture and server topology for improving the performance and reliability of that process. The technique described is based on three concepts. Connectivity regions are groups of sites with relatively good network connectivity. Collection services provide the necessary meta-information so that a group of digital library servers can interoperate as a collection. Collection views represent the configuration of the collection that conforms to connectivity regions. The work that is described here is based on experience with the NCSTRL international digital library of computer science research and is implemented as part of the Dienst architecture upon which NCSTRL is based.

KEYWORDS: digital library architecture, distributed searching, case studies

INTRODUCTION

For the past several years the Cornell Digital Library Research Group has been investigating the architecture of globally-distributed, federated digital libraries. In contrast to centralized or replicated stand-alone systems, these federated systems are composed of semi-autonomous services, distributed across the global Internet, that interoperate through an open protocol.

From the point of view of flexibility, extensibility, and scalability this federated model is preferable to self-contained, centralized systems (such as the current generation of library management systems that form the technical basis of modern libraries). Among the benefits of the federated model are:

1. Stakeholders can maintain control of digital objects (documents) in their own repositories.

2. Customized collections can be created by aggregating digital objects in these distributed repositories.

3. New value-added services can be created as the need arises.

4. The functionality of existing services can be enhanced in a modular fashion.

5. Services can be replicated to provide global accessibility.

6. Customized user interfaces (digital library gateways) can be created to provide community-tailored access to other distributed digital library services.

These advantages gained by modularity, interoperability, and distribution should not come at the cost of decreased usability or performance. As much as possible, users should be insulated from the physical distribution of the system and should be able to view the digital library as a single collection with uniform tools for search, retrieval, and display of information within. At the same time, the performance of the system should match user expectations. This "illusion of uniformity" should be maintained, whenever possible, in the face of poor and inconsistent network connectivity, variability in server load, inconsistent server administration, and other problems characteristic of distributed, decentralized systems.

Maintaining usability in the presence of such distribution is one of the key challenges for designing digital library architecture. Some of the aspects of this challenge have been extensively covered in the distributed systems literature [1]. However, issues of global scale and a high degree of component autonomy change the flavor of the digital library problem sufficiently to call for some new solutions.

In this paper we examine one aspect of the distributed digital library problem - distributed searching. In the present World Wide Web, virtually all tools for resource discovery are based on a centralized model. Typically, a central service creates and deploys a master index, and sometimes creates one or more replicas of the index. Although this model is currently prevalent, we argue that distributed searching will become increasingly necessary to overcome the constraints inherent in the centralized model. In particular, effective architectures for distributed searching must be developed to address:

• Issues of Scalability - As the global information space explodes, it has become increasingly difficult to collect indexing information and keep centralized indexes up-to-date. Commercial Web search providers are beginning to recognize this fact and it has even led to a recent commercial patent for distributed searching technology [2].

• Issues of Specificity - While interoperability among search or indexing sites is important, it is also vital that the information infrastructure accommodates the unique needs of specific communities. Such accommodation is best accomplished via separate service providers that can both cater to individual community needs (through custom metadata, specialized data formats, query languages, user interfaces, etc.) and interoperate on a global scale through open protocols.

• Issues of Intellectual Property - The current search infrastructure depends on the fact that almost all the items in the global information space are not encumbered by access restrictions. Certainly this will change as improved technology for digital object rights management evolves [3] [4] and, as a result, more restricted objects proliferate on the net. (In fact, one could argue that in the future the objects with the most value will be those that are not freely available.) In this case, it will become more difficult if not impossible for centralized search providers to collect indexing information by simply walking the global information space (in the fashion of current "web spiders"). As a result, resource discovery will depend on distributed indexing sites that are physically, logically, or legally linked (through licensing agreements) with sites of content providers.

Other researchers have investigated a variety of issues relevant to distributed searching. The distributed database community has a long history of investigating the optimal distribution of indexing information across LANs and controlled WANs [5]. Researchers in the digital library community have examined query translation issues [6], content summarization for query routing [7], and protocols for meta-searching and metadata collection [8].

This paper describes an architecture, and experience with that architecture, for distributing index servers1 on a global scale and disseminating meta-information on the location of those servers among participating servers. The architecture has three logical components. The first is a distributed collection service that identifies the index servers of a distributed digital library collection and manages meta-information about those servers. The second is a connectivity region, which is a set of nodes on the Internet with relatively good network connectivity (e.g., low latencies, infrequent partitioning). The last is a collection view, which is a perspective on the collection specific to a connectivity region.

1 Throughout this paper an index server is a server that collects meta-information about objects in the digital library collection and returns results (hit lists) in response to queries on that meta-information.

The architecture described here was developed out of our experiences at Cornell building a production globally distributed digital library. The structure of this paper reflects the development path of the Cornell work. First, we briefly summarize NCSTRL, our global testbed, and Dienst, the technology on which that testbed is based. This section includes a description of the initial Dienst collection service, which forms the basis for the expanded collection service described later in the paper. We then describe our early efforts, or mistakes (depending on your perspective), to deal with distributed resource discovery in NCSTRL. Following this we describe the current evolution of our distributed searching architecture, with an explanation of connectivity regions and how they are implemented using an enhanced collection service. We conclude by describing some future work and opportunities for research.

NCSTRL – THE TESTBED FOR A GLOBALLY DISTRIBUTED DIGITAL LIBRARY

The global digital library architecture described in this paper is the result of our work with Dienst [9], a protocol and a reference implementation for distributed digital object libraries. The initial Dienst system was designed and developed as part of the DARPA-sponsored CS-TR project [10], which investigated general digital library issues and, in particular, the technology for making technical reports digitally available from the participating institutions2.

At the conclusion of CS-TR funding, members and participants in WATERS [11], one of several other efforts to create a digital library of computer science technical reports, joined with developers of Dienst and other members of CS-TR to form NCSTRL3 (Networked Computer Science Technical Reports Library). At the time of publication of this paper (January, 1998), NCSTRL has grown to include collections from over 100 institutions with over 60 servers world-wide. The globally distributed nature of NCSTRL and the federated, open architecture of Dienst on which it is based represent a unique testbed for ongoing digital library experiments. Those experiments are both of a technical nature, such as those described in this paper, and of a social nature, exploring the organizational aspects of loosely federated systems.

2 U.C. Berkeley, Carnegie-Mellon, Cornell, M.I.T., and Stanford.

3 Pronounced "ancestral".

Figure 1 - Dienst Services

DIENST ARCHITECTURE

The remainder of this section summarizes the Dienst architecture that underlies NCSTRL. A more detailed description of Dienst can be found in the Implementation Reference Manual [12]. The fundamental features of the architecture are a logical document model, distributed digital library services, and an open protocol for interoperation among those services.

Logical Document Model. At the core of the Dienst architecture is the notion of a document, a logical abstraction4 that incorporates a number of concepts.

• Each document has a globally unique name that is defined using the handle service [13].

• A document consists of a number of components. The two components currently in use in NCSTRL are the bibliographic description and the "body" of the document.

• Each component is available in one or more formats. For example, the body of the document may be available in PostScript, HTML, and as a set of TIFF images.

• A component in a format may be divided into a number of decompositions. For example, the "body" available in PostScript format may be divided into "pages".

4 By logical we mean that the abstraction is distinct from the "physical" one-to-one mapping of document to file that exists in file systems (local or distributed), FTP, or HTTP (without CGI).

Digital Library Services. The functionality of the Dienst architecture is logically divided among a set of distinct services. Although the services are modular in nature, they are currently implemented as a single physical server. This mapping was merely a matter of expediency, and our current research and development efforts are motivated by our belief that the digital library service structure should be physically, as well as logically, modular.

There are three core Dienst services, and one collection management service that we describe in the next section. The core services are:

• the Repository Service that stores and provides access to documents, identified using the global naming service and structured via the document model described earlier,

• the Index Service that stores indexing (meta) information about documents in the collection and responds to queries on this indexed information, and

• the User Interface Service that provides a human front-end to the other services.

Open Protocol. Dienst services and servers interoperate using a well-defined protocol [14]. This protocol is structured around the logical service notion described above. Each protocol request is framed as a verb to a service.

Figure 1 illustrates the protocol-based interactions between the core Dienst services for search and retrieval of a document. The user interface service acts as the mediator between the user’s browser and the multiple index and repository services in a distributed collection. Requests are made through verbs addressed to one of the three core services. For example:

• The Search verb of the index service returns a hit list (or brief citation list) of documents meeting specified search criteria in the respective index.

• The Fetch verb of the repository service returns a dissemination of a specified document identified by its unique identifier (or handle). Arguments to the Fetch verb conform to the concepts in the Dienst document model. For example, a Fetch request may specify that page one of the document body should be returned in GIF format.

The use of an open protocol has two key advantages. First, it allows other value-added services to be constructed and to interact with existing Dienst servers. Second, it allows individual services and, in fact, the entire Dienst implementation to be replaced as other alternative implementations are developed. An open protocol is also a defining feature of the collection service we describe next. This service is critical to the management of a distributed digital library collection in the Dienst architecture.
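The document model and the Fetch verb can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class names, the `fetch` function and its parameters, and the handle string are hypothetical and are not part of the Dienst protocol, whose actual request syntax is defined in [14].

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Component:
    """One component of a document, e.g. the bibliography or the body."""
    # format name -> decomposition name -> ordered parts (hypothetical layout)
    formats: Dict[str, Dict[str, List[bytes]]] = field(default_factory=dict)


@dataclass
class Document:
    handle: str  # globally unique name
    components: Dict[str, Component] = field(default_factory=dict)


def fetch(doc, component, fmt, decomposition=None, index=None):
    """Return one dissemination of a document, in the spirit of the Fetch verb."""
    available = doc.components[component].formats
    if fmt not in available:
        raise KeyError(f"{component!r} is not available in format {fmt!r}")
    if decomposition is None:
        return available[fmt]          # the whole component in that format
    parts = available[fmt][decomposition]
    return parts if index is None else parts[index - 1]  # 1-based numbering


doc = Document(
    handle="example.handle/TR-0001",  # made-up handle, for illustration only
    components={
        "bibliography": Component({"text": {"whole": [b"title, authors"]}}),
        "body": Component({"gif": {"pages": [b"<page 1>", b"<page 2>"]}}),
    },
)

# "Page one of the document body in GIF format"
first_page = fetch(doc, "body", "gif", "pages", 1)
```

The nesting (document, component, format, decomposition) mirrors the four concepts listed for the logical document model; a real repository would of course store and convert content rather than hold bytes in memory.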

The Collection Service: Defining the Contents of the Digital Library

A distinguishing aspect of a library (digital or otherwise) is management of collections. Management of the collection begins with selection of the objects to be included in the collection. Objects are selected from a global information space (e.g., the set of all published books, or the set of all objects on the Internet), and become constituents of library collections based on criteria applied by selectors or collection managers. Depending on the sophistication of the library, there may be other collection management functions such as preservation, archiving, and the like.

Given this understanding of a library, the World Wide Web, by itself, is NOT a digital library. It represents a set of objects joined together technically (by the common protocol HTTP), but not by any collection management actions. Similarly, a set of documents residing on servers communicating via the Dienst protocol does not compose a digital library.

Thus, digital libraries cannot be defined by the mere existence or application of enabling technologies. Digital libraries are distinguished from the more ubiquitous networked information landscape through their incorporation of collection management services, which may involve human intervention.

Even with collection management, the definition of what is actually “contained” in a digital library can become ambiguous. For instance, in the traditional library model, some librarians argue that physical containment of objects (e.g., in stacks) is the primary criterion for inclusion in the collection16. This notion of physical control breaks down in the networked environment of digital libraries where both overt and implicit linkages can be made between objects that reside in different physical locations. For example, if object A is included in a collection, are objects B, C, and D that are linked to object A also included in the collection? If so, are all objects transitively linked to object A via other objects also included? The answer to these questions has important implications in areas such as legal responsibility and public service.

While there are, undoubtedly, multiple perspectives on the definition of digital library collections, in this paper we will adopt the following working definition. An object is "in" a digital library's collection if it can be directly discovered using the resource discovery tools defined and implemented by the respective digital library5.

We emphasize the distinction between discovery and retrieval in this definition. First, discovery of the object may mean that a surrogate of the actual object may be indexed, and the actual object (which may be a physical artifact) must be retrieved through other means. Second, assuming a global name space, any object in the global information space may be retrievable (using its URN) without necessarily being in the library from which it is being fetched. One can think of this type of retrieval as a type of digital "inter-library loan".

Another interesting aspect of collection building is the level at which “inclusion” is evaluated. At the lowest level, individual digital objects are aggregated to form a (sub)-collection. At a higher level, multiple (sub)-collections of items are federated to form larger collections.

NCSTRL is a working example of multiple levels of collection management. The Dienst architecture provides for institutional autonomy in item-level collection building, and the capability for institutions to federate into the larger NCSTRL collection. The Dienst collection service is the mechanism for managing the federation level of collection definition. The data for managing the collection is obtained via protocol requests to this service that return the following information:

• The list of organizations that are part of the collection. In NCSTRL the granularity of an organization corresponds to the computer science departments and research institutions that are members of NCSTRL (e.g., Cornell Computer Science Department, Georgia Institute of Technology College of Computing).

• The network location. The service provides the address and port of the Dienst index servers that store indexing information for each organization. For example, indexing information for Cornell Computer Science may be stored at foo.ncstrl.org port 80 and bar.ncstrl.org port 8083.

• Meta-information about each of the index servers. At present this meta-information indicates whether the index server should be considered primary or secondary. However, our intention is to expand this meta-information to include data about last update of the index, performance information, content summaries, and the like.

5 This brings up the interesting question whether the set of objects discoverable through one of the web search services is part of the digital library defined by that service. The nature of digital libraries and their collections provokes many interesting questions.
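As a rough illustration, the three kinds of information returned by the collection service could be held in a table like the one below. The record layout and the helper names (`IndexServer`, `organizations`, `servers_for`) are assumptions made for this sketch; the actual response format is defined by the Dienst protocol and is not reproduced here. The host names and ports are the examples given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexServer:
    host: str
    port: int
    rank: str  # "primary" or "secondary" (the only meta-information at present)


# organization -> index servers that store its indexing information
COLLECTION = {
    "Cornell Computer Science": [
        IndexServer("foo.ncstrl.org", 80, "primary"),
        IndexServer("bar.ncstrl.org", 8083, "secondary"),
    ],
}


def organizations(collection):
    """The list of organizations that are part of the collection."""
    return sorted(collection)


def servers_for(collection, organization, rank=None):
    """Index servers for one organization, optionally filtered by rank."""
    servers = collection.get(organization, [])
    return [s for s in servers if rank is None or s.rank == rank]


primary = servers_for(COLLECTION, "Cornell Computer Science", "primary")
```

Additional meta-information (last update, performance data, content summaries) would become extra fields on the server record in such a layout.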

From the administrative perspective, the collection service allows easy management of the NCSTRL collection. Organizations join NCSTRL by submitting an application to our collection librarian6 via the Web. Subsequent to the confirmation that the organization conforms to the collection profile (the institution should be a Ph.D. granting institution in computer science) and the installation of a Dienst protocol conforming server, the Dienst administrator at Cornell adds the institutional information to the collection service tables. This new institution then becomes visible to each Dienst user interface server after its next collection service request.

We originally implemented the collection service on a single Dienst server. In this configuration, the address and port number of the collection server is stored in the configuration file of each Dienst server. Periodically (every hour) each Dienst server issues a collection service protocol request to obtain the collection information, which we described earlier. The requesting Dienst server then stores the collection information internally in a table. At this point the user interface services have access to the current list of participating organizations (provided by the collection server).

Figure 2 - Interaction between collection service and other Dienst Services

Figure 2 illustrates the interaction between the collection service, user interface servers, and index servers in Dienst. As shown, each user interface server queries the collection server for collection information. For a specific query, an individual user interface (labeled UI1 in the figure) uses this collection information to determine which index servers should process the query.

From a user perspective, the latest organizations appear on the search form provided by the user interface service. When composing queries to the NCSTRL collection, users choose which organizations should be included in the search results. The respective user interface service can determine where queries should be dispatched by using the network location and contents data provided by the collection service. Once this information is obtained, the user interface service submits the actual query to the target index servers using a Dienst index server protocol request. When responses are returned from the target index servers, the user interface service merges the responses into a single result set.

6 Rebecca Wesley at Stanford University.
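The dispatch-and-merge step performed by a user interface service can be sketched as follows. This is a simplified illustration, not the Dienst implementation: `send` stands in for an index server protocol request, the in-memory `FAKE_INDEXES` replaces real servers, and dropping duplicate hits is one possible reading of "a single result set". Host names other than foo.ncstrl.org are made up.

```python
def search(query, organizations, locations, send):
    """Dispatch `query` to the index server of each chosen organization
    and merge the returned hit lists into one result list."""
    merged, seen = [], set()
    for org in organizations:
        for hit in send(locations[org], query):
            if hit not in seen:        # keep each hit once
                seen.add(hit)
                merged.append(hit)
    return merged


# Stand-in index servers: address -> indexed titles (illustrative data only)
FAKE_INDEXES = {
    ("foo.ncstrl.org", 80): ["TR-1 Digital libraries", "TR-2 Compilers"],
    ("baz.example.org", 80): ["TR-9 Federated digital libraries"],
}


def fake_send(address, query):
    """Pretend to send a search request and return a hit list."""
    return [h for h in FAKE_INDEXES[address] if query.lower() in h.lower()]


LOCATIONS = {
    "Cornell": ("foo.ncstrl.org", 80),
    "Other": ("baz.example.org", 80),
}
hits = search("digital", ["Cornell", "Other"], LOCATIONS, fake_send)
```

A production service would issue the requests concurrently and apply time-outs; the sequential loop is kept here only for readability.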

The next section of this paper describes the use of the collection server to implement two initial distributed search topologies in NCSTRL. The section that follows then describes a distributed version of the collection service based on connectivity regions, enabling globally distributed search.

THE EVOLUTION OF A DISTRIBUTED DIGITAL LIBRARY: EARLY EXPERIENCE

The flexibility of the collection service and its interaction with the user interface services allowed us to rapidly expand the NCSTRL collection from five sites in 1995 to the over 100 sites that currently exist. In the course of this rapid expansion, we implemented two initial distributed searching topologies: simple distributed searching and distributed searching with backup. In this section we briefly describe those topologies, and the lessons learned from deploying them in NCSTRL.

Some of the architectural solutions described in this section may, in hindsight, seem rather naïve and the results predictable. While that may be true, these solutions were developed and retrofitted onto a rapidly growing production distributed system. In addition, the experience gained from this incremental approach proved valuable and helped contribute to the architectural solutions described later in this paper. Finally, some have argued that given the present scale of the NCSTRL collection, centralized replicated searching remains the preferable and more predictable model [15]. This may also be true. However, as we argued earlier in this paper, the distributed searching problem will have to be investigated for future digital library infrastructure to operate, and NCSTRL has been and still is a unique testbed for researching those issues (production issues notwithstanding).

Simple Distributed Searching

The five institutions that participated in the CS-TR project and that participated in the initial Dienst-based collection shared two characteristics with implications for distributed searching reliability:

1. Connectivity. Among the five institutions connectivity was good; network down-time was minimal and latencies were fairly low.

2. Commitment. Due to joint funding within the CS-TR project, these five institutions shared a common interest and commitment to the success of the testbed technical report collection. As a result, the five servers and their contained collections were well administered.7

Based on the high technical and administrative reliability, we made the initial decision to implement a simple distributed searching topology. In this topology only one index server existed for each organization in the collection. In fact, the index server was resident in the same Dienst server as the document repository that it indexed. A search query from any of the user interface servers in the collection was, regardless of the origin of the search, dispatched to the same set of indexing servers. If an individual indexing server was unavailable (due to network failure or server failure) or overloaded (resulting in a time-out) the user was alerted that results could not be returned for the organization stored on that index server.

Figure 3 - Simple Distributed Search with Server Failure

This simple topology is illustrated in Figure 3, with a connection failure to one of the index servers. The loss of access to information resulting from unavailable or slowly responding servers was a motivation to introduce the notion of backup servers into the distributed searching scenario.

Distributed Searching with Backup

Even with a controlled set of servers, as was the case in the original CS-TR project, server failures occurred too often. As the size of the collection grew beyond the original five institutions, the number of failures increased dramatically. In fact, most search result sets were incomplete, showing one or more "unavailable organizations".

7 We strongly emphasize that the factors that contribute to the success of a federated library are not restricted to the technical domain. We have found throughout the existence of NCSTRL that poor management of a few individual servers in a federated system can seriously degrade the reliability and integrity of the entire system. Poor management can take a variety of forms including a server that is periodically unavailable, descriptive metadata that is incomplete or incorrect, a collection that is not kept up to date, or any number of other factors.

In response to this situation, we soon introduced replicated index servers, with a ranking of which server was primary, secondary, etc. This was done by extending the collection service protocol to indicate the priority order of a specific index server for a specific organization. For example, the protocol response might indicate that foo.ncstrl.org port 80 is the primary index server for the Cornell CS collection, but bar.ncstrl.org port 8083 is the secondary index server for that same collection. Using this information, an individual user interface server could then first distribute the search request to the appropriate set of primary index servers. In case of failure or time-outs, the user interface could then distribute the same request to the secondary index servers corresponding to the "unavailable organizations" in the primary phase of the search.

Figure 4 - Primary and Secondary Index Servers

Figure 4 shows an example of this primary and secondary (backup) index server topology. In the illustration, one of the index servers has failed, and the query is redirected to the secondary index for that site.
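The two-phase search with backup can be sketched as below. This is a hedged illustration rather than the NCSTRL code: `send` is a hypothetical stand-in for an index server request that raises `ConnectionError` or `TimeoutError` on failure, and the server addresses other than the two ncstrl.org examples are invented.

```python
def search_with_backup(query, ranked_servers, send):
    """ranked_servers maps organization -> list of addresses, best rank first.

    Phase 0 queries every primary server; each later phase retries only the
    organizations that were unavailable, using the next-ranked server.
    """
    results, unavailable = {}, []
    pending = list(ranked_servers)
    rank = 0
    while pending:
        retry = []
        for org in pending:
            servers = ranked_servers[org]
            if rank >= len(servers):
                unavailable.append(org)      # no server left to try
                continue
            try:
                results[org] = send(servers[rank], query)
            except (ConnectionError, TimeoutError):
                retry.append(org)            # try the next rank in the next phase
        pending = retry
        rank += 1
    return results, unavailable


DOWN = {"foo.ncstrl.org:80", "solo.example.org:80"}   # simulated failures


def flaky_send(address, query):
    if address in DOWN:
        raise ConnectionError(address)
    return [f"{query} hit from {address}"]


RANKED = {
    "Cornell CS": ["foo.ncstrl.org:80", "bar.ncstrl.org:8083"],
    "Org without backup": ["solo.example.org:80"],
}
results, unavailable = search_with_backup("dienst", RANKED, flaky_send)
```

With an empty list of secondaries for every organization, the same function reduces to the simple distributed search described earlier, in which a failed server immediately produces an "unavailable organization".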

Adaptive Routing between Primary and Secondary Index Servers

Experience with the backup index server topology demonstrated that in many cases poor performance or failure of an individual server persists over time. For example, a network or server failure is normally not repaired immediately. Rather than continuing to use a failing primary index server, it is preferable that the user interface server "remember" the failure of the respective server and change the rank ordering of the index servers in response.

Figure 5 - Connectivity Regions

To obtain such behavior we implemented a simple adaptive algorithm at each user interface server that keeps track of the success or failure history of each index server to which queries are routed. If a specific index server repeatedly fails within a specified period and a secondary index server exists for the organizations indexed by that server, the unreliable server is "demoted" and the appropriate backup index servers "promoted". This change in rank is left in place for a fixed period, after which the demotion and promotion are undone (but re-instated if the next retry results in another failure). In this manner, the overall response time to queries is relatively insulated from the effects of unreliable servers.
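A minimal sketch of such a demotion scheme follows. The text does not give the thresholds or time periods actually used, so `max_failures`, `window`, and `penalty` are hypothetical parameters, and the class itself is an illustration rather than the Dienst code. Time is passed in explicitly so the behavior is easy to follow.

```python
class AdaptiveRanking:
    """Rank the index servers of one organization by recent reliability."""

    def __init__(self, servers, max_failures=3, window=600.0, penalty=3600.0):
        self.servers = list(servers)      # configured order, primary first
        self.max_failures = max_failures  # failures that trigger a demotion
        self.window = window              # seconds over which failures count
        self.penalty = penalty            # seconds a demotion stays in place
        self.failures = {s: [] for s in self.servers}
        self.demoted_until = {}           # server -> time the demotion expires
        self.probation = set()            # servers just restored after demotion

    def record(self, server, ok, now):
        """Record the outcome of one query sent to `server` at time `now`."""
        if ok:
            self.failures[server] = []
            self.probation.discard(server)
            return
        recent = [t for t in self.failures[server] if now - t <= self.window]
        recent.append(now)
        self.failures[server] = recent
        repeated = len(recent) >= self.max_failures or server in self.probation
        if repeated and len(self.servers) > 1:   # demote only if a backup exists
            self.demoted_until[server] = now + self.penalty
            self.probation.discard(server)

    def order(self, now):
        """Return the servers in the order they should be tried at `now`."""
        for server, until in list(self.demoted_until.items()):
            if now >= until:                     # demotion expired: undo it
                del self.demoted_until[server]
                self.failures[server] = []
                self.probation.add(server)       # one more failure re-demotes
        trusted = [s for s in self.servers if s not in self.demoted_until]
        demoted = [s for s in self.servers if s in self.demoted_until]
        return trusted + demoted


ranking = AdaptiveRanking(["foo.ncstrl.org:80", "bar.ncstrl.org:8083"], max_failures=2)
ranking.record("foo.ncstrl.org:80", ok=False, now=0.0)
ranking.record("foo.ncstrl.org:80", ok=False, now=30.0)
current = ranking.order(now=31.0)   # the backup is now tried first
```

The `probation` set captures the "re-instated if the next retry results in another failure" rule: a server restored after its penalty period is demoted again on its first failure instead of needing `max_failures` fresh ones.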

CONNECTIVITY REGIONS AND DISTRIBUTED COLLECTION SERVICE

The concept of connectivity regions allows us to reframe the requirements for distributed searching in the following fashion. In the absence of network or server failures, query routing from a specific user interface site should be restricted to those index servers in the same connectivity region. In case of a failure, an alternative indexing server should be chosen either in the same region or in another region with which there is good connectivity.

Figure 5 illustrates a simple example of connectivity regions and the motivation behind them. In this figure there are two regions, labeled R1 and R2. Each region contains one user interface server, which dispatches queries and combines responses, and three index servers, which respond to queries. In the example, the indexed data in the collection is divided into four partitions, and the subscript(s) on each of the indexing servers indicates the partition(s) indexed at the index server. For example, index server I1 holds indexing information in partition 1 and index server I3,4 holds indexing information in partitions 3 and 4. As illustrated, indexing information is replicated so that queries can be routed within the region in which a user interface server is located. However, as also illustrated, a failure in an index server in a region (I3 in R2) may require routing of a query to an index server outside the region (I3,4 in R1).
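The routing rule for the Figure 5 example can be sketched as follows. The assignment of servers to regions below is an assumption inferred from the text (I3 is in R2 and I3,4 is in R1; the remaining placements are a plausible reading of the figure labels), and the `route` function and its `neighbours` parameter are hypothetical names for this illustration.

```python
# region -> index server label -> partitions it indexes (assumed layout)
REGIONS = {
    "R1": {"I1": {1}, "I2": {2}, "I3,4": {3, 4}},
    "R2": {"I1,2": {1, 2}, "I3": {3}, "I4": {4}},
}


def route(partition, home_region, regions, down=frozenset(), neighbours=()):
    """Pick an index server for `partition`, preferring the home region.

    `down` is a set of (region, server) pairs known to have failed;
    `neighbours` lists other regions with good connectivity, best first.
    Returns (region, server), or None if no server can answer.
    """
    for region in (home_region, *neighbours):
        for server, partitions in regions[region].items():
            if partition in partitions and (region, server) not in down:
                return region, server
    return None


normal = route(3, "R2", REGIONS)
failover = route(3, "R2", REGIONS, down={("R2", "I3")}, neighbours=("R1",))
```

In the failure-free case the query for partition 3 stays inside R2; once I3 is marked as failed, the same call falls back to I3,4 in the neighbouring region R1.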

A Distributed Collection Service and Collection Views

The addition of international partners to NCSTRL, and the resulting global deployment of Dienst servers, required rethinking the ranked index server topology described in the previous section. As is well known, global connectivity varies dramatically. In fact, the latency times between nodes can differ by several orders of magnitude. In addition, the patterns of connectivity are not necessarily geographically related. Points that are coincident in physical space may be "distant" in network space, as measured by reliability and speed of the connection. This disparity between geographic and electronic "proximity" often corresponds to patterns of telecommunication development over the past fifty years, which often corresponded to political and colonial patterns. An extreme case of this pattern is the fact that the phone (and network) connections from a developing country to its former colonial power are in most cases better than those to its neighbors (or, in fact, within its own country!).

We model the patterns of global connectivity through the notion of a connectivity region. A connectivity region is defined as a group of nodes on the network that among them have good connectivity8, relative to nodes outside of the region. At present, this definition is qualitative, but we plan to develop a more quantitative definition of the concept. The meaning and purpose of the connectivity abstraction is orthogonal to whether the region is statically or dynamically (adaptively) defined.

8 For the remainder of this paper we will define the quality of connectivity as a factor of both latency and reliability (resistance to failure).

Earlier in this paper, we described how Dienst user interface services use data from the collection service to determine where to route queries. The routing is both content based - which index servers can answer queries for the organization(s) specified in the query - and priority based - which index server(s) should be considered primary, secondary, etc. for the specific organization(s). As described, the original collection service was implemented within one server. All Dienst user interface servers in the collection used that single server as the source for collection data. Furthermore, the collection data supplied to each Dienst user interface server was identical.

In contrast, the connectivity region concept, illustrated in Figure 5, implies that the routing decisions made by different user interface servers may be based on different collection information. In the example, the user interface server in region R1 "believes" that the primary source for indexing information on partition 1 of the collection is at the index server labeled I1. On the other hand, the user interface server in region R2 "believes" that the primary source for that information is at the index server labeled I1,2. In other words the collection view, the meta-information about the contents of the collection, of the R1 user interface differs from that of the R2 user interface. A single collection, such as NCSTRL, may have multiple collection views, corresponding to the connectivity regions that have been defined for the servers in that collection.

In order to support the notion of multiple collection views, we re-implemented the Dienst collection service in a distributed manner. In this new implementation, the distributed collection service was divided into two logical server types.

1) Central Collection Server (CCS). There is a single central collection server that serves as the central point of management of the collection. This server stores the following information:

a) In the same manner as the original collection service implementation, the CCS contains a table defining all organizations in the collection.

b) The list of Dienst servers (identified by host and port) that are acting as regional collection servers. There is one regional collection server per connectivity region.

c) A set of collection views, each one corresponding to a defined connectivity region. Each collection view contains the list of index servers that should be used (along with their rank orders) by the user interface servers in that region.

2) Regional Collection Servers (RCS). As described above, there is one RCS per connectivity region. An RCS provides the same collection information to the user interface servers in its region as the original single-site collection service. That is, it returns to them the set of rank-ordered index servers that they should use for query routing. Unlike the original implementation, the RCS gets the information from the CCS, which returns the collection view that corresponds to that region.

Figure 6 illustrates the interactions between the CCS and RCS and Dienst user interface servers. As shown, the central collection server (labeled CCS) contains internal tables that store, for each connectivity region, the server address of the RCS for that region and the collection view that corresponds to that region. In the figure, the RCS labeled S1 (which is configured with the CCS as its collection server) submits a protocol request to the CCS to fetch a collection view. The CCS, recognizing S1 as the RCS for R1, returns the appropriate collection view. The user interface server in R1, which is configured with S1 as its collection server, then receives the correct collection view in response to its collection service protocol request to S1. It then uses the information in this collection view to make routing decisions to index servers. It should be noted that as network connectivity changes, the regional view can be re-defined at the CCS level. A region's collection view is modified once the RCS requests and receives new collection data from the CCS.

One final implementation note: if a server not listed with the CCS as an RCS submits a request to the CCS for a collection view, the CCS returns a view that is registered as the "default" collection view. In this fashion, any external service or agent can make use of the collection service for its own internal purposes.
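The division of labor between the CCS and an RCS, including the default view returned to unregistered requesters, can be sketched as follows. This is a minimal in-memory model under stated assumptions: the class names, method names, and host names are hypothetical, and the actual Dienst services exchange this information through protocol requests rather than direct method calls.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

ServerAddress = Tuple[str, int]  # (host, port)


@dataclass
class CollectionView:
    # Index servers in rank order: position 0 is the primary source.
    ranked_index_servers: List[str]


@dataclass
class CentralCollectionServer:
    organizations: Set[str] = field(default_factory=set)
    # RCS address -> name of the connectivity region it serves.
    region_of_rcs: Dict[ServerAddress, str] = field(default_factory=dict)
    # Region name -> collection view for that region.
    views: Dict[str, CollectionView] = field(default_factory=dict)
    default_view: CollectionView = field(
        default_factory=lambda: CollectionView([]))

    def get_view(self, requester: ServerAddress) -> CollectionView:
        """Return the requester's regional view, or the default view."""
        region = self.region_of_rcs.get(requester)
        if region is None:
            return self.default_view  # requester is not a registered RCS
        return self.views.get(region, self.default_view)


@dataclass
class RegionalCollectionServer:
    address: ServerAddress
    ccs: CentralCollectionServer
    view: Optional[CollectionView] = None

    def refresh(self) -> None:
        """Fetch the current view for this region from the CCS."""
        self.view = self.ccs.get_view(self.address)

    def get_view(self) -> CollectionView:
        """Serve the cached view to user interface servers in the region."""
        if self.view is None:
            self.refresh()
        return self.view
```

Because the RCS in this sketch caches its view, a change made at the CCS only reaches the region after the next `refresh()`, which mirrors the statement that a region's view is modified once the RCS requests and receives new collection data.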

Figure 6 - Interactions of CCS, RCS, and User Interface Server

Experience with the Architecture

At the time of completion of this paper, we have four operating regions within NCSTRL. A server operated by MTA/SZTAKI in Budapest, Hungary acts as an RCS for Dienst servers in Eastern Europe and Italy. A server operated by ICS/FORTH on Crete acts as an RCS for Dienst servers in Greece. A server operated by GMD in Bonn, Germany acts as an RCS for Dienst servers in Northern Europe. A server operated by Cornell Computer Science in Ithaca, NY acts as an RCS for North American and some European servers with good Trans-Atlantic connections. We have found that connectivity between the West (especially the San Francisco area) and East coasts of the United States is often as bad as between Europe and North America. Because of this, we are investigating breaking up the North American region into two or possibly three regions.

The current configuration of regions was based mainly on conjecture, informal experience, and the willingness of particular Dienst sites to assume the greater reliability responsibilities required by an RCS. Thus, any conclusions based on our experiences are preliminary. In any case, we have found that the perceived reliability of the Dienst system, as measured from any of the user interface gateways, has improved dramatically. In addition, the architecture has proven quite easy to manage and adjust. Modifications to tables in the CCS are quickly propagated to the RCSs and thus to the Dienst servers in those regions. A server can easily be moved from one region to the next and the effect of unreliable servers can be isolated.

Our initial implementation of connectivity regions has uncovered a number of problems that we intend to address in future implementations.

• The implementation was retrofitted on top of a Dienst protocol and Dienst servers that pre-dated the regional architecture. Because of this we had to make a number of implementation compromises to avoid "breaking" legacy systems (footnote 9). One example of a problem that we have had is the imprecise and insufficient information about database freshness supplied by existing Dienst servers. This has made it difficult for us to propagate up-to-date replicas of indexing data between index sites.

• Connectivity problems remain troublesome. For example, the network speed between our Hungarian RCS and other Dienst systems is sometimes so bad that it is impossible to update index servers in that region.

• Server administration problems make it difficult to maintain index server integrity. When the primary source of indexing information is frequently unavailable or the quality of records is inconsistent, it is impossible to maintain useful replicas of that information.

As one strategy for eliminating these problems we are planning to logically segregate our production system from our research testbed. In this manner we can maintain production NCSTRL services, perhaps with a more centralized search strategy, and carry out research on isolated and controlled servers in the testbed. Ironically, the regional architecture can be used to create this segregation - in effect breaking off "production regions" from "research regions".

Finally, researchers outside of Cornell have experimented with the regional idea. For example, the MeDoc project has adapted the concept for defining content-specific regions or collections [16]. Researchers at ICS/FORTH in Greece and IBM-Watson have used it in experimentation on QoS-based Searching and Retrieval [17].

Footnote 9: This is not an uncommon problem with distributed software for which a satisfactory solution will need to be found.

CONCLUSIONS AND FUTURE WORK

As stated earlier in this paper, future digital library architectures will have to address the problems inherent in distributed searching. They will have to do this in the context of global connectivity patterns. Our experience with NCSTRL has shown that the digital library infrastructure must provide information that supports query routing decisions. Using this information, individual services can then algorithmically or heuristically decide the "best" destination(s) for protocol requests.

In the process of implementing and deploying Dienst and NCSTRL we have developed a number of useful abstractions for addressing this problem. This paper has described three concepts that together have allowed us to globally distribute the NCSTRL collection.

• The Collection Service defines for user interface gateways the location of servers to which resource discovery queries can be routed.

• Connectivity Regions define the division of the complete set of servers into groups with relatively good connectivity characteristics.

• A Collection View is a definition of the collection, framed as the location of index servers, that corresponds to the connectivity characteristics of a connectivity region.

Our experience with NCSTRL has shown that these concepts are the basis of a scalable architecture for global federated digital libraries. In the future we plan to explore several areas of research that build on these concepts, and apply and test them in a more formal manner.

Our initial configuration of connectivity regions was not based on any rigorous analysis of network connectivity. We would like to collect inter-server performance data to enable more sophisticated analyses of the connectivity patterns between Dienst servers that are part of NCSTRL. This could lead to the development of more quantitative metrics on how a connectivity region should be composed.

In reality, connectivity between nodes on the Internet is highly dynamic. While it is possible to statically configure connectivity regions based on amortized network and server behavior, it is preferable for the regions to adapt to changing connectivity and server load. We are currently exploring methods for sharing this information among servers, regional collection servers, and the central collection server to allow dynamic region configuration.

Finally, the whole area of adaptive query routing is fertile territory for research. Earlier in this paper we described a simple algorithm for changing query routing in response to index server failure. Failure history is one of many attributes that should be used to influence query routing. Others include real-time performance, cost, information freshness, or any combination of these. In the future we plan to investigate algorithms for routing based on various factors and the methods for disseminating and distributing metadata necessary to make such routing decisions.
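As a rough illustration of routing on a combination of attributes, the sketch below ranks candidate index servers by a weighted sum of failure history, response time, and replica staleness. It is not an algorithm from this work: the `ServerStats` fields, the linear scoring rule, and the default weights are all assumptions chosen for the example, and cost is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ServerStats:
    """Hypothetical per-server measurements available to a router."""
    name: str
    failure_rate: float    # fraction of recent requests that failed (0..1)
    mean_latency_s: float  # mean response time in seconds
    staleness_s: float     # age of the index replica in seconds


def score(stats: ServerStats,
          w_fail: float = 1.0,
          w_latency: float = 0.1,
          w_stale: float = 0.00001) -> float:
    """Weighted penalty for a server; lower is better.

    The weights are illustrative and would need tuning against real data.
    """
    return (w_fail * stats.failure_rate
            + w_latency * stats.mean_latency_s
            + w_stale * stats.staleness_s)


def rank_servers(candidates: Iterable[ServerStats], **weights: float) -> List[str]:
    """Return server names ordered from most to least preferred."""
    return [s.name for s in sorted(candidates, key=lambda s: score(s, **weights))]


# Example with made-up measurements.
candidates = [
    ServerStats("a", failure_rate=0.5, mean_latency_s=1.0, staleness_s=0.0),
    ServerStats("b", failure_rate=0.0, mean_latency_s=2.0, staleness_s=3600.0),
]
preferred = rank_servers(candidates)
```

Changing the weights changes the ordering, which is the point of the sketch: setting `w_fail=0.0` makes the faster but less reliable server the preferred one.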

ACKNOWLEDGMENTS

The work described in this paper was funded by the Defense Advanced Research Projects Agency under Grant No. MDA 972-96-1-006 with the Corporation for National Research Initiatives. This paper does not necessarily represent the views of CNRI or DARPA. We thank Bill Arms at CNRI for his helpful feedback and support. Finally, we acknowledge the substantial contributions of Jim Davis at Xerox PARC to the entire Dienst architecture, which have enabled this and other research within the Cornell Digital Library Research Group.

REFERENCES

1. Birman, Kenneth P., Building Secure and Reliable Network Applications, Prentice Hall, 1997.

2. Infoseek Patents Internet Search Technique, http://info.infoseek.com/doc/PressReleases/patent.html.

3. Arms, William Y., An Architecture for Information in Digital Libraries, D-lib Magazine (February 1997), http://www.dlib.org/dlib/february97/cnri/02arms.html.

4. Stefik, Mark, Letting Loose the Light: Igniting Commerce in Electronic Publication, in Internet Dreams, Mark Stefik ed., MIT Press, 1997.

5. Chu, Wesley W., Optimal File Allocation in a Multiple Computer System, IEEE Transactions on Computers (October 1969).

6. Chang, Chen-Chuan K. and Garcia-Molina, Hector, Evaluating the Cost of Boolean Query Mapping, in Proceedings of the Second ACM International Conference on Digital Libraries, ACM Press (1997).

7. Gravano, Luis, Garcia-Molina, Hector, and Tomasic, Anthony, The Effectiveness of GlOSS for the Text-Database Discovery Problem, in Proceedings of the 1994 ACM SIGMOD International Conference on the Management of Data, ACM Press (1994).

8. Gravano, Luis, Chang, Kevin, et al., STARTS: Stanford Protocol Proposal for Internet Retrieval and Search (January 1997), http://www-eb.stanford.edu/~gravano/starts.html.

9. Davis, James R., Krafft, Dean, and Lagoze, Carl, Dienst: Building a Production Technical Report Server, in Advances in Digital Libraries '95, Springer-Verlag (1995).

10. Computer Science Technical Reports Project, http://www.cnri.reston.va.us/projects/cs-tr

11. Maly, Kurt, French, J., et al., Wide Area Technical Report Service, Technical Report TR_94_13, Old Dominion University (1994).

12. Lagoze, Carl, Shaw, Erin, et al., Dienst: Implementation Reference Manual, Cornell Computer Science Technical Report TR95-1514 (1995).

13. The Handle System, http://www.handle.net.

14. Dienst protocol version 4.0, http://www.ncstrl.org/Dienst/htdocs/Info/protocol4.html.

15. Ed Fox, personal communication.

16. Adler, Steven, Berger, Uwe, et al., Grey literature and multiple collections, University of Hamburg Technical Report (January 1998).

17. Sairamesh, J., Kapidakis, S., et al., A Performance Framework for QoS based Searching and Retrieval in Digital Libraries, FORTH Technical Report TR204 (1997).
