Monday, June 30, 2008

Federated Search & Metadata Harvesting

Introduction

Digital librarianship is concerned with the body of knowledge relating to the collection, organization, storage, distribution, retrieval, and use of digital information. Digital libraries store materials in electronic format and manage large collections of those materials effectively. Today's digital world is characterized by access to information rather than ownership of it. Organized collections of scientific materials are traditionally called "libraries," and the searchable online versions of these are called "digital libraries." When digital libraries are networked, users want results from many libraries and sources distributed across the internet. This calls for a search facility that lets users search multiple data sources with a single query from a single user interface. This post discusses two such search services, federated searching and metadata harvesting. Libraries use different protocols for them, such as Z39.50, SRU/SRW, and OAI-PMH. These protocols are often tied to services in that they are specific ways of implementing a service (for example, one may create a harvesting service by using the OAI-PMH protocol), but they are protocols, not services in themselves.

Federated searching

Federated searching is a hot topic that is gaining traction in libraries everywhere. Also known as metasearching, broadcast searching, cross searching, or parallel searching, among other names, it is the ability to search multiple networked information resources from a single interface and to return an integrated set of results in that same interface, so that users can reach the information they are interested in. Aspects of this kind of shared searching have existed for some time, especially with Z39.50 catalogue searching. A federated search tool queries many resources at once and then presents the results from all of them to the user.

Features of federated searching:

Support for multiple protocols (Z39.50, SRU/SRW, OAI).
Simple and advanced search (search by specific field).
Post processing of results (combined results).
Integration with other software (courseware, bib management tools).
Advanced result display (clustering, visualization).
Context-sensitive linking (OpenURL).

OpenURL is a standard for context-sensitive linking from a citation to the full text. A link resolver finds which of the library's databases hold the full text and shows the user where it is (or takes them directly to it).
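As a rough illustration of context-sensitive linking, the sketch below encodes a journal citation as an OpenURL 1.0 key/value query string. The resolver address and the citation values are made up for the example; a real library would point at its own link resolver.

```python
from urllib.parse import urlencode

# Hypothetical link resolver address; each library runs its own.
RESOLVER_BASE = "https://resolver.example.edu/openurl"

def build_openurl(citation):
    """Encode a journal-article citation as an OpenURL 1.0 key/value query."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    }
    # Citation fields travel as rft.* ("referent") keys.
    for key, value in citation.items():
        params[f"rft.{key}"] = value
    return f"{RESOLVER_BASE}?{urlencode(params)}"

# Made-up citation used only to show the shape of the link.
link = build_openurl({
    "atitle": "An example article",
    "jtitle": "Example Journal",
    "issn": "1234-5678",
    "volume": "12",
    "spage": "34",
    "date": "2008",
})
print(link)
```

The resolver, not the citing page, decides where the link finally leads, which is what makes the link context-sensitive.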

Steps of federated searching:

  1. A user types a search query into the search box of the portal’s user interface and clicks “search”;
  2. The query is sent to the server, along with the list of resources (databases) that the user wishes to search;
  3. The server looks at each resource that it is asked to search and calls the appropriate search plug-in;
  4. Each plug-in queries its database and returns the search results to the server;
  5. The federated search tool then collects the search results and presents them to the user.

To carry out these steps, federated search software uses standardized protocols to access databases. The most common protocol is Z39.50. Target databases that do not comply with the Z39.50 standard can still be searched using "translator" programs that convert the query format of the federated system into the format of the native system.
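The steps above can be sketched in a few lines. This is only a toy model: the two plug-ins are hypothetical stand-ins that search small in-memory lists rather than real Z39.50 or SRU targets, and the "translator" is reduced to a change of letter case.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data standing in for two remote databases.
CATALOGUE = ["Digital Libraries", "Metadata Basics", "Library Automation"]
EJOURNALS = ["Federated Search in Digital Libraries", "Harvesting Metadata"]

def catalogue_plugin(query):
    """Plug-in for a target that understands the portal's query as-is."""
    return [{"source": "catalogue", "title": t}
            for t in CATALOGUE if query.lower() in t.lower()]

def ejournal_plugin(query):
    """Plug-in with a 'translator' step for a target with its own format."""
    native_query = query.upper()  # pretend this target needs upper case
    return [{"source": "ejournals", "title": t}
            for t in EJOURNALS if native_query in t.upper()]

PLUGINS = {"catalogue": catalogue_plugin, "ejournals": ejournal_plugin}

def federated_search(query, resources):
    """Send one query to every selected resource and merge the hits."""
    with ThreadPoolExecutor() as pool:
        # Step 3: call the appropriate plug-in for each resource, in parallel.
        futures = [pool.submit(PLUGINS[name], query) for name in resources]
        merged = []
        # Steps 4 and 5: collect each plug-in's results into one list.
        for future in futures:
            merged.extend(future.result())
    return merged

results = federated_search("digital libraries", ["catalogue", "ejournals"])
for hit in results:
    print(hit["source"], "-", hit["title"])
```

A real tool would also rank and de-duplicate the merged list; here the hits are simply concatenated in the order the resources were requested.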

Z39.50

Z39.50 is a client-server information retrieval protocol for searching and retrieving information from remote computer databases.

The distributed query approach benefits from the real-time nature of its queries: results are fetched at the moment the user asks, so they are fresher, and they can come from resources that belong to different vendors and follow different standards. It can also search flat text files available on the internet. The protocol supports combined searches such as Boolean and proximity queries, which helps users get exact results and full text. A digital library using this protocol translates the user's query, at the time it is presented, into acceptable queries for each target library and merges the returned hits into one page that is presented to the user. Z39.50 thus supports crosswalks between libraries and standards, provided the data provider's server is working. This makes it good for interoperability and for recent, exact results. Searching for and downloading bibliographic records with a Z39.50 tool is simple and efficient, since multiple sources can be searched simultaneously and records easily compared. It also allows users to establish relationships with a variety of sources.
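To make the query-translation step concrete, here is a small sketch that turns field/value pairs into a Boolean query in Prefix Query Format (PQF), the textual notation used by the YAZ toolkit for Z39.50 queries. It assumes the target supports the Bib-1 attribute set; the function names are invented for this example.

```python
# Bib-1 "use" attribute numbers for a few common access points.
BIB1_USE = {"title": 4, "isbn": 7, "subject": 21, "author": 1003, "any": 1016}

def term_to_pqf(field, value):
    """Render one field/value pair as a PQF attribute plus quoted term."""
    escaped = value.replace('"', '\\"')
    return f'@attr 1={BIB1_USE[field]} "{escaped}"'

def to_pqf(terms, operator="and"):
    """Combine (field, value) pairs with one Boolean operator in prefix form."""
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    if not terms:
        raise ValueError("at least one term is required")
    query = term_to_pqf(*terms[0])
    for field, value in terms[1:]:
        # PQF operators are binary and come first: @and <left> <right>
        query = f"@{operator} {query} {term_to_pqf(field, value)}"
    return query

pqf = to_pqf([("title", "digital libraries"), ("author", "Arms")])
print(pqf)
```

A federated search tool would run a translation like this once per target, since each target may expect a different attribute set or query syntax.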

Metadata harvesting (OAI-PMH)

In the harvesting approach, a service provider periodically harvests metadata from data providers using a predefined protocol such as OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting), developed by the Open Archives Initiative (OAI). The search service is then based on the harvested metadata. The OAI develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content; its work has since expanded to promote broad access to digital resources for users.

Steps of metadata harvesting:

1. A service provider periodically contacts data providers (repositories) using a predefined protocol.

2. It harvests metadata from those repositories, which may use different metadata standards.

3. The service provider (software) brings the harvested metadata back and indexes it to provide a federated search service.

4. Results are returned to the user in one standard format.

For metadata harvesting, OAI-PMH is the usual protocol of choice, and it too is based on the client-server model. The service provider runs harvester software that collects metadata from different libraries or repositories (the data providers). The service provider also indexes and abstracts the metadata, stores it in the digital library's own database, and answers user queries from that database. Results are therefore fast, because they come from a local database.
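A minimal harvester along these lines might look like the sketch below. It issues `ListRecords` requests for Dublin Core (`oai_dc`) metadata, follows resumption tokens, and builds a toy local index of title words. The function names and the injectable `fetch` parameter are choices made for this example, and error handling is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def http_fetch(url):
    """Default fetcher: download one OAI-PMH response as bytes."""
    with urlopen(url) as response:
        return response.read()

def harvest(base_url, fetch=http_fetch):
    """Yield (identifier, title) for every record, page by page."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(params)}"))
        for record in root.iter(f"{OAI}record"):
            identifier = record.findtext(f"{OAI}header/{OAI}identifier")
            title = record.findtext(f".//{DC}title")  # None if no title
            yield identifier, title
        token = root.findtext(f".//{OAI}resumptionToken")
        if not token:
            break  # missing or empty token: the list is complete
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token}

def build_index(records):
    """Toy local index: lower-cased title word -> set of record identifiers."""
    index = {}
    for identifier, title in records:
        for word in (title or "").lower().split():
            index.setdefault(word, set()).add(identifier)
    return index
```

With a real repository this would be called as `build_index(harvest(base_url))`, where `base_url` is the repository's OAI-PMH endpoint. User queries are then answered from the local index, which is why the harvesting approach responds quickly.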

In both federated search and metadata harvesting, data providers supply data from heterogeneous sources, and the results are presented to the user through a single search service. The difference lies in where the search happens: Z39.50 returns the latest results produced by the vendor (data provider) at query time, while OAI-PMH returns results from harvested metadata that resides in the internal database of the digital library.

The two protocols thus have different strengths, and they can be used together to give the user results that are both current and exact.

Feasibility of harvesting records using OAI-PMH and Z39.50:

The Z39.50 OAI Server Profile was developed to support harvesting of records using a simple OAI-PMH gateway. This profile describes how a Z39.50 server along with its associated bibliographic database could be turned into an OAI-PMH data provider by putting a gateway on top of the Z39.50 server that implements OAI-PMH (see Figure).

The gateway was designed to act simultaneously as a Z39.50 client and an OAI repository, translating OAI requests into Z39.50 requests and packaging the Z39.50 responses into OAI responses. This would require certain characteristics to be present in the underlying data structures and search mechanisms of the Z39.50 server implementations. In particular, it would require a unique identifier for each record, a way to provide a date stamp, and the means to retrieve records according to criteria specified in terms of these data. These requirements include the ability to export all of the records in the database, the ability to sort by record identifiers and system transaction dates, and the ability to filter results by a variety of date criteria. Although attribute values are defined in the Z39.50 standard for these processes, it does not follow that any given system will support them, and in particular not the part of the system designed for library patron use. Therefore, it would require a major development effort from the vendors. While the technique of harvesting directly from Z39.50 servers using OAI-PMH to obtain MARC records seemed to be an elegant solution in principle, developing relationships with the caretakers of these records and arranging for static harvests of records via FTP tools proved a more practical approach to procuring the records.
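The translation such a gateway performs can be hinted at with a small sketch. It is not the profile's actual mapping: it assumes the Z39.50 server supports Bib-1 use attribute 1012 (date/time last modified) with relation attributes 4 (greater than or equal) and 2 (less than or equal), writes the query in PQF notation, and passes date strings through unchanged even though a real server may expect another format.

```python
from urllib.parse import parse_qs

# Bib-1 attributes assumed for this sketch (a given server may not support them).
USE_MODIFIED = 1012     # use attribute: date/time last modified
REL_GE, REL_LE = 4, 2   # relation attributes: >= and <=

def oai_to_pqf(query_string):
    """Translate a date-windowed OAI-PMH ListRecords request into PQF."""
    args = {key: values[0] for key, values in parse_qs(query_string).items()}
    if args.get("verb") != "ListRecords":
        raise ValueError("this sketch only handles ListRecords")
    clauses = []
    if "from" in args:
        clauses.append(f'@attr 1={USE_MODIFIED} @attr 2={REL_GE} "{args["from"]}"')
    if "until" in args:
        clauses.append(f'@attr 1={USE_MODIFIED} @attr 2={REL_LE} "{args["until"]}"')
    if not clauses:
        # Exporting the whole database needs a different strategy.
        raise ValueError("this sketch needs a 'from' or 'until' date")
    if len(clauses) == 2:
        return f"@and {clauses[0]} {clauses[1]}"
    return clauses[0]

request = "verb=ListRecords&metadataPrefix=marc21&from=2008-01-01&until=2008-06-30"
print(oai_to_pqf(request))
```

The sketch covers only the request side. Packaging the Z39.50 responses as OAI records, with a unique identifier and date stamp for each, is the part that depends most on what the underlying server can supply.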

Limitations:

Federated search (metasearch) has several limitations, such as the different standards used by different digital libraries and duplicate records in the results. These problems are relatively minor. The bigger problem lies in simplifying the user’s experience: there are significant challenges in ensuring that precision and relevance of retrieval remain strong, and diverse opinions on how this should be accomplished. Is federated search the only solution to meeting those needs and expectations? What other approaches may be possible in a world of syndicated content via Really Simple Syndication (RSS) and OpenSearch as implemented in Amazon’s A9?

Really Simple Syndication (RSS):

RSS is a method that uses XML to distribute content from one web site to many other web sites.

A web site that wants to allow other sites to publish some of its content creates an RSS document and registers the document with an RSS publisher. A user that can read RSS-distributed content can use the content on a different site. Syndicated content can include data such as news feeds, event listings, news stories, headlines, project updates, excerpts from discussion forums, or even corporate information. RSS makes it possible for people to keep up with their favorite web sites in an automated manner that is easier than checking them manually. Each RSS text file contains both static information about the site and dynamic information about its new stories, all surrounded by matching start and end tags. When a site changes, its updated feed carries that information to the user's feed reader. With RSS, information on the internet becomes easier to find, and web developers can spread their information more easily to special interest groups. In this way RSS helps users reach information of interest to them. Since RSS data is small and fast-loading, it can easily be used with devices like cell phones or PDAs. In that sense it can play a role similar to that of the Z39.50 and OAI-PMH protocols in bringing information to the user.
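The structure described above, static channel information plus a list of items inside matching tags, is easy to see in a small example. The feed below is invented for illustration, and the reader function is a minimal sketch for RSS 2.0 only.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed: channel-level (static) information plus two items.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Library News</title>
    <link>https://library.example.org/</link>
    <description>New arrivals and events</description>
    <item>
      <title>New e-journals added</title>
      <link>https://library.example.org/news/1</link>
    </item>
    <item>
      <title>Reading room hours</title>
      <link>https://library.example.org/news/2</link>
    </item>
  </channel>
</rss>"""

def read_feed(xml_text):
    """Return the channel title and a list of (title, link) pairs for its items."""
    channel = ET.fromstring(xml_text).find("channel")
    items = [(item.findtext("title"), item.findtext("link"))
             for item in channel.findall("item")]
    return channel.findtext("title"), items

feed_title, items = read_feed(SAMPLE_FEED)
print(feed_title)
for title, link in items:
    print(" -", title, link)
```

A feed reader repeats this on a schedule and shows only the items it has not seen before, which is how users keep up with a site without visiting it.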

Conclusion:

So should search requests in a digital library be federated, or should the system be based on harvesting? Either way, a library can provide results from libraries that use different standards (Dublin Core, MARC, etc.). With Z39.50 the service provider does not store or process the records itself, but it can return the latest results. OAI-PMH returns results faster than Z39.50, and Z39.50 cannot return results at all if the remote server is slow, down, or has changed its name. In general, the harvesting approach has much better response time because only a local database has to be searched. The distributed query approach in general benefits from the real-time nature of the queries and produces fresher results. Both protocols therefore have limitations. To cope with these problems, RSS has emerged as a newer technique that gives users more, and more recent, information in a small, simple, fast-loading form. A library should choose one of these approaches, or combine all of them, according to the needs of its users.

with regards

Pallavi
