FEDORA Evaluation

From MetaWiki

FEDORA is an open-source system for long-term institutional document management (http://www.fedora-commons.org/).

Introduction

The main reason to consider FEDORA among the available media data and metadata management systems (especially those used in library and information sciences) was that both EOL and BHL intend to use it in the long run (to our knowledge, they are bound to do so by grant agreements). Choosing a system that is supported within the biodiversity informatics domain promises many synergies.

Fedora would handle versioning and metadata of images and sounds, and would come with web service (REST/SOAP) and API interfaces for free. It would probably satisfy our wish for OAI-PMH conformance out of the box. Fedora also seems able to meet our need for a metadata repository without data: the white paper (http://www.fedora.info/documents/WhitePaper/FedoraWhitePaper.pdf) says "data streams could also be an external reference to web content".

  • Manol@Bikam: A Fedora digital object can equally handle content that is stored locally or externally (through references). A metadata-repository-without-data must be the usual case.

The search service will index our objects with their properties and (remote) data streams (through disseminators), so we should not expect problems with storing or searching.

Fedora comes with only a rudimentary supporting application called 'Fedora Administrator' (see, e.g., http://www.fedora.info/download/2.2.1/userdocs/tutorials/tutorial2.pdf) that provides all API-M functionality for repository administrators. With this tool it is possible to ingest, search for and retrieve, modify, and purge data objects, and to build, search for and retrieve, modify, and purge behavior objects (Behavior Definitions and Behavior Mechanisms). This tool is well suited to occasional tasks or to testing behavior, but not to any production use of Fedora.

The Fedora client distribution comes with several command-line utilities that can be used to run some common operations without bringing up the GUI or writing your own SOAP client. There are third party applications that provide Front End/GUI, Middleware and utilities functionality (http://www.fedora.info/wiki/index.php/Fedora_Tools).

Another list of FEDORA-based projects is http://fedora.info/wiki/index.php/Fedora_User_Interface_Projects.

Resources

Use cases

FEDORA can handle both managed files (its primary purpose) and external resources/streams. The relevant scenario for our use of FEDORA for the metadata search repository (WP4) is:

a) We mass-import URLs with associated metadata, such as we collected for key to nature in the previous months.

b) We then construct / customize a search interface on this metadata repository. This search interface is decoupled from the repository through an API, web service or similar method, so that it can run in multiple or different contexts (e.g. a portal).

c) The search functionality should allow predicate questions both on Dublin Core elements and on more specifically defined elements from other namespaces. A possible question could be: "show me all dichotomous identification keys from Italy that contain species of Quercus", which in pseudo-SQL would correspond to: Key_Type="dichotomous" AND Country="it" AND (Taxon="Quercus" OR Taxon LIKE "Quercus %")
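
The intended semantics of such a predicate question can be sketched in Python as follows. This only illustrates the query logic and is not a Fedora API; the record layout, the field names `key_type`, `country` and `taxa`, and the example.org URLs are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One metadata record; all field names are hypothetical."""
    url: str
    key_type: str
    country: str
    taxa: list = field(default_factory=list)


def matches_genus(taxon: str, genus: str) -> bool:
    # Equivalent of: Taxon = "Quercus" OR Taxon LIKE "Quercus %"
    return taxon == genus or taxon.startswith(genus + " ")


def find_keys(records, key_type, country, genus):
    """Return the records that satisfy all three predicates."""
    return [
        r for r in records
        if r.key_type == key_type
        and r.country == country
        and any(matches_genus(t, genus) for t in r.taxa)
    ]


records = [
    Record("http://example.org/key1", "dichotomous", "it",
           ["Quercus ilex", "Fagus sylvatica"]),
    Record("http://example.org/key2", "dichotomous", "de", ["Quercus robur"]),
    Record("http://example.org/key3", "multi-access", "it", ["Quercus"]),
]

for r in find_keys(records, "dichotomous", "it", "Quercus"):
    print(r.url)
```

The trailing space in the prefix test matters: it keeps a genus such as "Quercusia" from matching a search for "Quercus".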


Evaluation questions and discussion

1. Is any software available among the freeware tools on top of FEDORA that supports importing such metadata sets? How much effort do you estimate it would take to get this working? Do we have to program this fully API-based, is there an example among the FEDORA tools that can be modified, or does a tool exist that (almost) does it?

  • Manol@Bikam: Objects can be ingested in either the Fedora Object XML (FOXML) format or the Metadata Encoding and Transmission Standard (METS). The 'Fedora Administrator' client provides a GUI and command line utility for ingesting objects into a Fedora repository. We can write our own client to perform ingest using Fedora API-M, and the appropriate SOAP calls.

  • Fedora also has a Directory Ingest Service that constructs Fedora objects from uploaded Submission Information Packages ("SIPs") and ingests those objects into a Fedora repository...
  • Gregor: A PowerPoint presentation on using METS for mass ingest into Fedora warns: "To have any success with using METS to ingest objects to a Fedora repository, start with the Fedora METS examples and forget about standard METS files." The presentation describes two cases (1 METS xml file per object, 1 METS xml file for all objects). The latter would suit us better. The cited source used diringest for this purpose and refers to http://www.paradigm.ac.uk/workbook/ingest/fedora-diringest.html, which seems to be a good description of the process with examples (albeit for DC metadata alone).

2. FEDORA requires Dublin Core metadata, but we need some more specific metadata. We have all the necessary expertise to create a custom W3C XML Schema or RDF schema, but given such a schema definition, how easy is it to use it with FEDORA?

  • Manol@Bikam: Our metadata are a superset of Dublin Core (DC). The data outside DC might be stored in Fedora by user-defined object properties/datastreams.
  • There is also a utility, 'Batch Metadata Transform/Reload', that transforms an XML file with a custom DTD into Dublin Core. We can use this (source code) to customize our metadata import (ingest) into the Fedora repository.

3. FEDORA already supports an OAI-PMH publisher (provider), so our data could easily be shared with other repositories. It does not provide a harvester though.

  • Manol@Bikam: I can't find a Fedora service/tool that does OAI-PMH aggregation (the OAI Provider Interface is indeed only a publisher). I think it is trivial to develop our own OAI-PMH 'harvester' if we really need it. For example, see http://www.openarchives.org/pmh/tools/tools.php for small tools (my.OAI, Net::OAI::Harvester, OAIHarvester2 ...) that may be used to harvest data from repositories. After that, we only need to ingest the data into Fedora through API-M (maybe).
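
To give an idea of what such a home-grown harvester involves, here is a minimal Python sketch of the core loop, assuming a standard OAI-PMH 2.0 endpoint. It is not one of the tools listed above. The function names `http_get` and `harvest` are made up for this sketch, and real use would also need error recovery, date-based selective harvesting and a step that turns each record into a Fedora ingest format.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_get(url: str) -> bytes:
    """Default fetcher; can be replaced, e.g. for testing without a network."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def harvest(base_url: str, metadata_prefix: str = "oai_dc", fetch=http_get):
    """Yield every <record> element, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        root = ET.fromstring(fetch(url))
        error = root.find(OAI_NS + "error")
        if error is not None:
            if error.get("code") == "noRecordsMatch":
                return  # an empty result is not a failure
            raise RuntimeError(f"OAI-PMH error: {error.get('code')}")
        list_records = root.find(OAI_NS + "ListRecords")
        if list_records is None:
            return
        yield from list_records.findall(OAI_NS + "record")
        token = list_records.find(OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # a missing or empty token marks the last page
        # A follow-up request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

A call such as `for record in harvest("http://example.org/oai"): ...` (with a hypothetical endpoint URL) would then iterate over all records of a repository page by page.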

4. Once the data (or rather the URIs) and the metadata are managed by FEDORA, how do you query them? Again, do we have to program this fully API-based, is there an example among the FEDORA tools that can be modified, or does a tool exist that (almost) does it? I assume FEDORA has API support for querying. Is this already available through SOAP or REST web services?

Manol@Bikam: Fedora has a 'Generic Search Service' (GSearch, http://www.fedora.info/download/2.2.1/services/genericsearch/doc/index.html) that supports REST and SOAP and has plugins for Lucene, Solr and Zebra, so we can search/access Fedora with a browser or a suitable client application, both of which are widely available.


Discussion in email 2008-04-02 from Manol:

> Gregor: What I haven't figured out yet is what to do to make such a metadata extension schema most functional with FEDORA. Does it have to be RDF?

Manol: Maybe not; I didn't see RDF applied to ingesting anywhere...

> Gregor: Or is this an advantage? I understand by now that METS can contain several metadata schemata, but I don't understand what you have to prepare in FEDORA to make sure that FEDORA understands this additional metadata.

Manol: If the additional metadata is in a managed datastream, then Fedora must understand it. As far as I can see, DC is insufficient for all real needs, and everyone uses their own specific metadata in addition to DC.

> Gregor: Gisela and I at some point had the suspicion that simple transformations (which only require xslt) could be defined in FEDORA, but I am not sure about this. Perhaps this is not relevant. The question could be: what is the easiest way to write xslt, which takes xml from several datastreams (DC, Fedora-native, Our K2N-Metadata) and combines this to a nice presentation?

> Gregor: It seems strange that in addition to BDO, you also have to add a disseminator to each object. I would expect this to be handled generically for "all images", "all videos" etc. Perhaps this can be done in the content models (which I think of as "document types")?

Manol: Yes, you are absolutely right! I am also puzzled by the need to explicitly add a disseminator to each object, rather than only setting the model... But in the next version this may be resolved by the Content Model Architecture...

> Gregor: I find lots of example how to use FEDORA with DC, but little with custom metadata. Can you point me to a good resource?

Manol: I explored this as application with custom metadata:


Workflow

Expected workflows (contributed by Manol):

(1) on setup: Key2Nature metadata --> Fedora input formats --> Fedora Ingesting

(2) regularly: More data harvesting --> Fedora input formats --> Fedora Ingesting

(3) regularly: Remote data updated --> Fedora input formats --> Fedora update affected objects

where:

  • Key2Nature metadata is data in Key2Nature metadata xml;
  • Fedora input formats are ingesting formats like FOXML, METS,... or some ingesting service;
  • Fedora Ingesting is the ingest process itself;
  • Remote data updated is data that was already ingested in Fedora (and that Fedora references);
  • Fedora update affected objects is the process of updating some Fedora objects.

Comment on workflow number (3): we probably need this flow for synchronizing already ingested data...
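
One possible way to decide, on each regular run, which records go through workflow (2) and which through workflow (3) is to compare a checksum of the harvested metadata with the one recorded at the last ingest. The Python sketch below illustrates only that decision. The bookkeeping table `known` and the function names are assumptions of this sketch and not part of Fedora.

```python
import hashlib


def fingerprint(metadata_xml: str) -> str:
    """Stable checksum of a record's metadata, used to detect changes."""
    return hashlib.sha256(metadata_xml.encode("utf-8")).hexdigest()


def plan_sync(harvested: dict, known: dict):
    """Split harvested records into those to ingest and those to update.

    harvested maps an identifier to its current metadata XML.
    known maps identifiers already in the repository to the fingerprint
    recorded at the last ingest (a hypothetical bookkeeping table).
    """
    to_ingest, to_update = [], []
    for identifier, xml_text in harvested.items():
        if identifier not in known:
            to_ingest.append(identifier)    # new object: workflow (2)
        elif known[identifier] != fingerprint(xml_text):
            to_update.append(identifier)    # changed remotely: workflow (3)
    return to_ingest, to_update
```

Records whose fingerprint is unchanged are skipped, which keeps regular runs cheap when most remote data is stable.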


Time estimates

  • Setup Fedora (and its environment) with default settings: a few hours.
  • Create a schema for Key2Nature metadata (based on Resource Metadata Exchange Agreement, needs to be done anyways)
  • Create a custom document type for media with Key2Nature metadata xml data stream in addition to DublinCore metadata
  • Create an xslt that derives a minimal DC document from our metadata schema, wire this into the document type (I understand this relates to your comment on 'Batch Metadata Transform/Reload')
  • Convert the existing media metadata we have already aggregated into that schema

Manol: This is not very clear to me. I think this transformation (k2n schema -> [xslt] -> custom document type) serves to generate the Fedora ingest (import) file format, doesn't it?
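
To make the "derive a minimal DC document" step more concrete, here is a rough Python equivalent of what such an xslt would do. It is only a sketch: the element names follow the K2N test datastream shown in the test results further below, the mapping table `K2N_TO_DC` is a guess made for this sketch, and a real implementation would more likely be an XSLT stylesheet.

```python
import xml.etree.ElementTree as ET

K2N = "http://key2nature.eu/ns/test"            # test namespace
DC = "http://purl.org/dc/elements/1.1/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
ET.register_namespace("dc", DC)
ET.register_namespace("oai_dc", OAI_DC)

# Hypothetical mapping from K2N element names to Dublin Core elements.
K2N_TO_DC = {
    "identifier": "identifier",
    "Title": "title",
    "Caption": "description",
    "Type": "type",
    "rights": "rights",
}


def derive_dc(k2n_xml: str) -> ET.Element:
    """Build a minimal oai_dc:dc element from a K2N metadata document."""
    source = ET.fromstring(k2n_xml)
    dc_root = ET.Element(f"{{{OAI_DC}}}dc")
    for child in source:
        # Tags look like "{namespace}LocalName"; other namespaces are skipped.
        namespace, _, local = child.tag[1:].partition("}")
        if namespace == K2N and local in K2N_TO_DC and child.text:
            target = ET.SubElement(dc_root, f"{{{DC}}}{K2N_TO_DC[local]}")
            target.text = child.text.strip()
    return dc_root


sample = """<k2n:desc xmlns:k2n="http://key2nature.eu/ns/test">
  <k2n:Title>Genus Antipathes</k2n:Title>
  <k2n:ScientificNames>Antipathes cf curvata</k2n:ScientificNames>
  <k2n:Type>StillImage</k2n:Type>
</k2n:desc>"""
print(ET.tostring(derive_dc(sample), encoding="unicode"))
```

Elements without a DC counterpart (here k2n:ScientificNames) are simply left out of the derived document; they stay available in the K2N datastream itself.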

To prepare Fedora for fast mass-import of digital objects and their metadata, we have to:

  • select the proper method of ingest
  • set up/tune Fedora storage (RDBMS that is in use)
  • tune/turn off services (like the search service) that would slow down the process

(estimated effort for this: 10-20 days)

  • Test the search webservices, wire them to the external search user interface (which currently is planned to be implemented in Adobe FLEX 3).

To prepare Fedora for production use:

  • set up/tune Fedora storage (RDBMS that is in use)
  • tune the search service
  • tune the web service interfaces that Flex3 will contact

(estimated effort for this: 10-20 days)


Test driving Fedora

Gregor: The biggest obstacle to a final evaluation of Fedora seems to be that Key to Nature wants to use more specific metadata than the very rudimentary, lowest-common-denominator Dublin Core standard, whereas Fedora is highly geared towards supporting only Dublin Core. In all aspects (batch import, indexing, search facility) it seems possible to use extended metadata with Fedora through alternative methods (ProOAI instead of OAI-PMH, Resource Index = Semantic Web/RDF triple store instead of the Fedora Search Interface), but I am still at a loss to find documentation on how to handle this in practice, and how it will work.

I therefore propose the following steps within the next 4 weeks (sic!):

  • Install Fedora (please document all steps and links to the documentation used under FEDORA Installation);
  • Manually "batch import" an xml file with only a few items (3-10), where some DublinCore metadata (at least title) and some non-DublinCore metadata (at least scientific organism name list, Taxon group, CountryList) are present (please document all steps or issues, as well as links to the documentation used, under FEDORA Batch Import);
  • Test searching a combination like: "(dc:title contains 'GenusName' OR k2n:ScientificName starts with 'GenusName') AND (k2n:CountryList = 'ISO2LetterCode')", where "GenusName" is the name of a genus (for which a species name is in the imported data) and "ISO2LetterCode" is a country code like "it" for Italy (document this under FEDORA Batch Import#Search tests).

The imported records should be constructed so that such a search combination can be tested. I could imagine something like the following (where the semicolon indicates multiple values that can be searched independently, i.e. in xml they are represented as repeated elements, NOT as semicolons in a single string):

dc:identifier | dc:title                  | k2n:ScientificName | k2n:CountryList | URL
1             | Flowers of Silene italica | -                  | it; ch          | http://ip30.eti.uva.nl/bis/flora/pictures/silene%20italica%20ov.jpg
2             | Lichtnelke                | Silene italica     | it; ch          | http://www.funghiitaliani.it/uploads/post-5-1141817448.jpg
3             | Flowers of Silene italica | -                  | de              | http://flora.nhm-wien.ac.at/Bilder-P-Z/Silene-italica-1.jpg
4             | Lichtnelke                | Silene italica     | de              | http://flora.nhm-wien.ac.at/Bilder-Thumbnails/Silene-italica.jpg
5             | Melandrium                | Melandrium alba    | it; ch          | http://www.ckkaempfe.de/chr/2005-05-eifel/pd5241.jpg

For testing purposes, use "http://key2nature.eu/ns/test" as the namespace for our own metadata.
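
The "repeated elements, not semicolons" rule for the test records can be illustrated with a small Python sketch that turns one table row into K2N xml. The wrapper element name `desc` and the helper names are assumptions made for this sketch; only the namespace and the element names k2n:ScientificName and k2n:CountryList come from the text above.

```python
import xml.etree.ElementTree as ET

K2N = "http://key2nature.eu/ns/test"
ET.register_namespace("k2n", K2N)


def split_values(cell: str) -> list:
    """Turn 'it; ch' into ['it', 'ch']; '-' or an empty cell means no value."""
    return [v.strip() for v in cell.split(";") if v.strip() not in ("", "-")]


def to_k2n_xml(scientific_names: str, countries: str) -> str:
    """Serialize one table row; each value becomes its own element."""
    root = ET.Element(f"{{{K2N}}}desc")  # wrapper name is an assumption
    for name in split_values(scientific_names):
        ET.SubElement(root, f"{{{K2N}}}ScientificName").text = name
    for code in split_values(countries):
        ET.SubElement(root, f"{{{K2N}}}CountryList").text = code
    return ET.tostring(root, encoding="unicode")


# Row 2 of the table: two country codes become two CountryList elements.
print(to_k2n_xml("Silene italica", "it; ch"))
```

With this layout a search for k2n:CountryList = "ch" can match the second value directly, without any substring matching on "it; ch".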

QUESTIONS:

  • Can we manage this?
  • Can we actually install Fedora on the WP4 Server? The server is currently bare bones..., nothing installed yet.
  • Who is a server expert able to install a Linux server? Gisela and Gregor are Linux-agnostic...
  • Or should we install Fedora on some test system first and learn from that, perhaps to do it better afterwards?



Testing Results on Generic Search Service for Fedora 3.0.

by Lia Veja


After the installation, the REST-based web service is ready to use. Gsearch provides 5 operations:

  • updateIndex
  • gfindObjects
  • browseIndex
  • getRepositoryInfo
  • getIndexInfo

We are interested now only in the following 3 operations:

1. updateIndex – builds the index from scratch or rebuilds it from the FOXML files. This operation can be customized by editing the stylesheets in:

.../webapps/<WEBAPPNAME>/WEB-INF/classes/config/index/BasicIndex  and:
.../webapps/<WEBAPPNAME>/WEB-INF/classes/configBasic/index/BasicIndex

For testing purposes, we ingested 37 different objects into Fedora 3.0, with the following structure:

an unmodified DC datastream (only DC:description and DC:title), RELS-EXT with default relationships (for testing purposes only), and one datastream with K2N metadata, named DESC1, with MIME type text/xml. The XML exported from the DESC1 datastream is:

<k2n:desc xmlns:k2n="http://key2nature.eu/ns/test-rels-ext/">
  <k2n:identifier scheme="URN">K2N:Antipathes_curvata_2</k2n:identifier>
  <k2n:rights type="use">unrestricted</k2n:rights>
  <k2n:ScientificNames>Antipathes cf curvata orange 3</k2n:ScientificNames>
  <k2n:CountryCodes>marine</k2n:CountryCodes>
  <k2n:CountryCodes>ch</k2n:CountryCodes> <!-- testing purpose -->
  <k2n:CountryCodes>de</k2n:CountryCodes> <!-- testing purpose -->
  <k2n:Title type="main">Genus Antipathes</k2n:Title>
  <k2n:Caption>Antipathes cf curvata orange 3</k2n:Caption>
  <k2n:SubjectCategory>Cnidaria</k2n:SubjectCategory>
  <k2n:LowestCommonTaxon>Genus Antipathes</k2n:LowestCommonTaxon>
  <k2n:Type>StillImage</k2n:Type>
</k2n:desc>

By tailoring the stylesheet files, Gsearch can be customized to index all datastreams of type text/html, text/pdf, text/xml, etc.
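
Conceptually, the indexing stylesheet flattens such a datastream into a list of named index fields. The Python sketch below mimics only that idea and is not the Gsearch stylesheet itself (which is written in XSLT). The function name `index_fields` is made up, and the "alias.ElementName" naming is inferred from the field names used in the queries below.

```python
import xml.etree.ElementTree as ET


def index_fields(datastream_xml: str, alias: str = "k2") -> list:
    """Return (field name, value) pairs, one per non-empty child element."""
    root = ET.fromstring(datastream_xml)
    fields = []
    for child in root:
        local = child.tag.rpartition("}")[2]  # drop the "{namespace}" part
        text = (child.text or "").strip()
        if text:
            fields.append((f"{alias}.{local}", text))
    return fields


sample = """<k2n:desc xmlns:k2n="http://key2nature.eu/ns/test-rels-ext/">
  <k2n:ScientificNames>Antipathes cf curvata orange 3</k2n:ScientificNames>
  <k2n:CountryCodes>marine</k2n:CountryCodes>
  <k2n:CountryCodes>ch</k2n:CountryCodes>
</k2n:desc>"""

for name, value in index_fields(sample):
    print(name, "=", value)
```

Repeated elements such as k2n:CountryCodes yield several entries for the same field name, which is what allows each value to be searched independently.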

To see which fields are currently indexed according to the stylesheet-setup, go to browseIndex, e.g. http://212.201.100.117:8183/fedoragsearch/rest?operation=browseIndex.

2. browseIndex – This operation lists all fields available for searching and can perform simple queries, such as k2.ScientificNames:"value" (k2 is the alias for the K2N namespace).

3. gfindObjects – This is the search interface. The Gsearch syntax is as follows:

element1 RELOP element2 RELOP element3 … RELOP elementn

where RELOP must be AND, OR or NOT (uppercase is mandatory), and element1 … elementn can be any previously indexed field, e.g. k2.ScientificNames:value, k2.CountryCodes:value, etc.

  • Fields must be qualified by the alias of their namespace.
  • Parentheses (...) can be used to change operator precedence.
  • The value is the field value to search for; the rules can be summarized as follows:

The following expressions are equivalent:

  • k2.CountryCodes: Marine
  • k2.CountryCodes:"Marine"

The following are not:

  • k2.ScientificNames:"Podiceps auritus"
  • k2.ScientificNames: Podiceps auritus

In the second expression, it searches for both words, "Podiceps" and "auritus". It seems to behave as if a default OR operation were performed between two expressions:

k2.ScientificNames:"Podiceps" OR k2.ScientificNames:"auritus"

Wildcards that can be used are:

* for multiple characters replaced
? for a single character replaced
(You cannot use a * or ? symbol as the first character of a search.)
~ for fuzzy search (roam, foam...)
"^number" for a boost factor that controls the relevance level of a document by boosting its term, i. e.:
  k2.ScientificNames:Podiceps^8 auritus

The above expression means that the word "Podiceps" is more important than the word "auritus".

Proximity searches ("-" symbol) and range searches are allowed too.

When using wildcards, the value must not be enclosed in quotes ("value").
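
If queries are assembled by a user interface rather than typed by hand, the rules above can be enforced in code. The following Python sketch shows one possible way to do so; the helper names `format_term` and `combine` are made up, and the sketch deliberately rejects multi-word values with wildcards instead of guessing what the user meant.

```python
def format_term(field: str, value: str) -> str:
    """Return 'field:value', quoting phrases and rejecting bad wildcards."""
    value = value.strip()
    if not value:
        raise ValueError("empty search value")
    if '"' in value:
        raise ValueError("double quotes inside a value are not handled here")
    if value[0] in "*?":
        raise ValueError("a value must not start with * or ?")
    if "*" in value or "?" in value:
        if " " in value:
            # Quoting is not allowed with wildcards, and an unquoted
            # multi-word value would be split into separately searched words.
            raise ValueError("wildcards are only supported in single words")
        return f"{field}:{value}"
    if " " in value:
        return f'{field}:"{value}"'  # phrase search
    return f"{field}:{value}"


def combine(operator: str, *terms: str) -> str:
    """Join terms with AND/OR (uppercase) inside explicit parentheses."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return "(" + f" {operator} ".join(terms) + ")"


types = combine("OR", format_term("k2.Type", "stillimage"),
                format_term("k2.Type", "sound"))
print(combine("AND", types, format_term("k2.ScientificNames", "Podiceps auritus")))
```

Always emitting parentheses around each group avoids relying on the default precedence between AND and OR.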

On the test instance at the address: http://193.226.5.117/fedoragsearch/rest Gisela, Gregor and I tried queries like:

(k2.ScientificNames:Podiceps a*)  (35 hits)
(k2.LowestcommonTaxon:Podiceps a*)  (35 hits)
(k2.LowestcommonTaxon:Podiceps c*)  (59 hits)
(k2.LowestcommonTaxon:Antipathes c*)  (59 hits)
(k2.ScientificNames:Podiceps auritus)  (2 hits)
(k2.ScientificNames:"Podiceps auritus")  (1 hit)
(k2.Type:stillimage OR k2.Type:sound) AND (k2.CountryCodes:Marine)  (12 hits)
(k2.Type:sound) OR (k2.Type:stillimage) AND (k2.ScientificNames:pha* OR k2.ScientificNames:anti*)  (10 hits)
(k2.ScientificNames:Podiceps) AND (k2.ScientificNames:nig*)  (1 hit)
(k2.Type:stillimage OR k2.Type:sound) AND (k2.CountryCodes:Marine OR k2.CountryCodes:W. Palea~ctic) AND (k2.ScientificNames:pha* OR k2.ScientificNames:anti*)  (13 hits)
k2.ScientificNames:"Podiceps auritus"-1  (1 hit)
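
Such queries have to be URL-encoded before they are sent to the REST interface. A minimal Python sketch follows, assuming that the query text is passed in a parameter named `query` next to the `operation` parameter shown earlier. That parameter name and the localhost address are assumptions to check against the Gsearch documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def gfind_url(base_url: str, query: str) -> str:
    """Build a gfindObjects request URL with a properly encoded query."""
    params = {"operation": "gfindObjects", "query": query}
    return base_url + "?" + urlencode(params)


url = gfind_url(
    "http://localhost:8080/fedoragsearch/rest",  # hypothetical address
    "(k2.Type:stillimage OR k2.Type:sound) AND (k2.CountryCodes:Marine)",
)
print(url)

# Decoding the URL again gives back the original query text.
print(parse_qs(urlsplit(url).query)["query"][0])
```

Encoding matters here because the queries contain spaces, colons, quotes and parentheses, none of which are safe to paste into a URL unescaped.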

After the tests, the conclusions are:

  • If the query is built by a user interface, syntax errors should be avoided;
  • Gsearch can return more results than we need;
  • Gsearch should be used in combination with Resource Index Search to reduce the number of results, when appropriate. This approach should be implemented very carefully and needs to be discussed very soon.
  • Gsearch can be customized to do what we expect of it. Moreover, the search method can be overridden in the Gsearch source code.

Discussion

(Please update the information above if possible, use this area for signed + dated discussion contributions not appropriate to add above.)

Manol@Bikam: In conclusion, we suggest using Fedora. It is a good candidate for our job, and we prefer it over the other options. Fedora is a complete and stable system, it is supported by a community, and it can do all that we need. On top of that, it can be managed without difficulty, which addresses all the system management concerns we raised.
