Definition specs for Fedora Ingestion Service


Introduction

Fedora Commons can internally manage complex binary objects, but we use it only for its ability to manage external objects accessible through a URI. This page describes the details of ingesting metadata into the Fedora Commons metadata repository.

This page is the result of intensive discussions between G. Hagedorn, C. Veja, and G. Weber, summarized here for further discussion.

Data sources

We plan to ingest from various data sources, including RDF, XML, and other formats. We definitely plan to support the MRTG format, and perhaps the Morphbank XML exchange format. As a first step, however, we currently support the low-tech-oriented wiki upload format.

Syntax and Namespaces

At the moment, we use the Dublin Core namespace for a few metadata items and a proprietary "k2n" namespace for the remaining fields defined in the Key to Nature metadata exchange agreement. Dublin Core is stored in the W3C schema (OAI) prescribed by Fedora; the k2n namespace items are stored as RDF ("Fedora RELS-EXT").

However, work on the MRTG schema is ongoing (see, e.g., MRTG Schema v0.8 as of 2009-09), and changes in the namespace and URIs of metadata items must be expected before the schema is finalized. In most cases the semantics will remain approximately the same, but the name will change. For example, the item "k2n:Metadata_Modified" may ultimately be known as xmp:MetadataDate.

FEDORA PIDs

Fedora requires a persistent object identifier, called a PID, for each object. A regular expression for valid PIDs is: ([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|\.|~|_|(%[0-9A-F]{2}))+ . This specifies that a PID must contain exactly one ":" and may not contain a slash ("/"). The PID part after the namespace colon may additionally contain underscores.

PIDs should further conform to the info:fedora URI scheme. Among other things, this means that any URI-escaped characters that do not need escaping according to the definition of the "info" scheme must be un-escaped, and that all remaining escaped octets use uppercase hex digits (%ff becomes %FF). These requirements are in addition to the regular expression above.
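
A minimal sketch of these rules in Python (the function names are ours; the 64-character cap is Fedora's documented PID length limit):

 import re

 # Fedora's PID grammar as given above: a namespace, a single colon,
 # then an object id that may include %-escaped octets.
 PID_RE = re.compile(
     r"^([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|\.|~|_|(%[0-9A-F]{2}))+$")

 def normalize_pid(pid):
     """Uppercase escaped octets (%ff -> %FF) and un-escape octets that
     represent unreserved characters (A-Z a-z 0-9 - . _ ~)."""
     pid = re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), pid)
     def unescape(m):
         ch = chr(int(m.group(0)[1:], 16))
         return ch if re.fullmatch(r"[A-Za-z0-9.\-_~]", ch) else m.group(0)
     return re.sub(r"%[0-9A-F]{2}", unescape, pid)

 def is_valid_pid(pid):
     return len(pid) <= 64 and PID_RE.match(pid) is not None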

FEDORA PIDs for metadata aggregation

Much of the harvested data does not provide a reliable, persistent, and globally unique identifier. Many PIDs therefore have to be generated by an algorithm, and for the purpose of metadata aggregation the term "persistent" PID has to be considered misleading.

Level 1: The PID is based on specific unique data where available, plus an MD5 hash of the entire media item record where no better ID is available (as is often the case). Details:

  • All PIDs are in the K2N namespace (prefixed with "K2N:")
  • All providers MUST give a homepage, and the PID will be derived from the homepage URI by removing "http://" and replacing slashes with underscores. All characters should be lower-cased; the fact that same-spelling, different-case paths may exist for different providers is considered of theoretical relevance only.
  • All collections MUST give a homepage. For the moment, the homepage of our own wiki-collected objects is always a wiki page. Media objects in wiki-based collections refer to their collection by wiki page name only. However, this name corresponds to a URI, and Fedora should always store the URI and use it for PID generation as above. When ingesting from the wiki TemplateParameterIndex, a wiki collection page name can be algorithmically converted to a URI (see BaseURL).
    • Lia Veja: Until now, objects in Fedora have been gathered from two sources: batch ingest and the MediaWiki template. In the first case, if a Resource ID metadata item exists, the PID could rely on it. For the second case, the observation above related to PID length still applies. I considered these two cases as PID candidates when I proposed a PID based on an MD5 hash code.
    • GiselaWeber: I understand the description above to mean that the ResourceID of a non-wiki-ingested collection has to be a URI. Regarding the length, consider the MD5 hash code as Lia proposed.
    • Applying the MD5 hash to the ResourceID alone might not ensure a globally unique PID.
  • Collection member objects (but not collections inside a collection) get (short code for resource type) + "_" + item-XML-record-MD5-hash (see the sketch after this list). The codes are "sd" for sound, "im" for image, "mv" for movie, "tp" for TaxonPage, "id" for identification tool (all lower case).
    • Note: We thought about combining the collection and media MD5 hashes, but with each hash being 32 characters and PIDs limited to about 60 characters, this is not possible.
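
A minimal sketch of the Level-1 PID rules above, in Python; the function and dictionary names are ours:

 import hashlib

 # Short resource-type codes from the rules above.
 TYPE_CODES = {"sound": "sd", "image": "im", "movie": "mv",
               "taxonpage": "tp", "identification tool": "id"}

 def provider_pid(homepage_uri):
     """Provider PID: homepage URI without "http://", slashes replaced
     by underscores, everything lower-cased, in the K2N namespace."""
     path = homepage_uri.lower()
     if path.startswith("http://"):
         path = path[len("http://"):]
     return "K2N:" + path.rstrip("/").replace("/", "_")

 def member_pid(resource_type, item_xml_record):
     """Collection-member PID: type code + "_" + MD5 hash (32 hex
     characters) of the item's full XML metadata record."""
     digest = hashlib.md5(item_xml_record.encode("utf-8")).hexdigest()
     return "K2N:" + TYPE_CODES[resource_type.lower()] + "_" + digest

 # e.g. provider_pid("http://www.keytonature.eu/") == "K2N:www.keytonature.eu"
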
Further thoughts: It may be desirable to maintain the identity of provider and collection objects over time, even where we cannot maintain the PIDs of media items. Both provider and collection are required to come with good candidates for globally unique identifiers (homepage URI, collection ResourceID, or collection wiki page). It may be that the MD5 method is not appropriate here at all?

Level 2: To increase the persistence of PIDs, we may decide to base the PID selectively on candidate ID items instead of on the entire metadata record. One problem with this is that for some items the URL is a perfect ID, while for others it is not (where the URL refers to a portal, browser application, or login service); unfortunately, these cases are indistinguishable. Similarly, the Resource ID may be a perfect numeric GUID, a URI/URN, or just a locally valid sequential number. This will have to be worked out.

    • Lia Veja: A priority-based algorithm for PIDs may be suitable. This algorithm could be a function of the metadata aggregation source and of the existence of certain metadata items, with a priority scale (a sketch follows this list).
    • GiselaWeber: I agree with this. This is certainly necessary for cases where collections are updated sometimes via wiki and sometimes by non-wiki methods. It may also be considered that the item-XML-record-MD5-hash can be based on the ResourceID of the member object alone, if present, and searched with priority by the algorithm, so that a single collection member object can be updated by a non-wiki method.
    • LiaVeja: This proposal has been demonstrated not to be reliable. Basing the PID on the metadata resource content diminishes PID persistence: one little change in the metadata content will change the PID. A calculation based on the ResourceID alone is not enough to ensure a unique PID; the ResourceID would have to be globally unique at the repository level.
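
A sketch of such a priority-based choice, following Gisela's suggestion to hash the ResourceID alone where one exists. It reuses TYPE_CODES and member_pid() from the Level-1 sketch above and the looks_globally_unique() test sketched under "Updating media items and collections" below; the priority order is a proposal for discussion, not an agreed rule:

 import hashlib

 def choose_pid(resource_type, resource_id, item_xml_record):
     if resource_id and looks_globally_unique(resource_id):
         # Hash the stable ResourceID rather than the volatile full
         # record, so later metadata edits do not change the PID.
         digest = hashlib.md5(resource_id.encode("utf-8")).hexdigest()
         return "K2N:" + TYPE_CODES[resource_type.lower()] + "_" + digest
     # Fall back to the content hash of the entire metadata record.
     return member_pid(resource_type, item_xml_record)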

Updating media items and collections

Over time, the aggregated metadata need to be updated, expanded, or removed. This process is complicated by the fact that requiring a globally unique ID for each media item would be a high barrier to participation for many potential providers. We encourage providers to give us IDs but do not require them. Unfortunately, the URL given for a media item is often, but not always, a stable ID; the most noteworthy exceptions are URLs pointing only to a login portal, or to a media browser start location where the item is accessible only after further human interaction.

In the absence of reliable media IDs, update management is complicated. We plan the following algorithm to manage updates:

For data coming from the wiki-push mechanism (where we have no "remove/delete resource commands") we plan the following:

  • A media collection verification service that runs periodically (weekly?) to verify the existence of each collection resource. Removing a wiki collection page, or an attached file containing collection data, will trigger removal of the collection from Fedora. Care must be taken not to misinterpret server down-times or timeouts as triggers for the removal of a collection. Details to be discussed...
  • An addition and update service that always removes and re-ingests entire collections when updating the metadata of a collection.
    1. Retrieve the PID of the Fedora object based on the collection URI / wiki collection page of the incoming data. If found, remove this collection and re-ingest. Do not compare the ResourceID.
    2. If that fails, try to retrieve the PID of the Fedora object using the combination of provider page/URI and the ResourceID of the collection object. If a PID is found, proceed as above.
    3. If that fails, one may optionally test whether the ResourceID is a globally unique identifier. A recommended test is to check whether it is a URI (http- or urn-based), or whether it is likely to have the complexity of a 16-byte GUID (e.g. 3F2504E0-4F89-11D3-9A0C-0305E82C3301). The GUID condition may be simplified to: trim both sides, strip whitespace and hyphens, and assume a complex ID if more than 31 characters remain (a sketch of this test follows the list). If a PID is found, delete and re-ingest.
    4. If all of the above fail, remove nothing: assume the collection is new.
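
A sketch of the globally-unique-ID test from step 3, with the simplified GUID condition:

 import re

 def looks_globally_unique(resource_id):
     """Accept http/urn URIs, or IDs with the complexity of a 16-byte
     GUID: more than 31 characters remaining after trimming and
     stripping whitespace and hyphens."""
     rid = resource_id.strip()
     if re.match(r"(?i)^(https?://|urn:)", rid):
         return True
     stripped = re.sub(r"[\s\-]", "", rid)
     return len(stripped) > 31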

Note that the delete-and-recreate cycle applies to all members of the collection, but not to the collection object itself. The collection object may also have been updated, but it will always keep its identity.

For data coming in through OAI harvesting, if a resource provides a globally unique ID in the item "Resource ID", we plan to be able to update single resources. However, this requires an agreement with providers to also inform us about resources to be removed from the index. The OAI protocol for metadata harvesting provides for such functionality. Since most partners are not yet ready to support this, we have given this functionality a lower priority.


Interaction between Ingest and Wiki TemplateParameterIndex

The TemplateParameterIndex offers the following methods:

  • Return all records of a specified template name (e.g. template=Metadata) present on pages that have been created or updated in a specified date-time period, defined through "from" (date-time in XML format, e.g. from=2009-01-25T16:30:01) and "to" (date-time in XML format; if "to" is missing, all records up to the current time are returned)
    • The parameter "parameternames=BaseURL;Collection Page;Resource ID;etc." allows specifying which parameters should be returned. This will greatly speed up this first step.
  • Return all records of a specified template name (e.g. template=Metadata) where a specified parameter name (e.g. parameter1=Collection+Page) contains a specified value (e.g. value1=XYZ).
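
A sketch of how the first method could be queried; the endpoint URL is a placeholder, and the parameter spellings follow the description above:

 from urllib.parse import urlencode

 INDEX_URL = "http://example.org/TemplateParameterIndex"  # hypothetical endpoint

 def updated_records_query(template, from_dt, to_dt=None, parameternames=None):
     """Build the query for all records of `template` changed in a
     period, optionally restricting which parameters are returned."""
     params = {"template": template, "from": from_dt}
     if to_dt:
         params["to"] = to_dt          # omit to get everything until now
     if parameternames:
         params["parameternames"] = ";".join(parameternames)
     return INDEX_URL + "?" + urlencode(params)

 # e.g. updated_records_query("Metadata", "2009-01-25T16:30:01",
 #          parameternames=["BaseURL", "Collection Page", "Resource ID"])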

The Fedora ingest service

  • first queries the webservice for any Metadata and Infobox Organisation templates in a specified time period;
  • extracts the resource collections from this, saving the information about the update date-time (an optimization can be based on this; see transaction handling);
  • performs its internal handling of collections to be removed;
  • queries the webservice for all records, one query per affected collection;
  • ingests each collection.

Note that the TemplateParameterIndex is capable of harvesting more than one wiki (it handles both internal harvesting and API-based harvesting of external wikis). The index therefore provides a field "BaseURL" which, combined with the field "page name", provides a valid URL to the wiki page on which the template was found.

Transaction handling

The ingestion service should support:

  1. an automatic re-harvesting at a relatively small time interval (good responsiveness)
  2. a harvesting service at any time requested by an external uploading agent (more or less immediately)
  3. long mass ingestions, possibly lasting several hours of processing.

Finding a compromise between the first and the last requirement is difficult, and the second requirement may lead to concurrent harvesting requests. Some amount of locking/transaction handling is therefore necessary.

We plan to define two time intervals: a) HarvestingInterval (typically 1 to 5 minutes) and b) MaximumIngestDuration (24 hours). The service will attempt to harvest updates from known providers (especially the TemplateParameterIndex) every HarvestingInterval. If a previous ingest is still running, it will skip the next harvesting, unless the duration of the previous ingest exceeds MaximumIngestDuration (in which case an error or crash is assumed). MaximumIngestDuration must be chosen such that under normal server load the largest possible update can safely be processed.

Two implementation options are possible: a) a single service running continually, providing a messaging interface for external initialization and checking internally for the parameters; or b) every harvesting starts its own process, and signaling occurs via external logs.

The second option is believed to be more resilient against rare crashes of the service, provided that communication is implemented in a reliable manner. The following proposal describes a communication algorithm using a transactional MySQL database, guaranteeing harvesting without intervals of skipped data even in the case of server crashes:

  1. The ingestion process starts by analyzing a "watchdog" table for previous entries (see below). The watchdog table contains HarvestedSource (usually a URL) and the time the process started, plus, if available, the process ID of the harvesting process.
    • If an entry is present, and the difference to the current date-time is less than MaximumIngestDuration, the process terminates.
    • If MaximumIngestDuration is exceeded, the watchdog entry is removed (and, if implemented, the previous process is killed based on ProcessID stored in watchdog record).
  2. A new ingestion is started by querying sources for updates since the last successful update stored in a "last_success" table.
    • "Last_success" stores HarvestedSource (URL), When (date-time), and Collection (string/URL). "Last_success" manages two kinds of records: the last successful request for updates (with the Collection field empty) and the last successful re-ingestion of each collection (with the collection specified by name).
  3. A first watchdog entry is created (see above) (no entry is written into the success table yet).
  4. Collections affected by these updates are analyzed. For each collection:
    • Identify the most recent update date-time of any record within the collection. This would be simplified if the TemplateParameterIndex could order its return value by modification date descending; shall we require this functionality?
    • Compare the collection-specific modification date against the successfully updated collection-specific records in "Last_success". If "Last_success" is later than the update date-time, no action is performed for this collection. Note: this will occur only in the case that multiple very large collections are affected, one of which causes a problem (timeout or other). The result would be that the process does not finish, and the entire update would be processed again. Collection-specific bookkeeping avoids an endless loop of timeouts in cases where many collections are updated at the same time.
    • For each collection that does need updating, all records are removed from FEDORA. After successful removal of the old data, the watchdog entry is updated to the current time (doing this for each collection, and between removal and re-ingest, simplifies the estimation of MaximumIngestDuration and allows keeping it relatively low).
    • Each updated collection is queried, and ingested into Fedora.
    • For each collection, after success, a collection-specific date-time is written into "last_success".
  5. After all collections have finished:
    • The watchdog entry is deleted
    • A new general success date-time (the collection field remains empty) is written into the "last_success" table (this date-time being the time up to which updates were queried, kept in memory, not the time of success).
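
A sketch of the watchdog handling (steps 1 and 3) against the transactional MySQL database; the table and column names are assumptions following the description above:

 import os
 from datetime import datetime, timedelta

 MAX_INGEST_DURATION = timedelta(hours=24)   # MaximumIngestDuration

 def try_start_ingest(conn, source):
     """Refuse to start while a previous ingest of this source is within
     MaximumIngestDuration; replace stale entries (assumed crashes), then
     write our own watchdog entry. `conn` is a DB-API connection to MySQL;
     the watchdog(harvested_source, started, process_id) layout is assumed."""
     cur = conn.cursor()
     cur.execute("SELECT started FROM watchdog WHERE harvested_source=%s",
                 (source,))
     row = cur.fetchone()
     if row is not None:
         if datetime.utcnow() - row[0] < MAX_INGEST_DURATION:
             return False   # previous ingest presumably still running
         # Stale entry: assume an error or crash, remove it.
         cur.execute("DELETE FROM watchdog WHERE harvested_source=%s",
                     (source,))
     cur.execute("INSERT INTO watchdog (harvested_source, started, process_id)"
                 " VALUES (%s, %s, %s)",
                 (source, datetime.utcnow(), os.getpid()))
     conn.commit()
     return True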

Special handling of Template: Key Start

A set of templates enables wiki authors to create single-access identification keys (also known as dichotomous or polytomous keys) as used in biology, in a manner that is easily readable and printable, but also re-usable by other software.

The interaction between the TemplateParameterIndex and the Key Start template has to be handled a little differently by the ingest tool.

  • Metadata will be harvested by querying the TemplateParameterIndex with the Key Start template parameter instead of the Metadata template parameter.
  • As with Metadata, a wiki syntax clean-up of the content is desirable (convert italics '' to <i>, bold ''' to <b>, etc.; see the sketch after this list)
  • A translation from the Key Start parameter schema to the K2N metadata schema will be necessary in the parse phase.
  • Some other metadata are implied and must be added. For details, see Mapping between Wiki Key Start template records and Fedora
  • There is no provider and no collection provided for these metadata. With respect to the Content Model Architecture of the digital object structure in the K2N Fedora Commons Repository, we might consider the following:
    • the provider for this kind of Identification Tool digital object to be the wiki URL that the Parameter Index tool reports, e.g. http://www.keytonature.eu or http://www.offene-naturfuehrer.de.
    • collections could be defined automatically ??? Gisela, Gregor?
      Gregor: Could Collection be also simply set to the wiki itself, i.e. Provider = Collection? Or a separate page?
      Lia: It would be the same wiki page reported for Collection and Provider.
  • further, these will be ingested (with or without collection) as regular digital objects in the K2N Fedora Commons Repository
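
A minimal sketch of the wiki syntax clean-up mentioned in the list above; real content will need more rules (links, templates, etc.):

 import re

 def clean_wiki_syntax(text):
     """Convert the most common wiki markup to HTML tags."""
     text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)   # bold first
     text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)     # then italics
     return text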

Notice: If the key is in an incomplete state, as during ongoing editing work, the TemplateParameterIndex tool will report it many times as an updated wiki page. This situation becomes unpleasant only if we bind all these keys into a collection: the ingest tool workflow then requires purging and re-ingesting the updated collection each time.

Gregor: I don't understand the question above.
Lia: I hope this is clearer now.

Items generated during Ingest

The ingestion service will add an additional metadata item, HarvestingURL, storing the URL of the source from which the data are harvested. Items or collections will be removed from the aggregation service if the HarvestingURL no longer returns data over a longer period.

  • Lia Veja (UTCN): If this HarvestingURL is added to the rest of the metadata and ingested into the Fedora repository, an external service will be necessary that periodically reads all collections and items looking for this metadata item: some kind of "garbage collector".
  • For the third level of the ingest service, this garbage collector will be implemented.

Controlled Vocabularies and Validators

For metadata related to Template:Metadata, some elements (Type, Subtype, Offline Use, Interactivity, Host Application, Target System, ID Tool Structure, Exchange Formats, Country Codes, Languages) could have fixed values implemented by means of a controlled vocabulary. This controlled vocabulary implementation requires an XML file attached to each metadata element whose values are to be controlled. All the XML vocabularies are hosted in a single folder, specified in the general configuration file. Other validations relate to multiple-valued metadata elements, mandatory metadata elements (Best_Quality_URI is required), and more than one comma in a regular phrase.
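
A sketch of such a vocabulary check; the one-file-per-element layout follows the description above, but the <value> element name inside the vocabulary files is an assumption:

 import os
 import xml.etree.ElementTree as ET

 def load_vocabulary(vocab_dir, element):
     """Read the allowed values for one controlled metadata element
     from its XML file in the configured vocabulary folder."""
     tree = ET.parse(os.path.join(vocab_dir, element + ".xml"))
     return {v.text.strip() for v in tree.iter("value") if v.text}

 def validate_element(element, value, vocab_dir):
     """Return a list of error messages for one metadata element."""
     errors = []
     if value not in load_vocabulary(vocab_dir, element):
         errors.append("%s: '%s' not in controlled vocabulary"
                       % (element, value))
     return errors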

Metadata Aggregation Report

During the validation phase, a metadata aggregation report will be generated. The errors and warnings collected in the previous step (Definition_specs#Controlled_Vocabularies_and_Validators) are transformed into MediaWiki syntax. A MediaWiki page is then generated, categorized, and linked to the collection page in MediaWiki to which it belongs.

Appendix

Mapping between Wiki Provider records and Fedora

Whereas most metadata relate to Template:Metadata, provider information comes from Infobox Organisation. Thus a special mapping needs to be specified.

Map for matches between provider information and specific metadata:

  • name to dc:title or k2n:Description, depending on abbreviation (see below).
  • abbreviation to dc:title, if an abbreviation exists
  • logo to k2n:Attribution Logo URL (after resolving the wiki syntax into an image-retrievable URL)
  • type to k2n:Description or k2n:Caption (not to dc:type, which is fixed: <dc:type>Provider</dc:type>)
  • foundation to be ignored
  • location_country to k2n:Country Names
  • location_city to k2n:City or Place Name
  • homepage to k2n:Page Context URI
  • BaseURL and PageName together form k2n:Best_Quality_URI (mandatory for Resource Metadata Exchange Agreement compliance)


Do not fill the Service_Attribution_URI; a provider has no provider.
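
The conditional part of this mapping is easiest to state as code. A sketch, with the fixed field pairs as a plain lookup table (names transcribed from the list above):

 # Unconditional field pairs from the provider mapping above.
 INFOBOX_TO_K2N = {
     "logo": "k2n:Attribution Logo URL",
     "location_country": "k2n:Country Names",
     "location_city": "k2n:City or Place Name",
     "homepage": "k2n:Page Context URI",
 }

 def map_provider_title(record):
     """name goes to dc:title unless an abbreviation exists; then the
     abbreviation becomes dc:title and the name goes to k2n:Description."""
     if record.get("abbreviation"):
         return {"dc:title": record["abbreviation"],
                 "k2n:Description": record["name"]}
     return {"dc:title": record["name"]}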


Mapping between Wiki Key Start template records and Fedora

A special treatment will be carried out by the ingest tool for Key Start template records. The mapping between Key Start information and specific K2N metadata (see also Key to Nature metadata fields) should be as follows:

  • id: (see below under k2n:Best Quality URI)
  • title to dc:title
  • language to dc:language
  • category to k2n:General Keywords
  • geoscope to k2n:World Region or Locality
  • audience to k2n:Audience
  • description to k2n:Description
  • source to k2n:Published Source
  • creators to k2n:Metadata Creator
    • Gisela: shouldn't this be Creator, not Metadata Creator for K2N?
  • collaboration limited to: PRESENTLY IGNORE
  • status: PRESENTLY IGNORE
  • initiated by to k2n:Contributors
  • edited by to k2n:Contributors
  • general review by to k2n:Reviewer Names
  • nomreview by to k2n:Reviewer Names
  • expert review by to k2n:Reviewer Names
  • commonnames to k2n:Common Names
  • parent key to k2n:Taxonomic Coverage
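
A direct transcription of this mapping into a lookup table; the id field is handled separately (see k2n:Best Quality URI below), and parameters marked PRESENTLY IGNORE are omitted:

 KEY_START_TO_K2N = {
     "title": "dc:title",
     "language": "dc:language",
     "category": "k2n:General Keywords",
     "geoscope": "k2n:World Region or Locality",
     "audience": "k2n:Audience",
     "description": "k2n:Description",
     "source": "k2n:Published Source",
     "creators": "k2n:Metadata Creator",
     "initiated by": "k2n:Contributors",
     "edited by": "k2n:Contributors",
     "general review by": "k2n:Reviewer Names",
     "nomreview by": "k2n:Reviewer Names",
     "expert review by": "k2n:Reviewer Names",
     "commonnames": "k2n:Common Names",
     "parent key": "k2n:Taxonomic Coverage",
 }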

To map the content to Key to Nature metadata, the following fields should be filled automatically:

  • dc:type = Identification Tool
  • dc:language = if language is not provided, the ingest tool will by default use the wiki page language for the key. In the first stage, this will be provided by an external source ("en" for the http://www.keytonature.eu/wiki/ BaseURL and "de" for http://www.offene-naturfuehrer.de/wiki/). It would be nice to generalize this, but it seems difficult to obtain the default language of wikis through the wiki API.
  • k2n:Host Application = Web Browser
  • k2n:Interactivity = Dynamic
  • k2n:ID Tool Structure = polytomous
  • k2n:Copyright Statement = Copyright reserved by the contributing authors
  • k2n:License Statement = Licensed under Creative Commons cc-by-sa 3.0
  • k2n:License URL = http://creativecommons.org/licenses/by-sa/3.0/
  • k2n:Best Quality Availability = online (free)
  • k2n:Best Quality URI = (the URL of the current wiki page; if the id field is present, add "#" and the id; see the sketch after this list)
  • k2n:Resource ID = identical to Best Quality URI above
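
A sketch of the Best Quality URI rule; applied to the example below it reproduces the Best_Quality_URI shown there:

 def best_quality_uri(base_url, page_name, key_id=None):
     """URL of the current wiki page; append "#" + id when present."""
     uri = base_url.rstrip("/") + "/" + page_name
     return uri + "#" + key_id if key_id else uri

 # best_quality_uri("http://www.offene-naturfuehrer.de/wiki/",
 #                  "Thalictrum_(Wiesenraute)_in_Mitteleuropa_(Ralf_Hand)",
 #                  "Thalictrum")
 # -> the Best_Quality_URI shown in the translated example below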

Example: Consider the following source record as harvested:

<Key_Start>
 <BaseURL>http://www.offene-naturfuehrer.de/wiki/</BaseURL> 
 <ScriptURL>http://www.offene-naturfuehrer.de/w/</ScriptURL> 
 <MediaURL>http://www.species-id.net/o/media</MediaURL> 
 <PageName>Thalictrum_(Wiesenraute)_in_Mitteleuropa_(Ralf_Hand)</PageName> 
 <Harvested>2010-01-25 00:07:43</Harvested> 
 <Modified>2010-01-24 00:12:39</Modified> 
 <TemplateName>Key Start</TemplateName> 
 <Attachment>n</Attachment> 
 <id>Thalictrum</id> 
 <title>Thalictrum in Mitteleuropa</title> 
 <creators>Ralf Hand</creators> 
 <description>"*" = siehe separate Schlüssel für Unterarten</description> 
 <geoscope>Mitteleuropa</geoscope> 
 <audience>Experten, Interessierte</audience> 
 <collaboration_limited_to>Ralf Hand</collaboration_limited_to> 
 <category>Flora</category> 
 <commonnames>Wiesenraute</commonnames> 
 <parent_key>Ranuculaceae</parent_key> 
</Key_Start>

This metadata will be translated as follows:

<Key_Start>
 <BaseURL>http://www.offene-naturfuehrer.de/wiki/</BaseURL> 
 <ScriptURL>http://www.offene-naturfuehrer.de/w/</ScriptURL> 
 <MediaURL>http://www.species-id.net/o/media</MediaURL> 
 <PageName>Thalictrum_(Wiesenraute)_in_Mitteleuropa_(Ralf_Hand)</PageName> 
 <Harvested>2010-01-25 00:07:43</Harvested> 
 <Modified>2010-01-24 00:12:39</Modified> 
 <TemplateName>Key Start</TemplateName> 
 <Resource_ID>Thalictrum</Resource_ID> 
 <Title>Thalictrum in Mitteleuropa</Title> 
 <Metadata_Creator>Ralf Hand</Metadata_Creator> 
 <Description>"*" = siehe separate Schlüssel für Unterarten</Description> 
 <World_Region>Mitteleuropa</World_Region> 
 <Audience>Experten, Interessierte</Audience> 
 <Creators>Ralf Hand</Creators> 
 <Subject_Category>Flora</Subject_Category> 
 <Common_Names>Wiesenraute</Common_Names> 
 <Taxonomic_Coverage>Ranuculaceae</Taxonomic_Coverage>
 <Host_Application>Web Browser</Host_Application>
 <Interactivity>Dynamic</Interactivity>
 <ID_Tool_Structure>polytomous</ID_Tool_Structure>
 <Copyright_Statement>Copyright reserved by the contributing authors</Copyright_Statement>
 <License_Statement>licensed under Creative Commons cc-by-sa 3.0</License_Statement>
 <License_URL>http://creativecommons.org/licenses/by-sa/3.0/</License_URL>
 <Best_Quality_Availability>online (free)</Best_Quality_Availability>
 <Best_Quality_URI>http://www.offene-naturfuehrer.de
     /wiki/Thalictrum_(Wiesenraute)_in_Mitteleuropa_(Ralf_Hand)#Thalictrum</Best_Quality_URI>
</Key_Start>

Notices for the further third-level implementation

During the tests, several other desirable features became apparent.

  • For massive ingests, say over 10,000 digital objects, if the collection is already ingested in the repository, the collection removal and the object preparation for ingest could be done at the same time, in separate threads.
  • For the ingest, a watchdog table could be very necessary. Every job will have a separate record for each task as a restart point. If the ingest process is interrupted, the ingest will be resumed from this restart point, in order to avoid unnecessary re-ingestion of a large number of objects.


(Return to MediaWiki_based_ingest_tool)