FEDORA Batch Import

Manual batch import test

Manual batch import is described in FEDORA_HOME/docs/userdocs/client/batch/batchtool.doc and at http://fedora.info/download/2.2.1/userdocs//client/batch/batchtool.doc. With fedora-admin it is possible to build a batch of digital objects, to ingest such a batch, or to do both in one step.

To create a batch of digital objects, a general template with the data common to all objects of the batch is needed. This must be a Fedora METS or a Fedora FOXML XML document. The object-specific substitutions go into separate XML documents. Demo files for mets-template, foxml-template (foxml-template.xml) and object-specifics (e.g. americanacademy.xml) are in FEDORA_HOME/client/demo/batch-demo.

For a simple test I made one copy of foxml-template.xml and 4 copies of one of the object-specific example documents. The example batch was intended to contain 4 objects with 3 datastreams each: "DC" (Dublin Core metadata, here dc:title and dc:identifier), "RELS-EXT" with metadata for the resource index (k2n:ScientificName, k2n:Country and k2n:Url) and "Image", an externally referenced image/jpeg file. There is also a disseminator with the same bDef and bMech for all objects. In this test I used the bDef "demo:27" and the bMech "demo:28", which belong to the demos delivered with Fedora and use the Java servlet "ImageManipulation", also delivered with Fedora.

In the copies of the example files, all datastream elements that are not needed can be deleted and the data for the remaining elements filled in. The content of the DC and RELS-EXT datastreams and the external link for the Image datastream are thus provided by each object-specific file.

The objects are created in fedora-admin with Tools -> Batch -> BuildBatch. In the window that opens, one has to enter the template file, an input directory containing all and only the object-specific files, an output directory that will hold all and only the created object files, and a freely chosen file path for the object processing map, an output file that maps the object-specific files to the objects built.

After successful building of the objects, they can be ingested with Tools -> Batch -> IngestBatch.
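
The directory layout that BuildBatch expects can be prepared with a small script before the object-specific copies are edited by hand. The following is a minimal sketch in Python, not part of the Fedora tools: the function name, the directory names "object-specs" and "objects" and the file names "object-N.xml" are assumptions chosen for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_batch_dirs(example_spec: Path, work_dir: Path, copies: int = 4):
    """Create an input directory holding only object-specific files
    and an empty output directory for the objects built from them.

    The directory and file names are hypothetical; BuildBatch only
    needs the two directories to be kept separate.
    """
    input_dir = work_dir / "object-specs"
    output_dir = work_dir / "objects"
    input_dir.mkdir(parents=True)
    output_dir.mkdir(parents=True)
    for n in range(1, copies + 1):
        # Each copy still has to be edited afterwards (DC, RELS-EXT, image link).
        shutil.copy(example_spec, input_dir / f"object-{n}.xml")
    return input_dir, output_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        # Stand-in for one of the demo object-specific files.
        example = tmp_path / "americanacademy.xml"
        example.write_text("<!-- placeholder for an object-specific file -->\n")
        in_dir, out_dir = prepare_batch_dirs(example, tmp_path / "batch")
        print(sorted(p.name for p in in_dir.iterdir()))
```

Keeping the template file outside the input directory matters here, because the input directory must contain all and only the object-specific files.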

Search tests

iTQL

The example ingested in this way is also searchable for the k2n metadata ScientificName, Country and Url in the Fedora Resource Index Query Service. After the same batch import was run for the example proposed in FEDORA Evaluation#Test driving Fedora, the following iTQL query, entered in the fedora/risearch FindTuples user interface:

select $subject $title $identifier $ScientificName $URL $Country
from <#ri>
where $subject <http://key2nature.eu/ns/test-rels-ext/ScientificName> $ScientificName
and $subject <http://key2nature.eu/ns/test-rels-ext/url> $URL
and $subject <dc:title> $title
and $subject <dc:identifier> $identifier
and $subject <http://key2nature.eu/ns/test-rels-ext/Country> $Country
and $Country <tucana:is> 'it'


gives the following result in the Sparql output format:

<sparql>

<head>
<variable name="subject"/>
<variable name="title"/>
<variable name="identifier"/>
<variable name="ScientificName"/>
<variable name="URL"/>
<variable name="Country"/>
</head>
<results>

...

<result>
<subject uri="info:fedora/demo:K2NBatchtest2"/>
<title>Lichtnelke</title>
<identifier>demo:K2NBatchtest2</identifier>
<ScientificName>Silene italica</ScientificName>
<URL>http://www.funghiitaliani.it/uploads/post-5-1141817448.jpg</URL>
<Country>it</Country>
</result>
<result>
<subject uri="info:fedora/demo:K2NBatchtest5"/>
<title>Melandrium</title>
<identifier>demo:K2NBatchtest5</identifier>
<ScientificName>Melandrium alba</ScientificName>
<URL>http://www.ckkaempfe.de/chr/2005-05-eifel/pd5241.jpg</URL>
<Country>it</Country>
</result>
</results>

</sparql>
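
A result of this shape can be turned into rows with Python's standard XML parser. The sketch below assumes the XML exactly as displayed above, i.e. without a namespace declaration; the raw output of an actual installation may declare a default namespace on the root element, in which case the element lookups would need to be namespace-qualified. The function name parse_results is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the result shown above (first result only).
SAMPLE = """<sparql>
<head>
<variable name="subject"/>
<variable name="title"/>
<variable name="identifier"/>
<variable name="ScientificName"/>
<variable name="URL"/>
<variable name="Country"/>
</head>
<results>
<result>
<subject uri="info:fedora/demo:K2NBatchtest2"/>
<title>Lichtnelke</title>
<identifier>demo:K2NBatchtest2</identifier>
<ScientificName>Silene italica</ScientificName>
<URL>http://www.funghiitaliani.it/uploads/post-5-1141817448.jpg</URL>
<Country>it</Country>
</result>
</results>
</sparql>"""


def parse_results(xml_text):
    """Return one dict per <result>, keyed by the variable names in <head>."""
    root = ET.fromstring(xml_text)
    names = [v.get("name") for v in root.find("head").findall("variable")]
    rows = []
    for result in root.find("results").findall("result"):
        row = {}
        for name in names:
            cell = result.find(name)
            if cell is None:
                row[name] = None
            else:
                # Resources carry their value in a "uri" attribute, literals in the text.
                row[name] = cell.get("uri") or cell.text
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in parse_results(SAMPLE):
        print(row["identifier"], row["ScientificName"], row["Country"])
```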


Example 1, with the title "Flowers of Silene italica", is not included in the result because it has no ScientificName element. It seems, however, that iTQL has no wildcards, so that a query for "k2n:ScientificName starts with 'GenusName'" might not be possible.
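
The iTQL query of this section can also be sent to the risearch service from a script instead of the FindTuples form. The following Python sketch only builds the request URL and does not contact a server. The base URL http://localhost:8080/fedora/risearch is an assumption, and the parameter names (type, lang, format, query) are given as remembered from the risearch service, so they should be checked against the documentation of the installed Fedora version.

```python
from urllib.parse import urlencode

ITQL_QUERY = """select $subject $title $identifier $ScientificName $URL $Country
from <#ri>
where $subject <http://key2nature.eu/ns/test-rels-ext/ScientificName> $ScientificName
and $subject <http://key2nature.eu/ns/test-rels-ext/url> $URL
and $subject <dc:title> $title
and $subject <dc:identifier> $identifier
and $subject <http://key2nature.eu/ns/test-rels-ext/Country> $Country
and $Country <tucana:is> 'it'"""


def risearch_url(query, lang="itql", fmt="Sparql",
                 base="http://localhost:8080/fedora/risearch"):
    """Build a tuple-query URL; base URL and parameter names are assumptions."""
    params = {
        "type": "tuples",   # tuple query, as in the FindTuples form
        "lang": lang,       # query language
        "format": fmt,      # output format
        "query": query,
    }
    # urlencode percent-encodes the query text, including "$", "<", ">" and newlines.
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(risearch_url(ITQL_QUERY))
```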

RDQL

The query language RDQL, on the other hand, can handle regular expressions, so that the following RDQL query:

select ?subject ?identifier  ?ScientificName  ?URL ?Country
from <#ri>
where (?subject <dc:identifier>  ?identifier ),
(?subject <http://key2nature.eu/ns/test-rels-ext/url>  ?URL),
(?subject <http://key2nature.eu/ns/test-rels-ext/Country> ?Country),
(?subject <http://key2nature.eu/ns/test-rels-ext/ScientificName>  ?ScientificName)
AND  ?ScientificName=~ /^Silene/


gives the result (in "Simple" format):

subject  : <info:fedora/demo:K2NBatchtest2>
identifier  : "demo:K2NBatchtest2"
ScientificName : "Silene italica"
URL  : "http://www.funghiitaliani.it/uploads/post-5-1141817448.jpg"
Country  : "it"

subject  : <info:fedora/demo:K2NBatchtest2>
identifier  : "demo:K2NBatchtest2"
ScientificName : "Silene italica"
URL  : "http://www.funghiitaliani.it/uploads/post-5-1141817448.jpg"
Country  : "ch"

subject  : <info:fedora/demo:K2NBatchtest4>
identifier  : "demo:K2NBatchtest4"
ScientificName : "Silene italica"
URL  : "http://flora.nhm-wien.ac.at/Bilder-Thumbnails/Silene-italica.jpg"
Country  : "de"
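
Output in the "Simple" format can be split into records with a few lines of Python. This is a minimal sketch that assumes the layout shown above: one "name : value" pair per line, records separated by blank lines, resources in angle brackets and literals in double quotes. The function name parse_simple is an assumption for illustration.

```python
import re

SIMPLE_OUTPUT = '''subject  : <info:fedora/demo:K2NBatchtest2>
identifier  : "demo:K2NBatchtest2"
ScientificName : "Silene italica"
URL  : "http://www.funghiitaliani.it/uploads/post-5-1141817448.jpg"
Country  : "it"

subject  : <info:fedora/demo:K2NBatchtest2>
identifier  : "demo:K2NBatchtest2"
ScientificName : "Silene italica"
URL  : "http://www.funghiitaliani.it/uploads/post-5-1141817448.jpg"
Country  : "ch"

subject  : <info:fedora/demo:K2NBatchtest4>
identifier  : "demo:K2NBatchtest4"
ScientificName : "Silene italica"
URL  : "http://flora.nhm-wien.ac.at/Bilder-Thumbnails/Silene-italica.jpg"
Country  : "de"
'''


def parse_simple(text):
    """Split "Simple" output into a list of dicts, one per record."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        record = {}
        for line in block.splitlines():
            # The separator is a colon with a space on both sides,
            # so the "://" inside URLs is not split.
            key, sep, value = line.partition(" : ")
            if not sep:
                continue
            value = value.strip()
            # Drop the surrounding <...> of resources and "..." of literals.
            if len(value) >= 2 and (value[0], value[-1]) in {("<", ">"), ('"', '"')}:
                value = value[1:-1]
            record[key.strip()] = value
        if record:
            records.append(record)
    return records


if __name__ == "__main__":
    for rec in parse_simple(SIMPLE_OUTPUT):
        print(rec["identifier"], rec["Country"])
```

Note that demo:K2NBatchtest2 appears once per Country value, so a consumer that wants one entry per object has to group the records by subject.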


If the line "(?subject <http://key2nature.eu/ns/test-rels-ext/Country> "it")," is added to the query, only demo:K2NBatchtest2 is returned. A combined filter expression is also possible. The query:

select ?subject ?identifier  ?ScientificName  ?URL ?Country ?title
from <#ri>
where (?subject <dc:identifier>  ?identifier ),
(?subject <dc:title>  ?title ),
(?subject <http://key2nature.eu/ns/test-rels-ext/url>  ?URL),
(?subject <http://key2nature.eu/ns/test-rels-ext/Country> ?Country),
(?subject <http://key2nature.eu/ns/test-rels-ext/ScientificName>  ?ScientificName)
AND ((?ScientificName=~ /^Silene/) || (?title=~ /silene/i))


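returns those objects whose ScientificName starts with "Silene" or whose title contains "silene" (case insensitive). All triple patterns in the where clause must match before the filter is evaluated, so an object without any ScientificName value is excluded even if its title matches. For such a query it is therefore necessary that the RELS-EXT of those objects without a ScientificName contains an empty tag: <k2n:ScientificName/>, otherwise they are not included in the result.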
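
The effect of the combined filter can be illustrated outside the query service. The sketch below applies the same two patterns with Python's re module to three title/ScientificName pairs taken from the examples on this page. It is only an illustration of the filter logic, not of the RDQL engine itself; the empty string stands for the empty <k2n:ScientificName/> tag.

```python
import re

# (title, ScientificName) pairs from the examples above;
# "" represents the empty <k2n:ScientificName/> tag of example 1.
OBJECTS = [
    ("Flowers of Silene italica", ""),
    ("Lichtnelke", "Silene italica"),
    ("Melandrium", "Melandrium alba"),
]


def matches(title, scientific_name):
    """Mirror of: (?ScientificName =~ /^Silene/) || (?title =~ /silene/i)"""
    starts_with_genus = re.search(r"^Silene", scientific_name) is not None
    title_mentions = re.search(r"silene", title, re.IGNORECASE) is not None
    return starts_with_genus or title_mentions


if __name__ == "__main__":
    for title, name in OBJECTS:
        print(title, "|", name or "(empty)", "|", matches(title, name))
```

The first object is selected through its title only, which is the case that requires the empty ScientificName tag in RELS-EXT.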

Improved Batch Import

In order to create a larger number of object-specific files for batch ingestion, one template for the object-specific file was created with placeholders for all values that differ between the individual objects. Currently, these values (metadata from the first metadata survey for secondary data) are stored in a database table. A Java program reads the values from the table and replaces the placeholders with the values of each object. The program can not only replace values inside XML elements that already exist in the template, but also write the XML elements themselves. This is necessary if a metadata field contains a list of values, so that several elements of the same type are needed. A metadata value can also be represented as an object-to-object relationship in the RELS-EXT datastream. For example, if type = StillImage, this can be expressed by the line

<fedora:isMemberOf rdf:resource="info:fedora/demo:StillImageCollection"/> 

in RELS-EXT.
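
The substitution step can be sketched as follows. This is not the Java program mentioned above but a short Python illustration of the same idea, limited to a RELS-EXT fragment. The placeholder names ($pid, $elements), the row layout, the function name and the sample values are assumptions; the namespace declarations for rdf, k2n and fedora are expected to come from the surrounding template.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical template fragment with two placeholders.
RELS_EXT_TEMPLATE = Template(
    '<rdf:Description rdf:about="info:fedora/$pid">\n'
    "$elements"
    "</rdf:Description>\n"
)

# Metadata values that are expressed as object-to-object relationships.
COLLECTIONS = {"StillImage": "info:fedora/demo:StillImageCollection"}


def rels_ext_fragment(row):
    """Write the elements for one object instead of filling fixed slots."""
    lines = [
        "  <k2n:ScientificName>%s</k2n:ScientificName>\n"
        % escape(row["scientific_name"])
    ]
    # A list-valued field yields one element per value.
    for country in row["countries"]:
        lines.append("  <k2n:Country>%s</k2n:Country>\n" % escape(country))
    collection = COLLECTIONS.get(row["type"])
    if collection:
        lines.append('  <fedora:isMemberOf rdf:resource="%s"/>\n' % collection)
    return RELS_EXT_TEMPLATE.substitute(pid=row["pid"], elements="".join(lines))


if __name__ == "__main__":
    # Stand-in for one row read from the database table.
    row = {
        "pid": "demo:K2NBatchtest2",
        "scientific_name": "Silene italica",
        "countries": ["it", "ch"],
        "type": "StillImage",
    }
    print(rels_ext_fragment(row))
```

Escaping the values before insertion keeps characters such as "&" or "<" in the table data from breaking the generated XML.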