FEDORA Batch Import

From Biowikifarm Metawiki
Since the disseminator is the same for all objects, there is no disseminator element in the object-specific files, only in the template (foxml-template Beispiel Ranunculus.xml):

<foxml:disseminator BDEF_CONTRACT_PID="demo:27" ID="DISS1" STATE="A" VERSIONABLE="true">
        <!-- Note: The value for createdDate is assigned dynamically by the Fedora server at ingest time if the CREATED attribute is not -->
        <!-- present on the component in the object being ingested. Therefore, it is recommended that this attribute be omitted from the -->
        <!-- object template so this date will be assigned dynamically by the Fedora server. If included in the template file, this value will be -->
        <!-- carried through to the built objects and will result in the specified static date being used at ingest time by the Fedora server -->
        <!-- rather than dynamically assigning createdDate at ingest time. -->
        <!-- -->
        <!-- Uncomment the following line if you want the DISS1 disseminator for all objects built with this template to retain the specified -->
        <!-- createdDate after ingest. -->
        <!-- <foxml:disseminatorVersion CREATED="2005-03-15T12:57:07.241Z" BMECH_SERVICE_PID="demo:2" ID="DISS1.0" LABEL="UVA Std Image Behaviors"> -->
        <foxml:disseminatorVersion BMECH_SERVICE_PID="demo:28" ID="DISS1.0" LABEL="Ranunculus disseminator">
            <foxml:serviceInputMap>
                <foxml:datastreamBinding DATASTREAM_ID="Image" KEY="url" LABEL="Binding to IMAGE" ORDER="0"/>
            </foxml:serviceInputMap>
        </foxml:disseminatorVersion>
</foxml:disseminator>

The content of the DC and RELS-EXT datastreams and the external link for the Image, on the other hand, are provided by each object-specific file (Ranunculus_angustifolius_Beispiel.xml).

Latest revision as of 14:19, 19 January 2010

Manual batch import test

Manual batch import is described in FEDORA_HOME/docs/userdocs/client/batch/batchtool.doc (also available at http://fedora.info/download/2.2.1/userdocs//client/batch/batchtool.doc). With fedora-admin it is possible to create a batch of digital objects, to ingest such a batch, or to do both in one step.

To create a batch of digital objects, a general template with the data common to all objects of the batch is needed. This must be a Fedora METS or a Fedora FOXML XML document. The object-specific substitutions have to be in separate XML documents. Demo files for the METS template, the FOXML template (foxml-template.xml) and the object-specific documents (e.g. americanacademy.xml) can be found in FEDORA_HOME/client/demo/batch-demo.

For a simple test I made a copy of foxml-template.xml and 4 copies of one of the example object-specific documents. The example batch was intended to contain 4 objects with 3 datastreams each: "DC" (Dublin Core metadata, here dc:title and dc:identifier), "RELS-EXT" with metadata for the resource index (k2n:ScientificName, k2n:Country and k2n:Url), and "Image" as an externally referenced image/jpeg file. There is also a disseminator with the same bDef and bMech for all objects. In this test I used the bDef "demo:27" and the bMech "demo:28", which belong to the demos delivered with Fedora and use the Java servlet "ImageManipulation", also delivered with Fedora.
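To make the RELS-EXT part more concrete, the following is a minimal Python sketch of what the RDF/XML content for one test object could look like. The k2n namespace URI is taken from the predicate URIs used in the queries further down this page. The function name build_rels_ext, the lowercase predicate name "url" (as used in the queries) and the choice of Python are assumptions made for illustration, not part of the original test.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Namespace taken from the predicate URIs used in the queries on this page.
K2N_NS = "http://key2nature.eu/ns/test-rels-ext/"

# Use readable prefixes instead of the auto-generated ns0, ns1, ...
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("k2n", K2N_NS)


def build_rels_ext(pid, scientific_name, country, url):
    """Return RELS-EXT RDF/XML (as a string) describing one object (hypothetical helper)."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{pid}"},
    )
    for local_name, value in (
        ("ScientificName", scientific_name),
        ("Country", country),
        ("url", url),
    ):
        ET.SubElement(description, f"{{{K2N_NS}}}{local_name}").text = value
    return ET.tostring(root, encoding="unicode")


rels_ext = build_rels_ext(
    "demo:K2NBatchtest2",
    "Silene italica",
    "it",
    "http://www.funghiitaliani.it/uploads/post-5-1141817448.jpg",
)
print(rels_ext)
```

In the batch workflow this content would be placed in the RELS-EXT datastream section of each object-specific file; the sketch only shows the RDF itself.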

In the copies of the example files, one can delete all datastream elements that are not needed and fill in the data for the remaining ones. For example, the content of the DC and RELS-EXT datastreams and the external link for the Image are provided by each object-specific file.

The objects are created with fedora-admin Tools -> Batch -> BuildBatch. In the window that opens, one has to enter the template file, an input directory containing all and only the object-specific files, an output directory to hold all and only the created object files, and a freely chosen file path for the object processing map (an output file that maps object-specs to the objects built).

After successful building of the objects, they can be ingested with Tools -> Batch -> IngestBatch.
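Because the input directory must contain "all and only" the object-specific files and the output directory only the built objects, a small check before starting BuildBatch can save a failed run. The following Python sketch is not part of fedora-admin; the function name preflight and the two rules it applies (every input file is well-formed XML, the output directory is empty) are assumptions about what a useful check might look like.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def preflight(input_dir, output_dir):
    """Return a list of problems; an empty list means both directories look usable."""
    problems = []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            problems.append(f"not a regular file: {path.name}")
            continue
        try:
            ET.parse(path)  # only checks well-formedness, not the Fedora schema
        except ET.ParseError as exc:
            problems.append(f"not well-formed XML: {path.name} ({exc})")
    if any(Path(output_dir).iterdir()):
        problems.append("output directory is not empty")
    return problems


# Demonstration with throwaway directories and made-up file names.
with tempfile.TemporaryDirectory() as in_dir, tempfile.TemporaryDirectory() as out_dir:
    Path(in_dir, "object1.xml").write_text("<object/>", encoding="utf-8")
    Path(in_dir, "notes.txt").write_text("not xml", encoding="utf-8")
    demo_problems = preflight(in_dir, out_dir)

print(demo_problems)
```

In the demonstration only the stray notes.txt is reported, since object1.xml parses and the output directory is empty.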

Search tests

iTQL

The example ingested in this way can also be searched by the k2n metadata ScientificName, Country and Url in the Fedora Resource Index Query Service. After the same batch import as described above was done for the example proposed in FEDORA Evaluation#Test driving Fedora, the following iTQL query in the fedora/risearch FindTuples user interface:

select $subject $title $identifier $ScientificName $URL $Country
from <#ri>
where $subject <http://key2nature.eu/ns/test-rels-ext/ScientificName> $ScientificName
and $subject <http://key2nature.eu/ns/test-rels-ext/url> $URL
and $subject <dc:title> $title
and $subject <dc:identifier> $identifier
and $subject <http://key2nature.eu/ns/test-rels-ext/Country> $Country
and $Country <tucana:is> 'it'


gives the following result in the Sparql result format:

<sparql>

<head>
<variable name="subject"/>
<variable name="title"/>
<variable name="identifier"/>
<variable name="ScientificName"/>
<variable name="URL"/>
<variable name="Country"/>
</head>
<results>

...

<result>
<subject uri="info:fedora/demo:K2NBatchtest2"/>
<title>Lichtnelke</title>
<identifier>demo:K2NBatchtest2</identifier>
<ScientificName>Silene italica</ScientificName>
<URL>http://www.funghiitaliani.it/uploads/post-5-1141817448.jpg</URL>
<Country>it</Country>
</result>
<result>
<subject uri="info:fedora/demo:K2NBatchtest5"/>
<title>Melandrium</title>
<identifier>demo:K2NBatchtest5</identifier>
<ScientificName>Melandrium alba</ScientificName>
<URL>http://www.ckkaempfe.de/chr/2005-05-eifel/pd5241.jpg</URL>
<Country>it</Country>
</result>
</results>

</sparql>


Example 1, with the title "Flowers of Silene italica", is not included in the result because it has no ScientificName element. It seems, however, that there are no wildcards in iTQL, so a query such as "k2n:ScientificName starts with 'GenusName'" might not be possible.
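The same kind of query can presumably also be sent to the risearch service over HTTP instead of through the FindTuples form. The Python sketch below only builds the request URL and parses a result document. The base URL http://localhost:8080/fedora/risearch, the parameter names (type, lang, format, query) and the helper names are assumptions and should be checked against the local installation. The sample result is a shortened copy of the output shown above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed location of the risearch service of a local Fedora installation.
RISEARCH = "http://localhost:8080/fedora/risearch"

ITQL = (
    "select $subject $ScientificName from <#ri> "
    "where $subject <http://key2nature.eu/ns/test-rels-ext/ScientificName> $ScientificName"
)


def risearch_url(query, lang="itql", result_format="Sparql"):
    """Build a tuple-query URL (parameter names are assumptions, see above)."""
    params = {"type": "tuples", "lang": lang, "format": result_format, "query": query}
    return RISEARCH + "?" + urlencode(params)


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def result_rows(sparql_xml):
    """Turn each <result> element into a dict of variable name -> value."""
    rows = []
    for element in ET.fromstring(sparql_xml).iter():
        if local_name(element.tag) != "result":
            continue
        row = {}
        for binding in element:
            # Resources carry a uri attribute, literals carry text.
            row[local_name(binding.tag)] = binding.get("uri") or (binding.text or "")
        rows.append(row)
    return rows


SAMPLE = """<sparql><results><result>
<subject uri="info:fedora/demo:K2NBatchtest2"/>
<title>Lichtnelke</title>
<ScientificName>Silene italica</ScientificName>
<Country>it</Country>
</result></results></sparql>"""

url = risearch_url(ITQL)
rows = result_rows(SAMPLE)
print(url)
print(rows)
```

The URL could then be fetched with any HTTP client; the sketch stops before the network call so that it runs without a Fedora server.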

RDQL

On the other hand, the query language RDQL can handle regular expressions, so that the following RDQL query:

select ?subject ?identifier  ?ScientificName  ?URL ?Country
from <#ri>
where (?subject <dc:identifier>  ?identifier ),
(?subject <http://key2nature.eu/ns/test-rels-ext/url>  ?URL),
(?subject <http://key2nature.eu/ns/test-rels-ext/Country> ?Country),
(?subject <http://key2nature.eu/ns/test-rels-ext/ScientificName>  ?ScientificName)
AND  ?ScientificName=~ /^Silene/


gives the result (in "Simple" format):

subject  : <info:fedora/demo:K2NBatchtest2>
identifier  : "demo:K2NBatchtest2"
ScientificName : "Silene italica"
URL  : "http://www.funghiitaliani.it/uploads/post-5-1141817448.jpg"
Country  : "it"

subject  : <info:fedora/demo:K2NBatchtest2>
identifier  : "demo:K2NBatchtest2"
ScientificName : "Silene italica"
URL  : "http://www.funghiitaliani.it/uploads/post-5-1141817448.jpg"
Country  : "ch"

subject  : <info:fedora/demo:K2NBatchtest4>
identifier  : "demo:K2NBatchtest4"
ScientificName : "Silene italica"
URL  : "http://flora.nhm-wien.ac.at/Bilder-Thumbnails/Silene-italica.jpg"
Country  : "de"


If the line "(?subject <http://key2nature.eu/ns/test-rels-ext/Country> "it")," is added to the query, only demo:K2NBatchtest2 is returned. A combined filter expression is also possible. The query:

select ?subject ?identifier  ?ScientificName  ?URL ?Country ?title
from <#ri>
where (?subject <dc:identifier>  ?identifier ),
(?subject <dc:title>  ?title ),
(?subject <http://key2nature.eu/ns/test-rels-ext/url>  ?URL),
(?subject <http://key2nature.eu/ns/test-rels-ext/Country> ?Country),
(?subject <http://key2nature.eu/ns/test-rels-ext/ScientificName>  ?ScientificName)
AND ((?ScientificName=~ /^Silene/) || (?title=~ /silene/i))


returns those objects whose ScientificName starts with "Silene" or whose title contains "silene" (case insensitive). For such a query it is necessary that the RELS-EXT of those objects without a ScientificName contains an empty tag: <k2n:ScientificName/>, otherwise they are not included in the result.
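The reason for the empty tag is that every triple pattern in the where clause must match before the filter is evaluated at all. The following Python sketch imitates this behaviour on a tiny in-memory triple list. It is a simplified model, not the resource index implementation: it allows only one value per predicate and subject, and the identifier demo:K2NBatchtest1 for the object titled "Flowers of Silene italica" is an assumption.

```python
import re

K2N = "http://key2nature.eu/ns/test-rels-ext/"

# (subject, predicate, object) triples; the first object has no ScientificName.
TRIPLES = [
    ("demo:K2NBatchtest1", "dc:title", "Flowers of Silene italica"),
    ("demo:K2NBatchtest2", "dc:title", "Lichtnelke"),
    ("demo:K2NBatchtest2", K2N + "ScientificName", "Silene italica"),
]


def select(triples):
    """Return subjects matching both patterns and the combined filter."""
    titles = {s: o for s, p, o in triples if p == "dc:title"}
    names = {s: o for s, p, o in triples if p == K2N + "ScientificName"}
    hits = []
    # Only subjects that have BOTH predicates survive the join.
    for subject in sorted(titles.keys() & names.keys()):
        starts_with_silene = re.search(r"^Silene", names[subject])
        title_mentions_silene = re.search(r"silene", titles[subject], re.I)
        if starts_with_silene or title_mentions_silene:
            hits.append(subject)
    return hits


without_empty_tag = select(TRIPLES)
with_empty_tag = select(TRIPLES + [("demo:K2NBatchtest1", K2N + "ScientificName", "")])
print(without_empty_tag)
print(with_empty_tag)
```

Without the empty ScientificName value the first object never reaches the filter, although its title would match; with the empty value it is found through the title condition.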

Improved Batch Import

In order to create a larger number of object-specific files for batch ingestion, one template object-specific file was created with placeholders for all values that are specific to the individual objects. Currently, these values (metadata from the first metadata survey for secondary data) are stored in a database table. A Java program reads the values from the table and replaces the placeholders with the values for each object. The program can not only replace values inside XML elements that already exist in the template, but can also write the XML elements themselves. This is necessary if a metadata field contains a list of values, so that several elements of the same type are needed. A metadata value can also be represented as an object-to-object relationship in the RELS-EXT datastream. For example, type = StillImage can be expressed by the line

<fedora:isMemberOf rdf:resource="info:fedora/demo:StillImageCollection"/> 

in RELS-EXT.
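The substitution step can be sketched as follows. This is not the Java program mentioned above but a short Python illustration of the same idea. The placeholder names, the template fragment and the row layout are assumptions, and an in-memory dictionary stands in for the database table. The two Country values for demo:K2NBatchtest2 follow the RDQL result shown earlier.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical template fragment; $-placeholders mark the object-specific parts.
TEMPLATE = Template(
    """<rdf:Description rdf:about="info:fedora/$pid">
$scientific_names
$countries
$membership
</rdf:Description>"""
)


def elements(tag, values):
    """Write one XML element per value, so list-valued fields repeat the element."""
    return "\n".join(f"  <{tag}>{escape(value)}</{tag}>" for value in values)


def fill(row):
    """Replace the placeholders with the values of one row (one object)."""
    membership = ""
    if row.get("type") == "StillImage":
        # A metadata value expressed as an object-to-object relationship.
        membership = (
            '  <fedora:isMemberOf '
            'rdf:resource="info:fedora/demo:StillImageCollection"/>'
        )
    return TEMPLATE.substitute(
        pid=escape(row["pid"], {'"': "&quot;"}),
        scientific_names=elements("k2n:ScientificName", row["scientific_names"]),
        countries=elements("k2n:Country", row["countries"]),
        membership=membership,
    )


row = {
    "pid": "demo:K2NBatchtest2",
    "type": "StillImage",
    "scientific_names": ["Silene italica"],
    "countries": ["it", "ch"],
}
filled = fill(row)
print(filled)
```

Generating the elements in code rather than in the template is what allows a field with several values, such as Country here, to produce several elements of the same type.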