Batch importing media files into MediaWiki
Note: This is about importing binary files such as images from the command line. See also: Mediawiki XML page importing (i.e. importing text for pages using an XML format).
Contents
- 1 Importing binary files manually (maintenance importImages.php)
- 2 Problem removing files from temporary import folder
- 3 Writing file names in a folder under Windows to text file
- 4 Batch renaming files in Linux
- 5 Change ownership from root to www-data
- 6 Preparing MetaData for image files from text files
Importing binary files manually (maintenance importImages.php)
It is important to set the owner of all import media files to www-data (sudo chown -R www-data:www-data ./my-import-media-folder). If the files are owned by root, no image scaling can be achieved.
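Before starting, one can quickly verify the ownership of the import folder; a minimal sketch, using the example path from the script below (adapt it to your setup):
# check who owns the files to be imported (should be www-data, not root)
ls -l /var/www/v-species/o/my-import-media-folder | head
# fix the ownership if necessary
sudo chown -R www-data:www-data /var/www/v-species/o/my-import-media-folder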
#!/bin/bash
# USAGE: php importImages.php [options] <dir>
# options:
#   --comment=<text>       Set upload summary comment, default 'Importing image file'
#   --comment-file=<file>  Set upload summary comment to the content of <file>
#   --comment-ext=<ext>    Set extension for comment file
#   --dry                  Dry run, don't import anything (*but create the page*)
#   --overwrite            Overwrite existing images with the same name (default is to skip them)
#   --user=<username>      Set username of uploader, default 'Maintenance script'
#   ... some more options
#######################################
# run php ./maintenance/importImages.php as user www-data
# with images from /var/www/v-species/o/my-import-media-folder and log it to
# /var/www/v-species/o/my-import-media.log
#######################################
# step 0: prepare your media and store them in a temporary folder
# in wiki openmedia (/var/www/v-species/o/my-import-media-folder)
# make sure to have informative file names, e.g. “what” and “from whom”:
# “Zygiobia carpini Loew, 1874 on Carpinus betulus (Michal Maňas, 2013).jpg”
# set the owner of all import media files to www-data
# cd /var/www/v-species/o/my-import-media-folder && sudo chown -R www-data:www-data ./
#######################################
# step 1: go to wiki openmedia (root)
# hint in bash: \ means to continue the command line over multiple lines
#######################################
cd /var/www/v-species/o
sudo -u www-data php ./maintenance/importImages.php \
  --conf ./LocalSettings.php \
  --comment="{{Provider XX ZZZNAME}}
{{Collection XX ZZZNAME}}
{{Metadata
| Type = StillImage
| Title =
| Description = {{Metadata Description de|1= }}{{en|1=}}
| Locality =
| Identified By =
| Subject Sex = female
| Scientific Names =
| Common Names =
| Language = zxx
| Creators = XX ZZZNAME
| Subject Category = Amphibia
| General Keywords =
}}" \
  --user="XX ZZZNAME" \
  /var/www/v-species/o/my-import-media-folder > /var/www/v-species/o/my-import-media.log

# update indices / job queue
cd /var/www/v-species/o
# optionally rebuild the links and indices used for searching your site
# sudo -u www-data php ./maintenance/rebuildall.php --dbuser wikiadmin --conf ./LocalSettings.php
# manually force the job queue to run
# (sudo -u www-data AND --dbuser wikiadmin CORRECT??)
sudo -u www-data php ./maintenance/runJobs.php --dbuser wikiadmin --conf ./LocalSettings.php --procs=3
If no comment is added, the default comment "Importing image file" will be inserted by MediaWiki. If no user is given, the default user is "Maintenance script".
The comment can contain real line breaks, since the text is inside double quotes, but the comment is inserted on the page only the first time, at the initial import. You cannot overwrite page content via importImages.php --overwrite; in that case you must create an XML import file (see Mediawiki XML page importing).
To reduce server load, one may add the --sleep option (time in seconds to wait between files), e.g. --sleep=2.
It is advisable to redirect the output to a file (here > my-import-media.log) in order to check the import success carefully. To follow the script's progress, one can watch the size of the output file (open a second ssh shell) to see whether the import has stopped.
For user names, the name counts, not the ID. If the user name does not exist yet, it will be imported nevertheless and stored like an IP number.
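A minimal sketch combining the points above (throttling, logging, watching progress from a second shell); the paths, user name and sleep value are only examples taken from the script above:
# throttled import: wait 2 seconds between files and log everything
sudo -u www-data php ./maintenance/importImages.php --conf ./LocalSettings.php \
  --user="XX ZZZNAME" --sleep=2 \
  /var/www/v-species/o/my-import-media-folder > /var/www/v-species/o/my-import-media.log
# in a second ssh shell: watch the log file grow to see whether the import is still running
watch -n 10 'ls -lh /var/www/v-species/o/my-import-media.log'
# or follow the log output directly
tail -f /var/www/v-species/o/my-import-media.log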
Import large files (e.g. ZIP)
# import ALL zip files in folder /tmp/DiversityGazetteer_010013/
cd /var/www/v-species/o # the wiki path (here OpenMedia http://species-id.net/openmedia/)
# HELP
# php ./maintenance/importImages.php --help --conf
# https://www.mediawiki.org/wiki/Manual:ImportImages.php
# --comment="..." becomes the wiki text on the page
# --user="..." a valid user name, see the page Special:UserList of the wiki
# --extensions= comma-separated list of allowable extensions, defaults to $wgFileExtensions as defined in the wiki's settings
sudo -u www-data php ./maintenance/importImages.php --conf ./LocalSettings.php --comment="
{{Metadata
| Type = Dataset
| Title = Dataset DiversityGazetteer
| Description = DiversityGazetteer is a tool to visualize places from a DiversityGazetteer database within a geographical environment.
| Locality =
| Identified By =
| Subject Sex =
| Scientific Names =
| Common Names =
| Language = zxx
| Creators =
| Subject Category =
| General Keywords =
}}" --user="A Correct User Name" --extensions=zip /tmp/DiversityGazetteer_010013/
Problem removing files from temporary import folder
A common problem after importing is that the files in the temporary import folder cannot be deleted the usual way. A plain "rm *" or "rm -r *" terminates with the message Argument list too long: the combined length of all expanded file names may not exceed roughly 128 kB (this limit applies to all commands, not just rm), and 128 kB are easily exceeded by a few thousand images with file names suitable as wiki titles. Solution:
# remove non-interactively
cd TheFolder; sudo find . -maxdepth 1 -name '*.?*' -exec rm {} ';'
# interactively, confirming each file with “rm -i”
cd TheFolder; sudo find . -name '*' -exec rm -i {} ';'
# with a pipe “|” and xargs rm
cd TheFolder; sudo find . -maxdepth 1 -name '*.?*' | xargs rm
# (note: plain xargs breaks on file names containing spaces, which wiki titles usually have;
#  safer: sudo find . -maxdepth 1 -name '*.?*' -print0 | xargs -0 rm)
# check what would be deleted
find . -maxdepth 1 -name '*.?*' -exec echo {} ';'
# {}  → the found string
# ';' → final argument for -exec (by the way, + works too but is somewhat fragile, see the manual of find: «man find»)
Writing file names in a folder under Windows to text file
- Open a command prompt (type cmd in the Windows Start menu command box)
- Change directory, switch from DOS/OEM codepage 850 to ANSI (codepage 1252), write all file names to the file 000.txt:
cd current directory
chcp 1252
dir/w > 000.txt
- edit and process using a text editor (a Linux equivalent is sketched below)
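Under Linux, the same kind of plain file name list can be produced with a one-liner (000.txt is just the example file name from above):
# write one file name per line into 000.txt
ls -1 > 000.txt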
Batch renaming files in Linux
Often a prefix needs to be added to or dropped from file names. To add a prefix "XXX_" use:
cd ../foto01; for f in *.jpg; do sudo mv -i "$f" "XXX_$f"; done
Should this have been executed twice by accident, or should a prefix be removed, use rename (Perl utilities) with a Perl substitution expression (s/fromold/tonew/), like
sudo rename 's/XXX_XXX_/XXX_/' *.jpg
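To preview such a substitution before touching any files, the Perl rename also accepts a no-act option (a sketch; on some distributions the command is installed as prename):
# show which files would be renamed, without renaming anything
sudo rename -n 's/XXX_XXX_/XXX_/' *.jpg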
The rename from util-linux-ng has no such Perl substitution mechanism, but only simple string replacements:
rename oldstring newstring whichFiles
rename .htm .html *.htm
rename image image0 image??   # → image001, image002, ...
But the same substitution can be done with a small bash script running in a for loop and using a pipe (“|”) to sed:
################################
# replace all white space characters ' ' with underscore characters '_'
# list all *.jpg but replace all ' ' with '|' → save in $i
# (the '|' in sed is a place holder for later replacement)
for i in `ls *.jpg | sed 's/ /|/g'` ; do
  # save in $old the old file name
  old=`echo $i | sed 's/|/ /g'`
  # save in $new the new file name with '_'
  new=`echo "$i" | sed 's/|/_/g'`
  mv --force "$old" "$new"
done
# The original file name should not contain a “|” character, otherwise it is replaced too.
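The same space-to-underscore renaming can also be written without sed, using bash parameter expansion; a minimal alternative sketch:
# replace every space in the file name with an underscore using ${f// /_}
for f in *.jpg; do
  new="${f// /_}"
  # skip files whose name contains no space
  [ "$f" = "$new" ] || mv --force -- "$f" "$new"
done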
Change ownership from root to www-data
find /var/www/v-species/*/media/thumb/ -maxdepth 3 -user root -name '*' -exec sudo chown -R www-data:www-data '{}' ';'
find /var/www/v-species/*/*/media/thumb/ -maxdepth 3 -user root -name '*' -exec sudo chown -R www-data:www-data '{}' ';'
find /var/www/v-species/*/*/*/media/thumb/ -maxdepth 3 -user root -name '*' -exec sudo chown -R www-data:www-data '{}' ';'
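To check beforehand which thumbnail files are still owned by root (and would therefore block image scaling), a quick look with find, assuming the same directory layout:
# list a few thumbnail files that are still owned by root
find /var/www/v-species/*/media/thumb/ -maxdepth 3 -user root -ls | head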
Preparing MetaData for image files from text files
#!/bin/bash
# concatenate several files to a prepared MediaWiki-Import
# This has to be manually corrected and assumes iso-8859-1 source files
# assume files with metadata (here e.g. file.meta); the extension is defined in filterExtension
filterExtension="*.meta"
sourceEncoding="iso-8859-1"
targetEncoding="utf-8"
userName="WikiSysop"
comment="Bot generated metadata update"
xmlWriteToFile="allmetadata_utf8.xml"
nFiles=`ls ${filterExtension} | wc -l` # number of all files
# some info
echo "Conactenate ${nFiles} files as MediaWiki Import into ${xmlWriteToFile}…"
# write the header to ${xmlWriteToFile}
cat > ${xmlWriteToFile} <<HEADER
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.5/ http://www.mediawiki.org/xml/export-0.5.xsd" version="0.5" xml:lang="en">
<siteinfo>
<sitename>OpenMedia</sitename>
<base>http://species-id.net/openmedia/Main_Page</base>
<generator>MediaWiki 1.18.0</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="-1" case="first-letter">Special</namespace>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Talk</namespace>
<namespace key="2" case="first-letter">User</namespace>
<namespace key="3" case="first-letter">User talk</namespace>
<namespace key="4" case="first-letter">OpenMedia</namespace>
<namespace key="5" case="first-letter">OpenMedia talk</namespace>
<namespace key="6" case="first-letter">File</namespace>
<namespace key="7" case="first-letter">File talk</namespace>
<namespace key="8" case="first-letter">MediaWiki</namespace>
<namespace key="9" case="first-letter">MediaWiki talk</namespace>
<namespace key="10" case="first-letter">Template</namespace>
<namespace key="11" case="first-letter">Template talk</namespace>
<namespace key="12" case="first-letter">Help</namespace>
<namespace key="13" case="first-letter">Help talk</namespace>
<namespace key="14" case="first-letter">Category</namespace>
<namespace key="15" case="first-letter">Category talk</namespace>
<namespace key="102" case="first-letter">Property</namespace>
<namespace key="103" case="first-letter">Property talk</namespace>
<namespace key="106" case="first-letter">Form</namespace>
<namespace key="107" case="first-letter">Form talk</namespace>
<namespace key="108" case="first-letter">Concept</namespace>
<namespace key="109" case="first-letter">Concept talk</namespace>
<namespace key="170" case="first-letter">Filter</namespace>
<namespace key="171" case="first-letter">Filter talk</namespace>
<namespace key="198" case="first-letter">Internal</namespace>
<namespace key="199" case="first-letter">Internal talk</namespace>
<namespace key="200" case="first-letter">Portal</namespace>
<namespace key="201" case="first-letter">Portal talk</namespace>
<namespace key="202" case="first-letter">Bibliography</namespace>
<namespace key="203" case="first-letter">Bibliography talk</namespace>
<namespace key="204" case="first-letter">Draft</namespace>
<namespace key="205" case="first-letter">Draft talk</namespace>
<namespace key="206" case="first-letter">Submission</namespace>
<namespace key="207" case="first-letter">Submission talk</namespace>
<namespace key="208" case="first-letter">Reviewed</namespace>
<namespace key="209" case="first-letter">Reviewed talk</namespace>
<namespace key="274" case="first-letter">Widget</namespace>
<namespace key="275" case="first-letter">Widget talk</namespace>
</namespaces>
</siteinfo>
HEADER
progressInfo="."
nFile=0
for metafile in ${filterExtension}; do
echo "<page><title>File:${metafile}</title>" >> ${xmlWriteToFile}
date=`date +'%Y-%m-%dT%H:%M:%SZ'` # 2011-12-19T10:43:11Z
echo "<revision><timestamp>${date}</timestamp>" >> ${xmlWriteToFile}
echo "<contributor><username>${userName}</username></contributor><comment>${comment}</comment>" >> ${xmlWriteToFile}
text=`iconv -f ${sourceEncoding} -t ${targetEncoding} "${metafile}"`
#text=`cat "${metafile}.utf8"`
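# note: any '&', '<' or '>' in the wikitext must be XML-escaped before it is written into <text>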
echo "<text xml:space='preserve'>${text}</text>" >> ${xmlWriteToFile}
echo "</revision>" >> ${xmlWriteToFile}
echo "</page>" >> ${xmlWriteToFile}
nFile=$(( $nFile + 1 ))
# progress info 100 dots then line break with modulo
if [ $(( $nFile % 100 )) == 0 ]; then
echo "$progressInfo"
else
echo -n "$progressInfo"
fi
done
echo "</mediawiki>" >> ${xmlWriteToFile}
# some info
echo -e "\n … (done)"
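The generated XML file can then be imported like any other page dump, e.g. with the standard maintenance script importDump.php; a sketch, assuming the wiki root from the examples above and that the script was saved as prepare_metadata.sh (both the script name and the metadata folder are only placeholders):
# run the metadata script in the folder that contains the *.meta files
cd /var/www/v-species/o/my-import-media-folder
bash ./prepare_metadata.sh
# import the resulting XML into the wiki, then update recent changes
cd /var/www/v-species/o
sudo -u www-data php ./maintenance/importDump.php --conf ./LocalSettings.php < /var/www/v-species/o/my-import-media-folder/allmetadata_utf8.xml
sudo -u www-data php ./maintenance/rebuildrecentchanges.php --conf ./LocalSettings.php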