Difference between revisions of "Batch importing media files into MediaWiki"

Revision as of 23:32, 23 June 2010

Note: This page is about importing binary files such as images. See also: MediaWiki XML page importing (i.e. importing the text of pages).

Importing the binary data

#!/bin/bash
# USAGE: php importImages.php [options] <dir>
# options:
# --comment=<text>      Set upload summary comment, default 'Importing image file'
# --comment-file=<file> Set upload summary comment to the content of <file>
# --dry                 Dry run, don't import anything
# --overwrite           Overwrite existing images with the same name (default is to skip them)
# --user=<username>     Set username of uploader, default 'Maintenance script'
# ... some more options
#######################################
# run php ./maintenance/importImages.php as user www-data 
#   with images from /var/www/v-species/o/atmp_OR_adump and log it to
#   /var/www/v-species/o/aaaimport.log
#######################################
cd /var/www/v-species/o
# continue the command with \
sudo -u www-data php ./maintenance/importImages.php \
 --conf ./LocalSettings.php \
 --comment="{{Provider XX ZZZNAME}} 
{{Collection XX ZZZNAME}} 
{{Metadata 
 | Type  = StillImage 
 | Title         =  
 | Description   = {{de|1= }}{{en|1=}} 
 | Locality      =  
 | Identified By = 
 | Subject Sex   = female 
 | Scientific Names =  
 | Common Names     =  
 | Language         = zxx 
 | Creators         = XX ZZZNAME 
 | Subject Category = Amphibia 
 | General Keywords = 
}}" \
 --user="XX ZZZNAME" /var/www/v-species/o/atmp_OR_adump > /var/www/v-species/o/aaaimport.log
 # ...
 cd /var/www/v-species/o
 # ...
 php ./maintenance/runJobs.php    --conf ./LocalSettings.php

If no comment is given, MediaWiki inserts the default comment "Importing image file". If no user is given, the default user is "Maintenance script".

The comment can contain REAL line breaks, since the text is inside double quotes. Using \n may or may not work; real line breaks are preferred.
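As a minimal sketch of the point about real line breaks: the comment can be built in a bash variable with a here-document and then handed over via --comment-file, which is listed in the options above. The template values and the temporary file are placeholders chosen for illustration, and the final command is only printed, not executed.

```shell
#!/bin/bash
# Build the multi-line upload comment; the here-document keeps real line breaks.
# The field values below are placeholders, not a required template.
comment=$(cat <<'EOF'
{{Metadata
 | Type  = StillImage
 | Subject Category = Amphibia
 | Language         = zxx
}}
EOF
)

# Store the comment in a temporary file (hypothetical location) for --comment-file.
commentFile=$(mktemp)
printf '%s\n' "$comment" > "$commentFile"

# Print the command instead of running it; <dir> stands for the image folder.
echo "php ./maintenance/importImages.php --comment-file=$commentFile --dry <dir>"
```

Keeping the comment in a file avoids quoting problems on the command line when the template itself contains double quotes.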

To reduce server load, one may add the sleep option, time in seconds between files, e.g. --sleep=2.

It is advisable to redirect the output to a file (here > aaaimport.log) so that import success can be checked carefully afterwards. To monitor progress, watch the size of the output file (from a second terminal or WinSCP): if it stops growing, the import has stopped.
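The size check can also be scripted. The following is a sketch only: log_is_growing is a hypothetical helper name, and the waiting interval is an assumption that has to be longer than any --sleep value and the time needed for the largest file.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the file grew during the waiting interval.
log_is_growing() {
  local logfile=$1
  local interval=${2:-10}   # seconds to wait between the two size checks
  local before after
  before=$(wc -c < "$logfile")
  sleep "$interval"
  after=$(wc -c < "$logfile")
  [ "$after" -gt "$before" ]
}

# Usage from a second terminal (log path as in the example above):
#   if log_is_growing /var/www/v-species/o/aaaimport.log 30; then
#     echo "still importing"
#   else
#     echo "no new output"
#   fi
```

A log that does not grow is only a hint that the import has stopped, so the interval should be chosen generously.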

The --user option expects the user name, not the user ID. If the name does not exist yet, it will be treated like an IP address...

By default, the XML importing version of the web interface limits file sizes to around 1.4 MB. This can be changed by the server admin (or by you in php.ini, via upload_max_filesize and post_max_size; MediaWiki's own $wgMaxUploadSize in LocalSettings.php may also apply). For larger imports, and to prevent timeouts, use the command line interface described above.

Depending on the file names, an alternative approach is to write a for-loop that extracts parts of each file name into bash variables, which allows more flexible imports. Writing such a script may look sophisticated, but the building blocks are simple:

#!/bin/bash
#######################################
# general for-loop with simple substitution ${...} see below
for myfile in *.JPG; do
  # save a variable
  fileName=${myfile%.*}   # everything before the last dot
  fileExt=${myfile##*.}   # everything after the last dot (no space allowed before =)
  echo "$fileName" # print it to the terminal
  echo "$fileExt"
  # do something
done
# for-loop: words
for file in a b c; do
  echo "$file copied";
  # a copied
  # b copied
  # c copied
done
# for-loop: a sequence 001 002 003 etc.
for i in $(seq --format=%003.f 1 150); do
  echo $i
done

#######################################
# some bash replaces/substitutions in general
# ${parameter#pattern}  removes the shortest match of pattern on the left  (## the longest)
# ${parameter%pattern}  removes the shortest match of pattern on the right (%% the longest)
# ${parameter/pattern/string} replaces the first match of pattern with string
# example: |-> remove
  # removes on the left side
  longPath='./hi/structure/file.ext'
  echo "${longPath#*/}"  # prints hi/structure/file.ext
  #  ## → instead takes the longest match (away)
  echo "${longPath##*/}" # prints file.ext
# example:  remove <-| 
  # removes on the right side
  echo "${longPath%/*}"  # prints ./hi/structure
  # %% → instead takes the longest match (away)
  echo "${longPath%%/*}" # prints .

#######################################
# snippet for counting JPG files in the current directory with progress output
nFiles=$(ls *.JPG | wc -l)   # number of lines = number of files
nCharMax=$(ls *.JPG | wc -L) # length of the longest file name (-L is a GNU wc option)
nCharsNumber=$(echo $nFiles | wc -m) # number of characters incl. newline, i.e. digits + 1
i=0 # start with zero
for file in *.JPG; do
  i=$((i + 1)) # add 1
  fileName=${file%.*}
  fileExt=${file##*.}
  # right-align the file name, zero-pad the counter
  printf "%${nCharMax}s %.${nCharsNumber}d of %d\n" "$file" "$i" "$nFiles"
done
# might give:
#    DSC03678.JPG 0001 of 362
#    DSC03679.JPG 0002 of 362
#    ...
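To illustrate how the substitutions above could feed a more flexible import, the following sketch derives part of the upload comment from a file name. It assumes a purely hypothetical naming scheme Genus_species_sex_number.ext; comment_from_name is an illustrative helper, not part of MediaWiki, and the sketch only builds the comment text.

```shell
#!/bin/bash
# Hypothetical naming scheme: Genus_species_sex_number.ext
comment_from_name() {
  local file=$1
  local base=${file%.*}            # strip the extension
  local genus species sex number
  # split the base name at underscores into its four parts
  IFS=_ read -r genus species sex number <<< "$base"
  printf '%s\n' "{{Metadata
 | Type  = StillImage
 | Subject Sex   = $sex
 | Scientific Names = $genus $species
}}"
}

# Example call with a made-up file name
comment_from_name "Rana_temporaria_female_001.JPG"
```

The resulting text could then be passed on with --comment or --comment-file, for example after sorting the files into one folder per comment.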

Problem removing files from temporary import folder

A common problem after importing is that the files in the temporary import folder cannot be deleted: a normal "rm *" or "rm -r *" terminates with the message "Argument list too long". The shell expands * into all file names before rm is started, and the kernel limits the total size of a command's arguments (this applies to all commands, not just rm). On older Linux kernels the limit is 128 kB, which is easily exceeded by a few thousand images with file names appropriate as wiki titles; getconf ARG_MAX shows the limit of the current system. Solution:

cd TheFolder && sudo find . -type f -print0 | sudo xargs -0 rm

will do the job.
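The approach can be tried safely in a throwaway directory first. This sketch assumes find and xargs support -print0 and -0 (true for the GNU and BSD versions); the file names are made up, one of them containing a space to show why the null-separated form is safer.

```shell
#!/bin/bash
# Create a throwaway directory with a few test files
tmpdir=$(mktemp -d)
touch "$tmpdir/plain.jpg" "$tmpdir/with space.jpg" "$tmpdir/third.jpg"

# Show the argument size limit of this system in bytes
getconf ARG_MAX

# find lists the files, xargs hands them to rm in batches below the limit
find "$tmpdir" -type f -print0 | xargs -0 rm --

# Count what is left, then remove the now empty directory
remaining=$(find "$tmpdir" -type f | wc -l)
echo "files left: $remaining"
rmdir "$tmpdir"
```

Because xargs splits the list into several rm calls, the total number of files no longer matters.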

Writing file names in a folder under Windows to text file

Open a command prompt (type cmd in the command box of the Windows Start button)
type: chcp 1252 to switch from DOS/OEM code page 850 to ANSI
type: dir /w > 000.txt to write all file names to file 000.txt (dir /b instead writes bare names, one per line)
edit and process the file using a text editor

Batch renaming files in Linux

Often a prefix has to be added to or dropped from file names. To add a prefix "XXX_" use:

cd ../foto01 && for i in *.jpg; do sudo mv -i "$i" "XXX_$i"; done

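Should this have been executed twice, or should a prefix be removed, use rename with a Perl substitution expression (s/fromold/tonew/). This requires the Perl-based rename, as shipped with Debian; the util-linux rename uses a different syntax. For example: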

sudo rename 's/XXX_XXX_/XXX_/' *.jpg
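Where the Perl-based rename is not available, the same repair can be sketched in plain bash with the ${parameter#pattern} substitution. The example runs in a throwaway directory with made-up file names; whether sudo is needed depends on the file permissions and is left out here.

```shell
#!/bin/bash
# Work in a throwaway directory with made-up file names
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch XXX_XXX_a.jpg XXX_b.jpg c.jpg

# Strip one leading "XXX_" from every file that carries the prefix twice
for f in XXX_XXX_*.jpg; do
  [ -e "$f" ] || continue        # skip if the pattern matched nothing
  mv -i -- "$f" "${f#XXX_}"      # ${f#XXX_} removes the first XXX_ on the left
done

ls
```

Files with a single prefix or none are left untouched, because the loop only matches names that start with XXX_XXX_.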
