Batch importing media files into MediaWiki

From Biowikifarm Metawiki
Revision as of 23:32, 23 June 2010

Note: This is about importing binary files like images. See also: Mediawiki XML page importing (i.e. import text for pages)

Importing the binary data

#!/bin/bash
# USAGE: php importImages.php [options] <dir>
# options:
# --comment=<text>      Set upload summary comment, default 'Importing image file'
# --comment-file=<file> Set upload summary comment to the content of <file>
# --dry                 Dry run, don't import anything
# --overwrite           Overwrite existing images with the same name (default is to skip them)
# --user=<username>     Set username of uploader, default 'Maintenance script'
# ... some more options
#######################################
# run php ./maintenance/importImages.php as user www-data 
#   with images from /var/www/v-species/o/atmp_OR_adump and log it to
#   /var/www/v-species/o/aaaimport.log
#######################################
cd /var/www/v-species/o
# continue the command with \
sudo -u www-data php ./maintenance/importImages.php \
 --conf ./LocalSettings.php \
 --comment="{{Provider XX ZZZNAME}} 
{{Collection XX ZZZNAME}} 
{{Metadata 
 | Type  = StillImage 
 | Title         =  
 | Description   = {{de|1= }}{{en|1=}} 
 | Locality      =  
 | Identified By = 
 | Subject Sex   = female 
 | Scientific Names =  
 | Common Names     =  
 | Language         = zxx 
 | Creators         = XX ZZZNAME 
 | Subject Category = Amphibia 
 | General Keywords = 
}}" \
 --user="XX ZZZNAME" /var/www/v-species/o/atmp_OR_adump > /var/www/v-species/o/aaaimport.log
# ...
cd /var/www/v-species/o
# ...
php ./maintenance/runJobs.php --conf ./LocalSettings.php

If no comment is given, MediaWiki inserts the default comment "Importing image file". If no user is given, the default user is "Maintenance script".

The comment can contain real line breaks, since the text is inside double quotes. Using \n may or may not work; real line breaks are preferred.
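
The usage text above also lists a --comment-file option, so a long template comment can be kept in a file instead of inside the quotes. The following is a minimal sketch, assuming the option takes the file content as the upload summary as the usage text describes. The temporary file name is arbitrary, the template is a shortened copy of the placeholder above, and the import call is only shown as a comment, not executed.

```bash
#!/bin/bash
# Sketch: store a multi-line upload comment in a file for --comment-file.
# The quoted 'EOF' keeps the text literal, including real line breaks.
commentFile=$(mktemp)
cat > "$commentFile" <<'EOF'
{{Provider XX ZZZNAME}}
{{Collection XX ZZZNAME}}
{{Metadata
 | Type     = StillImage
 | Language = zxx
}}
EOF

# The import call (not executed here) would then look like:
#   sudo -u www-data php ./maintenance/importImages.php \
#     --conf ./LocalSettings.php --comment-file="$commentFile" \
#     --user="XX ZZZNAME" /var/www/v-species/o/atmp_OR_adump
commentLines=$(wc -l < "$commentFile" | tr -d ' ')
echo "comment file has $commentLines lines"
```

Keeping the comment in a file avoids quoting problems when the template text itself contains double quotes.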

To reduce server load, add the sleep option, which sets the time in seconds between files, e.g. --sleep=2.

It is advisable to redirect the output to a file (here > aaaimport.log) so that import success can be checked carefully. To monitor progress, watch the size of the output file (from a second terminal or WinSCP): if it stops growing, the import has stopped.
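
The size check from a second terminal could be scripted roughly as follows. This is only a sketch: the function name log_growth and the default interval of 5 seconds are made up for illustration.

```bash
#!/bin/bash
# Sketch: print by how many bytes a log file grew during a waiting interval.
# Usage: log_growth <logfile> [interval-seconds]   (hypothetical helper)
log_growth() {
  local logfile="$1"
  local interval="${2:-5}"
  local before after
  before=$(wc -c < "$logfile")   # size in bytes before waiting
  sleep "$interval"
  after=$(wc -c < "$logfile")    # size in bytes after waiting
  echo $((after - before))
}
```

For example, log_growth /var/www/v-species/o/aaaimport.log 10 prints 0 if nothing was written to the log within ten seconds, which suggests that the import has finished or stalled.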

For user names, the name counts, not the ID. If the name does not exist yet, it will be treated as an IP number...

By default, the XML importing version of the web interface limits file sizes to around 1.4 MB. This can be changed by the server admin (or by you, in php.ini, with upload_max_filesize=); for larger imports and to prevent timeouts, use the command line interface described above.

Depending on the file names, an alternative approach is to write a for-loop that extracts parts of each file name into bash variables, which allows more flexible imports. This takes some scripting effort; the following snippets show the building blocks:

#!/bin/bash
#######################################
# general for-loop with simple substitution ${...} see below
for myfile in *.JPG; do
  # save a variable (no spaces around "=")
  fileName=${myfile%.*}
  fileExt=${myfile##*.}
  echo "$fileName" # print it to the terminal
  echo "$fileExt"
  # do something
done
# for-loop: words
for file in a b c; do
  echo "$file copied";
  # a copied
  # b copied
  # c copied
done
# for-loop: a sequence 001 002 003 etc.
for i in $(seq -w 1 150); do
  echo "$i"
done

#######################################
# some bash replaces/substitutions in general
# ${parameter/pattern/replacement} replaces the first match of pattern
# example: |-> remove
  # "#" removes the shortest match on the left side
  longPath='./hi/structure/file.ext'
  echo "${longPath#*/}" # gives hi/structure/file.ext
  # "##" instead removes the longest match
  echo "${longPath##*/}" # gives file.ext
# example:  remove <-|
  # "%" removes the shortest match on the right side
  echo "${longPath%/*}" # gives ./hi/structure
  # "%%" instead removes the longest match
  echo "${longPath%%/*}" # gives .

#######################################
# snippet for counting JPG-files in the current directory with output
nFiles=$(ls *.JPG | wc -l)    # number of lines = number of files
nCharMax=$(ls *.JPG | wc -L)  # length of the longest line (GNU wc)
nCharsNumber=$(echo "$nFiles" | wc -m) # characters in the count, including the newline
i=0 # start with zero
for file in *.JPG; do
  i=$((i + 1)) # add 1
  fileName=${file%.*}
  fileExt=${file##*.}
  printf "%${nCharMax}s %2.${nCharsNumber}d of %d\n" "$file" "$i" "$nFiles"
done
done
# might give:
#    DSC03678.JPG 0001 of 362
#    DSC03679.JPG 0002 of 362
#    ...
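
As an illustration of how such substitutions could feed the metadata template, here is a small sketch. It assumes a purely hypothetical naming scheme Genus_species_NNN.JPG; the function name scientific_name and the example file names are invented for this example and are not part of the import script.

```bash
#!/bin/bash
# Sketch for a hypothetical naming scheme: Genus_species_NNN.JPG
# Prints "Genus species" for a given file name or path.
scientific_name() {
  local base="${1##*/}"   # drop any directory part
  base="${base%.*}"       # drop the extension
  base="${base%_*}"       # drop the trailing _NNN counter
  echo "$base" | tr '_' ' '
}

# The loop runs over literal names, so no image files are needed to try it.
for f in Rana_temporaria_001.JPG Bufo_bufo_017.JPG; do
  echo "$f -> | Scientific Names = $(scientific_name "$f")"
done
```

The printed value could then be placed into a per-file comment, at the cost of running the import once per file or per group of files.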

Problem removing files from temporary import folder

A common problem after importing is that the files in the temporary import folder cannot be deleted. A normal "rm *" or "rm * -r" terminates with the message "Argument list too long". Typically, the combined length of all file names may not exceed 128 kB (this applies to all commands, not just rm), a limit that is easily exceeded by a few thousand images with file names suitable as wiki titles. Solution:

cd TheFolder; find . -type f -print0 | sudo xargs -0 rm

will do the job.
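
The behaviour can be tried safely in a scratch directory. The sketch below prints the argument length limit of the current system with getconf ARG_MAX and then deletes 500 demo files through find and xargs. The directory and file names are generated for the demonstration only.

```bash
#!/bin/bash
# Sketch: delete many files without expanding them on one command line.
getconf ARG_MAX          # prints this system's limit in bytes

demoDir=$(mktemp -d)     # scratch directory, not the real import folder
i=1
while [ "$i" -le 500 ]; do
  : > "$demoDir/long file name number $i.jpg"   # empty file, name with spaces
  i=$((i + 1))
done

# -print0 / -0 pass names separated by NUL bytes, so spaces are harmless;
# xargs splits the list into as many rm calls as needed.
find "$demoDir" -type f -print0 | xargs -0 rm --
remaining=$(find "$demoDir" -type f | wc -l | tr -d ' ')
echo "files left: $remaining"
rmdir "$demoDir"
```

Because xargs never builds a single oversized command line, this approach works regardless of how many files the folder contains.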

Writing file names in a folder under Windows to text file

  • Open command prompt (type: cmd in Windows Start button command box)
  • type: chcp 1252 to switch from DOS/OEM codepage 850 to ANSI
  • type: dir/w > 000.txt to write all file names to file 000.txt
  • edit and process using text editor

Batch renaming files in Linux

Often a prefix has to be added to or dropped from file names. To add the prefix "XXX_", use:

cd ../foto01; for i in *.jpg; do sudo mv -i "$i" "XXX_$i"; done

Should this have been executed twice, or should a prefix be removed, use rename with a Perl substitution expression (s/fromold/tonew/), for example:

sudo rename 's/XXX_XXX_/XXX_/' *.jpg
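
Where the Perl-based rename is not installed, the same repair can be sketched with plain shell substitution. The example below works in a scratch directory with two invented file names; on real data the loop would run in the image folder instead.

```bash
#!/bin/bash
# Sketch: strip one leading "XXX_" from names that start with "XXX_XXX_".
demoDir=$(mktemp -d)               # scratch directory for the demonstration
: > "$demoDir/XXX_XXX_a.jpg"       # prefixed twice by mistake
: > "$demoDir/XXX_b.jpg"           # prefixed once, must stay untouched

for f in "$demoDir"/XXX_XXX_*.jpg; do
  [ -e "$f" ] || continue          # skip if the glob matched nothing
  name="${f##*/}"                  # file name without the directory
  mv -i "$f" "$demoDir/${name#XXX_}"   # remove the first "XXX_" only
done
ls "$demoDir"
```

Restricting the glob to XXX_XXX_*.jpg ensures that files carrying the prefix only once are not renamed.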