Batch importing media files into MediaWiki
Revision as of 11:57, 4 February 2015
Note: This page is about importing binary files such as images from the command line. See also: MediaWiki XML page importing (i.e. importing the text of pages using an XML format).
Contents
- 1 Importing binary files manually (maintenance importImages.php)
- 2 Problem removing files from temporary import folder
- 3 Writing file names in a folder under Windows to text file
- 4 Batch renaming files in Linux
- 5 Change ownership from root to www-data
- 6 Preparing MetaData for image files from text files
Importing binary files manually (maintenance importImages.php)
It is important to set the owner of all media files to be imported to www-data (sudo chown -R www-data:www-data ./my-import-media-folder). If the files are owned by root, MediaWiki cannot scale the images.
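The ownership requirement can be checked before starting the import. The following is a minimal sketch, not part of the original procedure; the function name list_wrong_owner is hypothetical, and the default owner www-data is taken from this page.

```shell
#!/bin/bash
# Sketch (hypothetical helper): list every entry under a folder that is NOT
# owned by the expected user. Prints nothing when all owners are correct.
# Usage: list_wrong_owner <folder> [<owner>]   (owner defaults to www-data)
list_wrong_owner() {
    local dir="$1" owner="${2:-www-data}"
    # "! -user" negates the ownership test; a numeric user ID works as well
    find "$dir" ! -user "$owner" -print
}

# Example with the folder used on this page:
# list_wrong_owner /var/www/v-species/o/my-import-media-folder
```

If the function prints any paths, run the chown command above before importing.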
#!/bin/bash
# USAGE: php importImages.php [options] <dir>
# options:
#   --comment=<text>       Set upload summary comment, default 'Importing image file'
#   --comment-file=<file>  Set upload summary comment to the content of <file>
#   --comment-ext=<ext>    Set extension for comment file
#   --dry                  Dry run, don't import anything (*but create the page*)
#   --overwrite            Overwrite existing images with the same name (default is to skip them)
#   --user=<username>      Set username of uploader, default 'Maintenance script'
#   ... some more options
#######################################
# run php ./maintenance/importImages.php as user www-data
# with images from /var/www/v-species/o/my-import-media-folder and log it to
# /var/www/v-species/o/my-import-media.log
#######################################
# step 0: prepare your media and store them into a temporary folder
#   in wiki openmedia (/var/www/v-species/o/my-import-media-folder)
#   make sure to have informative file names, e.g. “what” and “from whom”
#   “Zygiobia carpini Loew, 1874 on Carpinus betulus (Michal Maňas, 2013).jpg”
#   set owner of all import media files to www-data
#   cd /var/www/v-species/o/my-import-media-folder && sudo chown -R www-data:www-data ./
#######################################
# step 1: go to wiki openmedia (root)
# hint in bash: \ means to continue the command line over multiple lines
#######################################
cd /var/www/v-species/o
sudo -u www-data php ./maintenance/importImages.php --conf ./LocalSettings.php \
  --comment="{{Provider XX ZZZNAME}}
{{Collection XX ZZZNAME}}
{{Metadata
| Type = StillImage
| Title =
| Description = {{de|1= }}{{en|1=}}
| Locality =
| Identified By =
| Subject Sex = female
| Scientific Names =
| Common Names =
| Language = zxx
| Creators = XX ZZZNAME
| Subject Category = Amphibia
| General Keywords =
}}" \
  --user="XX ZZZNAME" \
  /var/www/v-species/o/my-import-media-folder > /var/www/v-species/o/my-import-media.log
# update indices / job queue
cd /var/www/v-species/o
# optionally rebuild the links and indices used for searching your site
# sudo -u www-data php ./maintenance/rebuildall.php --dbuser wikiadmin --conf ./LocalSettings.php
# manually force the job queue to run
# (sudo -u www-data AND --dbuser wikiadmin CORRECT??)
sudo -u www-data php ./maintenance/runJobs.php --dbuser wikiadmin --conf ./LocalSettings.php --procs=3
If no comment is given, MediaWiki inserts the default comment "Importing image file". If no user is given, the default user is "Maintenance script".
The comment can contain REAL line breaks, since the text is inside double quotes.
To reduce server load, one may add the sleep option, the time in seconds to wait between files, e.g. --sleep=2.
It is advisable to redirect the output to a file (here > /var/www/v-species/o/my-import-media.log) so that the success of the import can be checked carefully afterwards. To monitor progress, open a second ssh shell and watch the size of the output file: if it stops growing, the import has stopped.
For user names, the name counts, not the ID. If the name does not exist yet, the files are imported nevertheless and the uploader is stored like an IP number.
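The advice about watching the size of the log file can be sketched as a small helper for the second shell. This is only an illustration: the function name log_is_growing and the default pause of 5 seconds are assumptions, not part of the import script.

```shell
#!/bin/bash
# Sketch (hypothetical helper): succeed if a log file grew during a short pause.
# Usage: log_is_growing <logfile> [<seconds>]
log_is_growing() {
    local log="$1" pause="${2:-5}"
    local before after
    before=$(wc -c < "$log")   # size in bytes before the pause
    sleep "$pause"
    after=$(wc -c < "$log")    # size in bytes after the pause
    [ "$after" -gt "$before" ]
}

# Example with the log file used on this page:
# log_is_growing /var/www/v-species/o/my-import-media.log 10 && echo "import still running"
```

Running tail -f on the log file in the second shell is a simpler alternative when the output itself is of interest.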
Import large files (e.g. ZIP)
# import ALL zip files in folder /tmp/DiversityGazetteer_010013/
cd /var/www/v-species/o # the wiki path (here OpenMedia http://species-id.net/openmedia/)
# HELP
# php ./maintenance/importImages.php --help --conf
# https://www.mediawiki.org/wiki/Manual:ImportImages.php
# --comment="..." becomes the wiki text on the page
# --user="..." a valid user name see page Special:UserList of a wiki
# --extensions= comma-separated list of allowable extensions, defaults defined in the Wiki's settings to $wgFileExtensions
sudo -u www-data php ./maintenance/importImages.php --conf ./LocalSettings.php --comment="
{{Metadata
| Type = Dataset
| Title = Dataset DiversityGazetteer
| Description = DiversityGazetteer is a tool to visualize places from a DiversityGazetteer database within a geographical environment.
| Locality =
| Identified By =
| Subject Sex =
| Scientific Names =
| Common Names =
| Language = zxx
| Creators =
| Subject Category =
| General Keywords =
}}" --user="A Correct User Name" --extensions=zip /tmp/DiversityGazetteer_010013/
Problem removing files from temporary import folder
A common problem after importing is that Linux is unable to delete the files in the temporary folder used for the import. A normal "rm *" or "rm -r *" terminates with the message "Argument list too long". The combined length of all arguments of a command is limited, typically to 128 kB (this applies to all commands, not just rm; the actual limit of a system can be queried with getconf ARG_MAX). A few thousand images with file names suitable as wiki titles easily exceed 128 kB. Solution:
# remove non-interactively
cd TheFolder && sudo find . -maxdepth 1 -name '*.?*' -exec rm {} ';'
# interactively, confirming each file (“rm -i”)
cd TheFolder && sudo find . -name '*' -exec rm -i {} ';'
# with a pipe “|” and xargs rm (-print0 and -0 keep file names with spaces intact)
cd TheFolder && sudo find . -maxdepth 1 -name '*.?*' -print0 | sudo xargs -0 rm
# check what would be deleted
find . -maxdepth 1 -name '*.?*' -exec echo {} ';'
# {}  → the found string
# ';' → final argument for -exec (BTW + works too but is somewhat fragile, see the manual of find «man find»)
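Whether a folder is likely to hit the argument limit can be estimated in advance. The sketch below is a rough illustration only: the function names are hypothetical, -printf requires GNU find, and the environment variables that also count against the limit are ignored.

```shell
#!/bin/bash
# Sketch (hypothetical helpers): estimate how many bytes "rm *" would put
# on the command line for a folder, and compare that with the system limit.

# Sum of the lengths of all names directly inside a folder, +1 per name.
names_total_bytes() {
    local dir="$1"
    # -mindepth 1 skips the folder itself; %f prints the bare file name (GNU find)
    find "$dir" -mindepth 1 -maxdepth 1 -printf '%f\n' | wc -c
}

# Succeeds if the estimate is below the limit reported by getconf ARG_MAX.
fits_on_command_line() {
    [ "$(names_total_bytes "$1")" -lt "$(getconf ARG_MAX)" ]
}

# Example:
# fits_on_command_line TheFolder || echo "use find -exec or xargs instead of rm *"
```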
Writing file names in a folder under Windows to text file
- Open command prompt (type: cmd in Windows Start button command box)
- Change directory, switch from DOS/OEM codepage 850 to ANSI, write all file names to file 000.txt:
cd current directory
chcp 1252
dir/w > 000.txt
- edit and process using text editor
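If the resulting list is processed on the Linux server rather than in a Windows text editor, it may need converting first. The sketch below assumes the file was written in codepage 1252 as described above and has Windows line endings; the function name win_list_to_utf8 is hypothetical.

```shell
#!/bin/bash
# Sketch (hypothetical helper): convert a Windows-1252 text file to UTF-8
# and drop the carriage returns of the Windows line endings.
# Usage: win_list_to_utf8 <file>   (result is written to standard output)
win_list_to_utf8() {
    iconv -f WINDOWS-1252 -t UTF-8 "$1" | tr -d '\r'
}

# Example:
# win_list_to_utf8 000.txt > 000_utf8.txt
```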
Batch renaming files in Linux
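Often a prefix has to be added to or dropped from file names. To add the prefix "XXX_" use (sudo cannot run the shell keyword "for", so it has to be placed before mv):
cd ../foto01 && for f in *.jpg; do sudo mv -i "$f" "XXX_$f"; done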
Should this have been executed twice, or should a prefix be removed, use rename (perl utilities) with a perl substitution string (s/fromold/tonew/), like
sudo rename 's/XXX_XXX_/XXX_/' *.jpg
The util-linux-ng rename has no such perl substitution mechanism, but only simple string replacements:
rename oldstring newstring whichFiles
rename .htm .html *.htm
rename image image0 image??   # → image001, image002, ...
But the same substitution can be done with a small bash script running in a for loop and using a pipe (“|”) to sed:
################################
# replace all white space characters ' ' with underscore characters '_'
# list all *.jpg but replace all ' ' with '|' → save in $i
# (the '|' in sed is a placeholder for later replacement)
for i in `ls *.jpg | sed 's/ /|/g'` ; do
  # save in $old the old file name
  old=`echo "$i" | sed 's/|/ /g'`
  # save in $new the new file name with '_'
  new=`echo "$i" | sed 's/|/_/g'`
  mv --force "$old" "$new"
done
# The original file name should not contain a “|” character, otherwise it is replaced too.
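Both renaming tasks of this section, dropping a prefix and replacing spaces, can also be done with bash parameter expansion alone, without rename or sed. This is an alternative sketch, not the method used above; the function names are hypothetical. It needs no placeholder character, so file names containing "|" are left intact.

```shell
#!/bin/bash
# Sketch (hypothetical helpers) using bash parameter expansion only.

# Remove one leading prefix from all matching files in the current folder.
# Usage: strip_prefix <prefix> <extension>, e.g. strip_prefix XXX_ jpg
strip_prefix() {
    local prefix="$1" ext="$2" f
    for f in "$prefix"*."$ext"; do
        [ -e "$f" ] || continue            # the glob matched nothing
        mv -i -- "$f" "${f#"$prefix"}"     # ${f#pattern} drops one leading match
    done
}

# Replace every space in the names of matching files with an underscore.
# Usage: spaces_to_underscores <extension>, e.g. spaces_to_underscores jpg
spaces_to_underscores() {
    local ext="$1" f new
    for f in *."$ext"; do
        [ -e "$f" ] || continue            # the glob matched nothing
        new="${f// /_}"                    # ${f//from/to} replaces all occurrences
        if [ "$f" != "$new" ]; then
            mv -i -- "$f" "$new"
        fi
    done
}
```

Because strip_prefix removes only one occurrence per run, calling it once on files named XXX_XXX_… repairs an accidental double prefix, like the rename example above.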
Change ownership from root to www-data
find /var/www/v-species/*/media/thumb/ -maxdepth 3 -user root -name '*' -exec sudo chown -R www-data:www-data '{}' ';'
find /var/www/v-species/*/*/media/thumb/ -maxdepth 3 -user root -name '*' -exec sudo chown -R www-data:www-data '{}' ';'
find /var/www/v-species/*/*/*/media/thumb/ -maxdepth 3 -user root -name '*' -exec sudo chown -R www-data:www-data '{}' ';'
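Before changing anything, it can be useful to list which thumbnail files are affected. The following sketch uses a single find with -path instead of three separate directory patterns; it is an assumption that this is acceptable, since unlike the commands above it applies no -maxdepth limit. The function name is hypothetical.

```shell
#!/bin/bash
# Sketch (hypothetical helper): list everything below any media/thumb/ folder
# under a base directory that is owned by the given user. Nothing is changed.
# Usage: list_thumbs_owned_by <base> [<owner>]   (owner defaults to root)
list_thumbs_owned_by() {
    local base="$1" owner="${2:-root}"
    # -path matches against the whole path, so any nesting depth is covered
    find "$base" -path '*/media/thumb/*' -user "$owner" -print
}

# Example with the base folder used on this page:
# list_thumbs_owned_by /var/www/v-species
```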
Preparing MetaData for image files from text files
#!/bin/bash
# concatenate several files to a prepared MediaWiki-Import
# This has to be manually corrected and assumes iso-8859-1 source files
# assume files (here e.g. file.meta) with metadata; defined in filterExtension
filterExtension="*.meta"
sourceEncoding="iso-8859-1"
targetEncoding="utf-8"
userName="WikiSysop"
comment="Bot generated metadata update"
xmlWriteToFile="allmetadata_utf8.xml"
# number of all files
nFiles=$(ls -1 ${filterExtension} 2>/dev/null | wc -l)
# some info
echo "Concatenate ${nFiles} files as MediaWiki Import into ${xmlWriteToFile}…"
# write the header to ${xmlWriteToFile}
cat > ${xmlWriteToFile} <<HEADER
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.5/ http://www.mediawiki.org/xml/export-0.5.xsd" version="0.5" xml:lang="en">
<siteinfo>
<sitename>OpenMedia</sitename>
<base>http://species-id.net/openmedia/Main_Page</base>
<generator>MediaWiki 1.18.0</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="-1" case="first-letter">Special</namespace>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Talk</namespace>
<namespace key="2" case="first-letter">User</namespace>
<namespace key="3" case="first-letter">User talk</namespace>
<namespace key="4" case="first-letter">OpenMedia</namespace>
<namespace key="5" case="first-letter">OpenMedia talk</namespace>
<namespace key="6" case="first-letter">File</namespace>
<namespace key="7" case="first-letter">File talk</namespace>
<namespace key="8" case="first-letter">MediaWiki</namespace>
<namespace key="9" case="first-letter">MediaWiki talk</namespace>
<namespace key="10" case="first-letter">Template</namespace>
<namespace key="11" case="first-letter">Template talk</namespace>
<namespace key="12" case="first-letter">Help</namespace>
<namespace key="13" case="first-letter">Help talk</namespace>
<namespace key="14" case="first-letter">Category</namespace>
<namespace key="15" case="first-letter">Category talk</namespace>
<namespace key="102" case="first-letter">Property</namespace>
<namespace key="103" case="first-letter">Property talk</namespace>
<namespace key="106" case="first-letter">Form</namespace>
<namespace key="107" case="first-letter">Form talk</namespace>
<namespace key="108" case="first-letter">Concept</namespace>
<namespace key="109" case="first-letter">Concept talk</namespace>
<namespace key="170" case="first-letter">Filter</namespace>
<namespace key="171" case="first-letter">Filter talk</namespace>
<namespace key="198" case="first-letter">Internal</namespace>
<namespace key="199" case="first-letter">Internal talk</namespace>
<namespace key="200" case="first-letter">Portal</namespace>
<namespace key="201" case="first-letter">Portal talk</namespace>
<namespace key="202" case="first-letter">Bibliography</namespace>
<namespace key="203" case="first-letter">Bibliography talk</namespace>
<namespace key="204" case="first-letter">Draft</namespace>
<namespace key="205" case="first-letter">Draft talk</namespace>
<namespace key="206" case="first-letter">Submission</namespace>
<namespace key="207" case="first-letter">Submission talk</namespace>
<namespace key="208" case="first-letter">Reviewed</namespace>
<namespace key="209" case="first-letter">Reviewed talk</namespace>
<namespace key="274" case="first-letter">Widget</namespace>
<namespace key="275" case="first-letter">Widget talk</namespace>
</namespaces>
</siteinfo>
HEADER
progressInfo="."
nFile=0
for metafile in ${filterExtension}; do
echo "<page><title>File:${metafile}</title>" >> ${xmlWriteToFile}
date=`date +'%Y-%m-%dT%H:%M:%SZ'` # 2011-12-19T10:43:11Z
echo "<revision><timestamp>${date}</timestamp>" >> ${xmlWriteToFile}
echo "<contributor><username>${userName}</username></contributor><comment>${comment}</comment>" >> ${xmlWriteToFile}
text=`iconv -f ${sourceEncoding} -t ${targetEncoding} "${metafile}"`
#text=`cat "${metafile}.utf8"`
echo "<text xml:space='preserve'>${text}</text>" >> ${xmlWriteToFile}
echo "</revision>" >> ${xmlWriteToFile}
echo "</page>" >> ${xmlWriteToFile}
nFile=$(( $nFile + 1 ))
# progress info 100 dots then line break with modulo
if [ $(( $nFile % 100 )) == 0 ]; then
echo "$progressInfo"
else
echo -n "$progressInfo"
fi
done
echo "</mediawiki>" >> ${xmlWriteToFile}
# some info
echo -e "\n … (done)"
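The script writes the metadata text into the XML file unchanged, so characters such as & or < in a .meta file would produce invalid XML and have to be corrected manually. A minimal sketch of escaping them automatically follows; the function name xml_escape is hypothetical and not part of the script above.

```shell
#!/bin/bash
# Sketch (hypothetical helper): escape the characters that are special in
# XML text content. Reads standard input, writes standard output.
xml_escape() {
    # & must be replaced first, otherwise the & of &lt; and &gt; would be escaped again
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Possible use in the loop above, in place of the plain iconv call:
# text=`iconv -f ${sourceEncoding} -t ${targetEncoding} "${metafile}" | xml_escape`
```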