Caching Wikimedia Commons files to a local repository
Problem and Motivation
A wiki can use images from its own media repository or from repositories shared by multiple consumers. These shared repositories may be accessed either through direct database access (typical for repositories on the same wiki farm; the local example is http://www.species-id.net/openmedia) or through an XML-based web service API. A very large shared media repository accessible through such an API is Wikimedia Commons.
In practice it was found, however, that local pages with many images from Wikimedia Commons perform poorly for logged-in users (users who are not logged in receive a cached copy of the page and, if that cached copy exists, are not affected). When a page is rendered for a logged-in user, an API call is made for each embedded image; when the shared repository is under high load, some of these calls may be answered late or not at all.
MediaWiki provides a built-in caching mechanism; however, it only caches scaled thumbnails (see below) and does not place the original files in the local repository.
Solution
For each shared image (which would otherwise be fetched through remote API calls, with some delay), a local copy of the original, non-scaled image is placed in the local repository, together with a copy of the metadata page, to which a note about the copy and links back to the original are added. The image name on Commons and on OpenMedia will be identical. Thus, the next time the page is rendered, the database-shared repository is checked first; if it reports the presence of the item, the web-service-based Commons repository is no longer accessed and no delays occur.
Questions
How do we learn about images used in local web pages?
Are images copied immediately, or with delay in a batch operation?
Is the copying service manual, a cron job, or embedded in the wiki operation?
How are the pieces (Python, MediaWiki) glued together? Does the service use only the API, or also direct database access?
Proposed solution
Caching shared images from Commons involves two problems:
- detecting which images come from Commons
- caching the images (making them local)
Detecting images
There are two ways to do this:
A custom or extended ForeignAPIRepo handler that records image names
It turns out that ForeignAPIRepo.php is the only place in the MediaWiki execution path where the shared file names are known (there are no records in the database and no suitable hooks), so we must use it to find shared images.
The best way to extend it is to do nothing more than record the shared image names quickly and reliably (in the file system and/or database).
It is also a bad idea to perform the actual caching/mirroring operation in the ForeignAPIRepo handler. The repo module sits at a very low level (many higher layers use it, and a single high-level request may call the image repo API many times), so performing a slow operation there (such as fetching a remote image) could degrade the performance of the whole wiki.
Therefore, downloading the full-sized version and importing it locally should be done asynchronously, outside of ForeignAPIRepo.
The current ForeignAPIRepo.php with its cache
At the moment, ForeignAPIRepo.php already performs a kind of "recording": it saves image thumbnails in
$wgUploadDirectory/thumb/image_name/...
(the path component 'thumb' is hard-coded in the source of ForeignAPIRepo.php).
This ONLY works if the cache is enabled:
apiThumbCacheExpiry > 0
So if the cache is on, we can use this side effect: the directory names (*.jpg|png|svg, ...) created in media/thumb/ are the names of the shared images.
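For example, the image names recorded by this side effect could be listed directly from the cache directory. This is only a sketch; the WIKIDIR path is an example taken from the installation described below:
# List the shared image names recorded by the ForeignAPIRepo thumb cache.
# WIKIDIR is the base path of a wiki that uses the remote repository (example path).
WIKIDIR="/var/www/v-k2n/w"
find "$WIKIDIR/media/thumb" -mindepth 1 -maxdepth 1 -type d -printf '%f\n'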
Cache the images
Once we have the name of an image, we first have to download the original (full-sized) file from the remote repo; second, we have to import the image into the local wiki (so that the image becomes local).
Image download
- query the remote repo through the MediaWiki API and get the URL of the original
- download the file from that URL (a sketch of both steps follows)
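A minimal sketch of these two steps, assuming curl and jq are available; the file name is only an illustration:
# Step 1: query the remote repo (Commons) for the URL of the original, full-sized file.
FILE="Nuvola_apps_knewsticker.png"
API="https://commons.wikimedia.org/w/api.php"
URL=$(curl -s "${API}?action=query&titles=File:${FILE}&prop=imageinfo&iiprop=url&format=json" | jq -r '.query.pages[].imageinfo[0].url')
# Step 2: download the file from that URL.
curl -s -o "/var/tmp/${FILE}" "$URL"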
Import image
The maintenance script maintenance/importImages.php (revision 62087, Sun Feb 7 16:10:14 2010) has a new smart-import feature:
- a new parameter --source-wiki-url points to a wiki containing the metadata (original uploader, upload comment, etc.), and the script fetches that metadata from the remote wiki (an example invocation is sketched below)
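An invocation of the smart import could look roughly like this. This is only a sketch, using the destination-wiki paths defined below; the --comment option is an example, and the exact options should be checked against the installed MediaWiki version:
cd /var/www/v-species/o/   # base path of the local shared repository wiki
php maintenance/importImages.php --conf ./LocalSettings.php \
    --source-wiki-url="http://commons.wikimedia.org/" \
    --comment="Cached copy of a Wikimedia Commons file" \
    /var/www/v-species/o/media/commons-fullsize/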
Implementation
All needed files are in /usr/share/mediawiki/ext-LOCAL-svn/CommonsMediaCaching
To detect remote images that should be copied to the local shared repository, we currently use the file names that the ForeignAPIRepo cache (built into MediaWiki) creates in the file system. These image names are found under: WIKIDIR/media/thumb/??*.*
cacheCommons.sh
This is a Bash shell script that runs all actions; it is intended to be run as a cron job.
A cron job file (which you have to copy to /etc/cron.d/):
wiki-cache-commons.cron
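The file itself is not reproduced here; a hypothetical /etc/cron.d/ entry of this kind could look like the following (the time of day is an assumption; www-data is the web server user used in the testing instructions below):
# Run the caching script once per night as the web server user (example schedule).
30 3 * * *   www-data   /usr/share/mediawiki/ext-LOCAL-svn/CommonsMediaCaching/cacheCommons.sh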
cacheCommons.sh first downloads all remote images for the wikis specified in the WIKILIST parameter (see below), then imports these images into the local shared media repository. It uses all of the MediaWiki scripts described below; a sketch of the overall flow follows the parameter list.
In this script you must set these parameters (as variables in the script):
- WIKILIST="/var/www/v-k2n/w/ /var/www/v-k2n/h/" (etc.)
- This is the list of base paths of the wikis that use a remote image repository and which should be handled by the caching mechanism
- DST_WIKI="/var/www/v-species/o/"
- This is the base path of the local shared media repository wiki. The images will be imported into that wiki.
- SHARED_DOWN_DIR="$DST_WIKI/media/commons-fullsize/"
- This directory/folder contains the original files downloaded from the remote repo.
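The following is only a rough sketch of the flow implemented in cacheCommons.sh, based on the parameters above and the scripts described below; the actual script in CommonsMediaCaching is authoritative:
#!/bin/bash
# Sketch of the cacheCommons.sh flow (not the real script).
WIKILIST="/var/www/v-k2n/w/ /var/www/v-k2n/h/"
DST_WIKI="/var/www/v-species/o/"
SHARED_DOWN_DIR="$DST_WIKI/media/commons-fullsize/"

# 1. For every wiki, download the full-sized original of each image name
#    recorded in its ForeignAPIRepo thumb cache.
for WIKI in $WIKILIST; do
    cd "$WIKI" || continue
    for NAME in $(ls media/thumb/ 2>/dev/null); do
        php maintenance/downImagesCommons.php --conf ./LocalSettings.php --todir "$SHARED_DOWN_DIR" "$NAME"
    done
done

# 2. Import the downloaded originals into the local shared media repository.
cd "$DST_WIKI" || exit 1
php maintenance/importImagesCommons.php --conf ./LocalSettings.php --source-wiki-url="http://commons.wikimedia.org/" "$SHARED_DOWN_DIR"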
downImagesCommons.php
The script takes an image name as a parameter and downloads it from Commons to the directory specified with the --todir parameter:
cd WIKIDIR
php maintenance/downImagesCommons.php --conf ./LocalSettings.php --todir /var/tmp/ some-commons-image.jpg
This script must stay in the maintenance/ directory because it uses the MediaWiki maintenance libraries.
importImagesCommons.inc
This script is a close (originally exact) copy of importImages.inc. It was modified to obtain metadata from the image metadata rather than from the image comment. See https://bugzilla.wikimedia.org/show_bug.cgi?id=30582
It is used by importImagesCommons.php
importImagesCommons.php
This is a modified version of importImages.php (http://svn.wikimedia.org/viewvc/mediawiki?view=revision&revision=62087). The following modification was made:
- text calling a documentation template (Cached Commons Media) is added to each imported file, to maintain the link back to the original and keep the copy transparent for the purposes of the Creative Commons license.
You can see the patch/diff by running:
diff -Naur importImages.php.62087 importImagesCommons.php
Installation
(New code version for a 1.20/git MediaWiki installation! NOTE: with new MediaWiki versions, the importImagesCommons.* files may have to be updated, i.e. the changes re-done in the new version. Compare e.g. importImagesCommons_1.20_unchanged_for_comparison.inc with importImagesCommons.inc to see the actual changes made in v. 1.20.)
cd /usr/share/mediawiki20/maintenance/; ### OR: ### cd /usr/share/mediawikistaging/maintenance/;
# Hard link (or copy) the scripts to the maintenance folder of the
# destination wiki (into which the images will be imported):
# NOTE: -l is a hard link (-s is a soft link, which will not work here!)
# However, we had trouble with hard links, so we copy instead!
# Copy (remove if hard links already exist, then copy the new files):
sudo rm downImagesCommons.php importImagesCommons.inc importImagesCommons.php;
sudo cp /usr/share/mediawiki/ext-LOCAL-svn/CommonsMediaCaching/downImagesCommons.php .;
sudo cp /usr/share/mediawiki/ext-LOCAL-svn/CommonsMediaCaching/importImagesCommons.php .;
sudo cp /usr/share/mediawiki/ext-LOCAL-svn/CommonsMediaCaching/importImagesCommons.inc .;
# Link the cron job file:
cd /etc/cron.d/
sudo cp -s ...ext-LOCAL-svn/CommonsMediaCaching/*.cron .
# TEST outside of the cron job, use (do not run as sudo = root!):
sudo su www-data
/usr/share/mediawiki/ext-LOCAL-svn/CommonsMediaCaching/cacheCommons.sh
Note: if testing reports that some thumbnails cannot be removed, those thumbnails were created as root. The fix for most wikis is to change the owner and group recursively:
sudo chown www-data:www-data /var/www/*/?/media/* -R; sudo chown www-data:www-data /var/www/*/*/?/media/* -R;
(OUTDATED) Discussion
'[...] quickly upload images from Commons. This extension demonstrates how to extend the new upload system.' (quoted from the extension's description)
Indeed, this is a very small extension. It just copies the image, without comments or history. I can easily add JSON API calls to fetch the info.
Problems:
- the class UploadFromUrl does not allow setting image attributes; it only wants a URL :-(
- the new upload API is designed especially for submission from the Special:Upload page (with its new extensions), i.e. a console tool cannot easily use this API
I already know the calls needed to import into MediaWiki, so I will implement the new script (not a plugin) without the help of these APIs. The problem with this approach is that it is harder to support across MediaWiki versions ...
I have to add calls to the Wikimedia API to fetch the text and history for every image. I think all of the info is available from this API call (for the image http://commons.wikimedia.org/wiki/File:Nuvola_apps_knewsticker.png): http://commons.wikimedia.org/w/api.php?titles=Image:Nuvola%20apps%20knewsticker.png&iiprop=timestamp%7Cuser%7Ccomment%7Curl%7Csize%7Csha1%7Cmetadata%7Cmime&prop=imageinfo&format=jsonfm&action=query&iilimit=10
This extension also has a graphical UI. I really hesitate over whether we need a GUI... I prefer to do the caching automatically, and the best UI for that is the console (if the plugin has a GUI, I have to call it by bot, or by hand every day :) ).
I am trying to extract the important code from this plugin that performs the insert into MediaWiki and use it in my console plugin. I am having some trouble because the plugin does not really perform the insert itself; it only calls the action from the standard Special:Upload page (through new 1.16 hooks). I really do not need Special:Upload because the files are already on the server (I can get them from Wikimedia with a script). I have to extract the code that inserts into MediaWiki from the Special:Upload "API" source code (http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/upload/).
The new API has an UploadFromUrl class; I can use it and do not need to download the image in advance. I found an extension that demonstrates this new API (it is for Commons images! and is not documented yet!!): http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/UploadFromCommons/
I can base my work on this extension. It also imports images from another source into MediaWiki. The import is implemented in a very strange way: it calls the Special:Upload procedure indirectly through an emulated request (the author wrote that this is the only way), and this method is very unstable (and it is hard to detect errors from the sub-request).
I found that the author had a patch for 1.16, and in 1.16 Special:Upload is already available as an API, so I can call it directly :) We are using 1.16, and this simplifies things. I found an extension that uses the new 1.16 upload API, see 2.
I liked scripts such as imagecopy.py (copies images from a Wikimedia wiki to Commons), upload.py (uploads images to a wiki), and imagetransfer.py (copies images to another wiki).
But there are a few problems with these bots and scripts, and I prefer instead to use a script/plugin with direct access (through the API) to MediaWiki:
- the bots perform actions very indirectly: they simulate requests, parse HTML from the responses, and act as normal users (user agents). This is too unstable for me and cannot survive across MediaWiki versions without many undocumented changes.
- the Python API is unofficial and unstable. I strongly prefer the official MediaWiki API, which is in PHP.