Caching Wikimedia Commons file to local repository


Problem and Motivation

A wiki can use images from its own media repository or from repositories shared by multiple consumers. These shared repositories may either provide direct database access (typical for repositories on the same wiki farm; the local example is http://www.species-id.net/openmedia) or be accessed through an XML-based web-service API. A very large shared media repository accessible through such an API is Wikimedia Commons.
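As an illustration of the web-service access path, the following Python sketch (using the requests library) asks the standard MediaWiki imageinfo API on Commons for the URL, size and MIME type of the original file; the file name and User-Agent string are only placeholders.

<source lang="python">
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"
# Wikimedia asks API clients to identify themselves; this string is a placeholder.
HEADERS = {"User-Agent": "commons-cache-sketch/0.1 (illustrative example)"}

def commons_imageinfo(title):
    """Fetch URL, size and MIME type of the original file from Commons."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "imageinfo",
        "iiprop": "url|size|mime",
        "format": "json",
    }
    data = requests.get(COMMONS_API, params=params, headers=HEADERS).json()
    # Results are keyed by an internal page id; there is exactly one page here.
    page = next(iter(data["query"]["pages"].values()))
    return page["imageinfo"][0]

if __name__ == "__main__":
    info = commons_imageinfo("File:Example.jpg")
    print(info["url"], info["mime"], info["width"], info["height"])
</source>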

In practice, however, local pages with many images from Wikimedia Commons were found to perform poorly for logged-in users (users who are not logged in receive a cached copy of the page and are therefore not affected). When a page is rendered for a logged-in user, an API call is made for each embedded image. When the shared repository is under high load, some of these calls are answered slowly or not at all.
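To illustrate where these calls come from, the sketch below lists the files embedded on one local page and checks which of them the local wiki resolves through a shared repository rather than its own; the local api.php URL is a placeholder, not the actual biowikifarm address.

<source lang="python">
import requests

# api.php of the local wiki; this URL is a placeholder, not a real endpoint.
LOCAL_API = "https://example.org/w/api.php"
HEADERS = {"User-Agent": "commons-cache-sketch/0.1 (illustrative example)"}

def api_get(**params):
    """Single GET request against the local MediaWiki api.php."""
    params["format"] = "json"
    return requests.get(LOCAL_API, params=params, headers=HEADERS).json()

def embedded_files(page_title):
    """Titles of all File: pages embedded on one local page."""
    data = api_get(action="query", prop="images", titles=page_title, imlimit="max")
    page = next(iter(data["query"]["pages"].values()))
    return [image["title"] for image in page.get("images", [])]

def shared_files(file_titles):
    """Subset of titles that are served from a shared (foreign) repository."""
    if not file_titles:
        return []
    # 'imagerepository' is 'local' for files in the wiki's own repository and
    # 'shared' for files coming from a foreign repository such as Commons.
    # (The API accepts at most 50 titles per request.)
    data = api_get(action="query", prop="imageinfo",
                   titles="|".join(file_titles[:50]))
    return [page["title"] for page in data["query"]["pages"].values()
            if page.get("imagerepository") == "shared"]

if __name__ == "__main__":
    print(shared_files(embedded_files("Main Page")))
</source>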

MediaWiki provides a built-in caching mechanism. However, this mechanism does not allow

Solution

For each image that is accessed through such API calls, a local copy of the original, unscaled image is copied (with some delay) to the local repository, together with a copy of the metadata page, to which a note about the copy and links back to the original are added. The image name on Commons and on OpenMedia is identical. Thus, the next time the page is rendered, the database-shared repository is consulted first; if it reports that the item is present, the web-service-based Commons repository is not contacted at all and no delays occur.
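A minimal Python sketch of this copy step is given below. It is not the actual biowikifarm script: the local api.php URL and the wording of the appended note are placeholders, the requests session is assumed to be already logged in as an account with upload rights, and error handling is omitted.

<source lang="python">
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"
LOCAL_API = "https://example.org/w/api.php"   # placeholder for the local wiki
HEADERS = {"User-Agent": "commons-cache-sketch/0.1 (illustrative example)"}

def fetch_original_and_description(title):
    """Get the unscaled original file and the description wikitext from Commons."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "imageinfo|revisions",
        "iiprop": "url",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
    }
    data = requests.get(COMMONS_API, params=params, headers=HEADERS).json()
    page = next(iter(data["query"]["pages"].values()))
    file_url = page["imageinfo"][0]["url"]
    wikitext = page["revisions"][0]["slots"]["main"]["*"]
    file_bytes = requests.get(file_url, headers=HEADERS).content
    return file_bytes, wikitext

def copy_to_local_repository(session, title):
    """Upload the Commons original under the same name on the local wiki."""
    file_bytes, wikitext = fetch_original_and_description(title)

    # Note on the copy plus a link back to the original (wording is illustrative).
    note = ("\n\nThis file is a cached copy of [[commons:%s|the original on "
            "Wikimedia Commons]]." % title)

    # A CSRF token is required for the upload.
    token = session.get(LOCAL_API, headers=HEADERS, params={
        "action": "query", "meta": "tokens", "format": "json",
    }).json()["query"]["tokens"]["csrftoken"]

    filename = title.split(":", 1)[1]          # keep the same name, minus "File:"
    result = session.post(LOCAL_API, headers=HEADERS, data={
        "action": "upload",
        "filename": filename,
        "text": wikitext + note,
        "comment": "Cache copy of the Wikimedia Commons original",
        "token": token,
        "ignorewarnings": "1",
        "format": "json",
    }, files={"file": (filename, file_bytes)})
    return result.json()

if __name__ == "__main__":
    with requests.Session() as session:
        # Authentication (e.g. action=login with a bot password) is omitted here.
        print(copy_to_local_repository(session, "File:Example.jpg"))
</source>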

Questions

How do we find out which images are used on local wiki pages?

Are images copied immediately, or with a delay in a batch operation?

Is the copying service run manually, as a cron job, or embedded in the wiki operation?

How are the pieces (Python, MediaWiki) glued together? Does the script use only the API, or also direct database access?

Manol notes: The standard tools (from the pywiki library) do not do exactly what we want for these scripts. I will use a combination of the existing pywiki scripts. I could also write a completely new script, but that would only be a temporary solution (because of frequent changes in the MediaWiki API and database).
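As a rough sketch of what such a combination could look like in today's pywikibot terms (the 2010 pywiki toolkit used different module names, so the calls below are assumptions and not the scripts referred to above), a single file transfer might be written as follows; "localwiki" stands for a pywikibot family file that has to be generated for the local wiki first.

<source lang="python">
import pywikibot

# Source: Wikimedia Commons. Target: the local repository wiki, for which a
# pywikibot family file (here called "localwiki") is assumed to exist.
commons = pywikibot.Site("commons", "commons")
local = pywikibot.Site("en", "localwiki")

def transfer_file(title):
    """Copy one original file and its description page from Commons."""
    source = pywikibot.FilePage(commons, title)
    target = pywikibot.FilePage(local, source.title())

    # Keep the Commons description and append a note pointing back to it.
    text = source.text + ("\n\nThis file is a cached copy of the original at %s."
                          % source.full_url())

    # Upload directly from the Commons file URL; this requires upload rights
    # (and upload-by-URL being enabled) on the local wiki.
    local.upload(target,
                 source_url=source.get_file_url(),
                 text=text,
                 comment="Cache copy of the Wikimedia Commons original",
                 ignore_warnings=True)

if __name__ == "__main__":
    transfer_file("File:Example.jpg")
</source>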