Difference between revisions of "Mediawiki XML page importing"

From Biowikifarm Metawiki
Revision as of 23:10, 18 November 2010

Note: This page covers importing the text of wiki pages in MediaWiki's XML format. For images and other files, see: Batch importing files into MediaWiki.


Notes:

1. It is possible to create MediaWiki XML from Microsoft Access tables and queries. However, when pasting the result into a text editor, observe the following:

  1. Putting everything into one field will often fail, because calculated fields that exceed a certain size cause problems.
  2. Exporting in multiple columns may work better. The pasted text then needs the following fixes:
    1. Remove the first line, which contains the field names.
    2. Remove tabulator characters.
    3. Fix the double-quote escaping, both in XML attributes (e.g. the "preserve" value) and inside the element content: replace "<text with <text and </page>" with </page> (normally not necessary for "<page> and </comment>"), then replace "" with ".

2. Normally, multiple revision elements are in a single page element. However, it is possible to import them in separate page elements (this greatly simplifies some imports!).
3. When importing through the web interface, additional revisions are created with the date of the import. In that case the sequence of imports rather than the revision dates determines the order, because these additional revisions get the date/time of the import. Avoid the web interface when importing multiple revisions!
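
The clean-up steps for a pasted Access export listed above can be sketched as a small shell filter. This is only an illustration: the function name clean_access_export is made up, and it assumes a tab-separated export in which only the fields starting with <text and ending with </page> were wrapped in double quotes.

```shell
# Hypothetical helper: clean a tab-separated Access export read from stdin.
clean_access_export() {
  # 1d            : drop the first line (field names)
  # "<text        : strip the quote Access puts before the text field
  # </page>"      : strip the quote Access puts after the text field
  # "" -> "       : undo the doubled quotes inside the field
  sed -e '1d' \
      -e 's/"<text/<text/g' \
      -e 's|</page>"|</page>|g' \
      -e 's/""/"/g' | tr -d '\t'   # finally remove all tabulator characters
}

# Usage (file names are assumptions):
#   clean_access_export < access_export.txt > import.xml
```

The quote around the text field is stripped before the doubled quotes are collapsed, so that a doubled quote at the very start or end of the field is not mistaken for the field delimiter.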

Note: HTML entities must be escaped, e.g. &nbsp; must be encoded as &amp;nbsp;!
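
A minimal sketch of this escaping, assuming the page text is available as plain wikitext before it is placed inside the text element. The function name xml_escape is hypothetical; it escapes the ampersand first, so that the ampersands introduced by the later replacements are not escaped a second time.

```shell
# Hypothetical helper: escape XML special characters in text read from stdin.
xml_escape() {
  # "&" must be handled first; "\&" in the replacement is a literal ampersand.
  sed -e 's/&/\&amp;/g' \
      -e 's/</\&lt;/g' \
      -e 's/>/\&gt;/g'
}

# Usage (file names are assumptions):
#   xml_escape < page_text.wiki > page_text.escaped
```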

Creating an export to be then reimported

Use Special:Export or:

 php dumpBackup.php --full >d:\backup\dump.xml
 cd /var/www/testwiki; php ./maintenance/dumpBackup.php --full --conf ./LocalSettings.php > ./EXPORT.xml

(See also the MediaWiki manual page for DumpBackup.php: http://www.mediawiki.org/wiki/Manual:DumpBackup.php)

Importing Data from Command Line Interface

This is the preferred method, as it does not create additional versions (compare next section).

Import is meant to work directly with compressed XML files (7z, zip or bzip). However, .7z presently does NOT work. Transfer the XML file to the server and execute (example):

 cd /var/www/v-xxx/w; sudo php ./maintenance/importDump.php /var/www/v-xxx/w/import.xml --conf ./LocalSettings.php
 cd /var/www/v-xxx/w; sudo php ./maintenance/rebuildall.php --conf ./LocalSettings.php
 cd /var/www/v-xxx/w; sudo php ./maintenance/runJobs.php    --conf ./LocalSettings.php --procs=3
 # e.g. FOR OPENMEDIA:
 cd /var/www/v-species/o; sudo php ./maintenance/importDump.php ./atmp/import.xml --conf ./LocalSettings.php
 cd /var/www/v-species/o; sudo php ./maintenance/rebuildall.php --conf ./LocalSettings.php
 cd /var/www/v-species/o; sudo php ./maintenance/runJobs.php    --conf ./LocalSettings.php --procs=3
 # e.g. FOR Naturführer:
 cd /var/www/v-on/w; sudo php ./maintenance/importDump.php ./atmp/import.xml --conf ./LocalSettings.php
 cd /var/www/v-on/w; sudo php ./maintenance/rebuildall.php --conf ./LocalSettings.php
 cd /var/www/v-on/w; sudo php ./maintenance/runJobs.php    --conf ./LocalSettings.php --procs=3

Rebuilding the internal indices is necessary after import. rebuildall may be slow; if necessary, it can be replaced with

 cd /var/www/v-on/w; sudo php ./maintenance/rebuildrecentchanges.php    --conf ./LocalSettings.php

RunJobs: if the import contains complex template relations, or when updating template relations or data entries in templates, it may be necessary to empty the job queue manually; check Special:Statistics in the wiki. Note: "--procs=3" will run three jobs in parallel, provided the server has the necessary number of processor cores.
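
The three-step sequence above (import, rebuild, run jobs) could be wrapped in a small function. This is a sketch, not part of MediaWiki: the names run, import_wiki and DRY_RUN are made up. Unlike the ";" used in the examples, it chains the steps with "&&", so that a failed import is not followed by a rebuild, and it runs in a subshell so that the caller's working directory is unchanged.

```shell
# Hypothetical helper: run one command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: import a dump, then rebuild, stopping at the first failure.
# $1 = wiki directory (contains LocalSettings.php), $2 = path to the XML dump.
import_wiki() (
  cd "$1" || exit 1
  run sudo php ./maintenance/importDump.php "$2" --conf ./LocalSettings.php &&
  run sudo php ./maintenance/rebuildall.php --conf ./LocalSettings.php &&
  run sudo php ./maintenance/runJobs.php --conf ./LocalSettings.php --procs=3
)

# Usage (paths taken from the examples above):
#   import_wiki /var/www/v-on/w ./atmp/import.xml
```

Setting DRY_RUN=1 beforehand prints the three commands without executing them, which may help to check the paths first.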

Important: for all batch imports, the revision date must be set to something newer than all existing revisions; otherwise MediaWiki will sort the imported revision behind the existing revisions. The id in the imported XML is not required, however.
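
As an illustration of the revision date, the following sketch writes a page element whose timestamp is the current UTC time (which is newer than existing revisions unless those carry future dates) and which contains no id elements. The function names are hypothetical, the title and text are assumed to be XML-escaped already, and whether the importer accepts this reduced set of elements without further ones (e.g. a contributor) is an assumption; the page element still has to be placed inside the mediawiki root element of an import file.

```shell
# Hypothetical helper: print the current UTC time in the form used by
# <timestamp> elements, e.g. 2010-11-18T23:10:00Z.
import_timestamp() {
  date -u +%Y-%m-%dT%H:%M:%SZ
}

# Hypothetical helper: print one <page> element with a single revision.
# $1 = page title, $2 = page text (both already XML-escaped).
page_element() {
  printf '<page>\n  <title>%s</title>\n  <revision>\n    <timestamp>%s</timestamp>\n    <text xml:space="preserve">%s</text>\n  </revision>\n</page>\n' \
    "$1" "$(import_timestamp)" "$2"
}

# Usage (title and text are made up):
#   page_element "Test page" "Some wiki text"
```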

Batch deleting pages

Example:

 sudo php ./maintenance/deleteBatch.php \
   --conf ./LocalSettings.php \
   -r "remove wrong resolution" ./maintenance/deleteBatch.txt

deleteBatch.txt contains only the page titles (here: the file names), one per line.
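
A list file of this kind could be generated as sketched below. The function name file_titles is hypothetical, and the sketch assumes that the list expects full page titles including the File: prefix, which is not stated explicitly here.

```shell
# Hypothetical helper: print one "File:" page title per argument, one per line.
file_titles() {
  for name in "$@"; do
    printf 'File:%s\n' "$name"
  done
}

# Usage (the file names are made up):
#   file_titles "Map 01.jpg" "Map 02.jpg" > ./maintenance/deleteBatch.txt
```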

Note that for File:x.jpg pages, this deletes both the file and its page, but the file will still seem to exist when called. Only manually clicking "delete all" in the file history completes the deletion. This is probably a bug, since the PHP code does attempt to handle file deletions.

Importing Data through Special Pages Web Interface

The web interface under Special:Import creates extra revisions (in addition to those imported) that designate the importing user. If you do not want to document who did a transfer, it may therefore be desirable to use the command-line version (see above). For the web import it may be desirable to create a special "Import-User", so that the name documents authorship better than a normal username used during upload of the XML file. The import creates two revisions for each page: revision 1 is the imported revision, revision 2 is the revision documenting the import process. If the imported data alone already document this (e.g. when they already use Import-User and an appropriate comment), it is possible to delete the second revisions in the database (assuming Import-User has ID=4):

 DELETE FROM PREFIX_revision
   WHERE PREFIX_revision.rev_user=4 AND PREFIX_revision.rev_minor_edit=1;
 -- Then fix the latest revision stored in page, i.e. for pages whose
 -- page_latest no longer points to an existing revision:
 UPDATE PREFIX_revision AS R2 INNER JOIN
   (PREFIX_page LEFT JOIN PREFIX_revision AS R1 ON PREFIX_page.page_latest=R1.rev_id)
   ON R2.rev_page=PREFIX_page.page_id
   SET page_latest=R2.rev_id WHERE R1.rev_id IS NULL;

See also: Batch importing files into MediaWiki (that is: images, etc.)