Difference between revisions of "Elasticsearch"

From Biowikifarm Metawiki
(first documentation)
 
 
(20 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
== Requirements ==
 
== Requirements ==
 +
 
{| class="wikitable"
 
{| class="wikitable"
 
|-
 
|-
Line 8: Line 9:
 
| java, composer || > REL1_28 || 2.x
 
| java, composer || > REL1_28 || 2.x
 
|}
 
|}
 +
 +
Extension:CirrusSearch can handle ''only certain versions'' of elasticsearch, check the extension’s documentation and README for that. Do '''not install other versions''' of elasticsearch.
  
 
Details:
 
Details:
 
* [https://www.mediawiki.org/wiki/Extension:CirrusSearch Extension:CirrusSearch] <br />require ElasticSearch 1.x (below version REL1_28)<br />require ElasticSearch 2.x since [https://phabricator.wikimedia.org/rECIR77d8f75681917544e15dc0a04fcc9a6a24a8d508 REL1_28, REL1_29, es2.x, es5, master, wmf/1.30.0-wmf.10, wmf/1.30.0-wmf.5, wmf/1.30.0-wmf.6, wmf/1.30.0-wmf.7, wmf/1.30.0-wmf.9]
 
* [https://www.mediawiki.org/wiki/Extension:CirrusSearch Extension:CirrusSearch] <br />require ElasticSearch 1.x (below version REL1_28)<br />require ElasticSearch 2.x since [https://phabricator.wikimedia.org/rECIR77d8f75681917544e15dc0a04fcc9a6a24a8d508 REL1_28, REL1_29, es2.x, es5, master, wmf/1.30.0-wmf.10, wmf/1.30.0-wmf.5, wmf/1.30.0-wmf.6, wmf/1.30.0-wmf.7, wmf/1.30.0-wmf.9]
 
* [https://www.mediawiki.org/wiki/Extension:Elastica Extension:Elastica], composer
 
* [https://www.mediawiki.org/wiki/Extension:Elastica Extension:Elastica], composer
 
  
 
== Installation ElasticSearch ==
 
== Installation ElasticSearch ==
Line 63: Line 65:
 
  # Continue with installation? [y/N] y
 
  # Continue with installation? [y/N] y
 
  # Installed mapper-attachments into /usr/share/elasticsearch/plugins/mapper-attachments
 
  # Installed mapper-attachments into /usr/share/elasticsearch/plugins/mapper-attachments
</syntaxhighlight>  
+
</syntaxhighlight>
 +
 
 +
=== Note GH: upgradable Elasticsearch version 6x (2017) ===
 +
 
 +
To obtain an upgradable version using apt-get, use:
 +
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
 +
echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-6.x.list
 +
sudo apt-get update; sudo apt-get upgrade; sudo apt-get install elasticsearch;
 +
 
 +
However, this will currently install elasticsearch 6.x, which requires java 8 and may not be compatible with older mediawiki versions.
 +
This was tested in 2017-12, but ultimately not installed.
 +
 
 +
== TODOS ElasticSearch ==
 +
 
 +
PDF text indexing works with mapper-attachments 1.7 (and Extension: PdfHandler) but not for DOC files. Solution could be to use <code>copy_to</code> mapping and
 +
<syntaxhighlight lang="php">
 +
// LocalSettings.php
 +
$wgHooks['CirrusSearchMappingConfig'][] = function ( array &$config, $mappingConfigBuilder ) {
 +
  // ... add mapping here
 +
  $config['page']['properties']['file_attachment'] = [
 +
    'type' => 'attachment',
 +
    "fields" => [
 +
      "content" => [
 +
        "type" => "string",
 +
        "copy_to" => ["all", "file_text"],
 +
      ]
 +
    ]
 +
  ];
 +
};
 +
</syntaxhighlight>
 +
But this <code>copy_to</code> does not work somehow with 1.7.
 +
 
 +
In mapper-attachment 2.4 it is more complicated and not yet solved but needs updating the PHP extension code:
 +
* Follow perhaps [https://www.mediawiki.org/wiki/Topic:Rn3a2nl7f63wnuaj Topic on Extension talk:CirrusSearch “Search inside uploaded documents” (www.mediawiki.org)].
 +
* [https://www.mediawiki.org/wiki/Topic:Tvd2nqclcyf74j8m Topic Integrate mapper-attachment-plugin to Extension:CirrusSearch? (www.mediawiki.org)]
  
 
== Modifications for first install ==
 
== Modifications for first install ==
Line 93: Line 129:
  
  
=== ElasticSearch Configrations ===
+
=== ElasticSearch Configurations ===
 
   
 
   
 
Read and consult:
 
Read and consult:
 
; In general (ES_HEAP_SIZE): https://www.elastic.co/blog/a-heap-of-trouble
 
; In general (ES_HEAP_SIZE): https://www.elastic.co/blog/a-heap-of-trouble
; ElasticSearch Version 1.7 :  https://www.elastic.co/guide/en/elasticsearch/reference/1.7/setup-configuration.html<br />* https://www.elastic.co/guide/en/elasticsearch/reference/1.7/setup-service.html
+
; ElasticSearch Version 1.7 :  <ul><li>https://www.elastic.co/guide/en/elasticsearch/reference/1.7/setup-configuration.html<li>https://www.elastic.co/guide/en/elasticsearch/reference/1.7/setup-service.html</ul>
; ElasticSearch Version 2.4 : https://www.elastic.co/guide/en/elasticsearch/reference/2.4/setup-configuration.html<br />* https://www.elastic.co/guide/en/elasticsearch/reference/2.4/setup-service.html
+
; ElasticSearch Version 2.4 : <ul><li>https://www.elastic.co/guide/en/elasticsearch/reference/2.4/setup-configuration.html<li>https://www.elastic.co/guide/en/elasticsearch/reference/2.4/setup-service.html</ul>
  
 
Set some configuration variables
 
Set some configuration variables
Line 107: Line 143:
 
  #  affects also name of log file in /var/log/elasticsearch/, e.g. biowikifarm-prod.log
 
  #  affects also name of log file in /var/log/elasticsearch/, e.g. biowikifarm-prod.log
 
  # set node.name: ${HOSTNAME}
 
  # set node.name: ${HOSTNAME}
</syntaxhighlight>  
+
</syntaxhighlight>
  
 
=== Check for System Service ===
 
=== Check for System Service ===
Line 152: Line 188:
 
</syntaxhighlight>
 
</syntaxhighlight>
  
(1) Add this to LocalSettings.php:
+
=== Indexing Pages ===
 +
 
 +
(0) Include extensions first:
 +
<syntaxhighlight lang="php">
 +
require_once( "$IP/extensions/Elastica/Elastica.php" );
 +
require_once( "$IP/extensions/CirrusSearch/CirrusSearch.php" );
 +
</syntaxhighlight>
 +
 
 +
(1) Set <code>$wgDisableSearchUpdate</code>:
 
<syntaxhighlight lang="php">
 
<syntaxhighlight lang="php">
require_once( "$IP/extensions/Elastica/Elastica.php" );
+
$wgDisableSearchUpdate = true;
require_once( "$IP/extensions/CirrusSearch/CirrusSearch.php" );
+
$wgDisableSearchUpdate = true;
+
 
</syntaxhighlight>
 
</syntaxhighlight>
  
(2) There are other <code>$wgCirrusSearch</code> variables that you might want to change from their defaults. Read <code>/extensions/CirrusSearch/CirrusSearch.php</code>
+
(2) Set the variables you need to set: there are other <code>$wgCirrusSearch</code> variables that you might want to change from their defaults:
 +
* read <code>/extensions/CirrusSearch/CirrusSearch.php</code>
 +
* dig perhaps through MediaWiki settings (https://noc.wikimedia.org/conf/), e.g. https://noc.wikimedia.org/conf/highlight.php?file=CirrusSearch-common.php, https://noc.wikimedia.org/conf/highlight.php?file=CirrusSearch-production.php
  
 
(3) Set up elasticsearch properly and get it running.
 
(3) Set up elasticsearch properly and get it running.
  
(4) Now run this script to generate your elasticsearch index for a particular wiki. Checks basic things, version, creates elasticsearch instances to write to etc. but does not yet index the Wiki pages
+
(4) Generate the CirrusSearch configuration: run <code>updateSearchIndexConfig.php</code> to prepare indexing for a particular wiki. It checks basic things and versions, creates the elasticsearch indices to write to, etc., but does not yet index any wiki pages. Run this maintenance script again every time you change settings relevant to elasticsearch (mapping).
 +
<div class="mw-collapsible mw-collapsed" style="border-left:1px dotted black;padding-left:1em;">
 +
The detailed help info of <code>php ./extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --help --conf LocalSettings.php</code> gives (in REL1_26) …
 +
<div class="mw-collapsible-content" style="padding-left:1em;">
 +
Script specific parameters:
 +
; --baseName: What basename to use for all indexes, defaults to wiki id
 +
; --debugCheckConfig: Print the configuration as it is checked to help debug unexpected configuration mismatches.
 +
; --indexIdentifier: Set the identifier of the index to work on. You'll need this if you have an index in production serving queries and you have to alter some portion of its configuration that cannot safely be done without rebuilding it.  Once you specify a new indexIdentifier for this wiki you'll have to run this script with the same identifier each time.  Defaults to 'current' which infers the currently in use identifier.  You can also use 'now' to set the identifier to the current time in seconds which should give you a unique identifier.
 +
; --justAllocation: Just validate the shard allocation settings.  Use when you need to apply new cache warmers but want to be sure that you won't apply any other changes at an inopportune time.
 +
; --justCacheWarmers: Just validate that the cache warmers are correct and perform no additional checking.  Use when you need to apply new cache warmers but want to be sure that you won't apply any other changes at an inopportune time.
 +
; --reindexAcceptableCountDeviation: How much can the reindexed copy of an index is allowed to deviate from the current copy without triggering a reindex failure.  Defaults to 5%.
 +
; --reindexAndRemoveOk: If the alias is held by another index then reindex all documents from that index (via the alias) to this one, swing the alias to this index, and then remove other index.  Updates performed while this operation is in progress will be queued up in the job queue. Defaults to false.
 +
; --reindexChunkSize: Documents per shard to reindex in a batch.  Note when changing the number of shards that the old shard size is used, not the new one.  If you see many errors submitting documents in bulk but the automatic retry as singles works then lower this number. Defaults to 100.
 +
; --reindexProcesses: Number of processes to use in reindex.  Not supported on Windows.  Defaults to 1 on Windows and 5 otherwise.
 +
; --reindexRetryAttempts: Number of times to back off and retry per failure.  Note that failures are not common but if Elasticsearch is in the process of moving a shard this can time out.  This will retry the attempt after some backoff rather than failing the whole reindex process.  Defaults to 5.
 +
; --startOver: Blow away the identified index and rebuild it with no data.
 +
</div>
 +
</div>
 
<syntaxhighlight lang="bash">
 
<syntaxhighlight lang="bash">
 
  WIKI_PATH=/var/www/testwiki2/
 
  WIKI_PATH=/var/www/testwiki2/
Line 172: Line 233:
 
</syntaxhighlight>
 
</syntaxhighlight>
  
(5) Now remove $wgDisableSearchUpdate = true from LocalSettings.php. Updates should start heading to ElasticSearch.
+
(5) Now remove <code>$wgDisableSearchUpdate = true;</code> from LocalSettings.php. Updates should start heading to ElasticSearch.
 
<syntaxhighlight lang="php">
 
<syntaxhighlight lang="php">
# LocalSettings.php switch to false or remove it:
+
# LocalSettings.php switch to false or remove $wgDisableSearchUpdate
 
$wgDisableSearchUpdate = false;
 
$wgDisableSearchUpdate = false;
 
</syntaxhighlight>
 
</syntaxhighlight>
  
Next bootstrap the search index by running and indexing the actual Wiki pages. ''Note this will take some resources, especially on large wikis'':
+
Next bootstrap the search index by running and indexing the actual Wiki pages by a two-step indexing. ''Note this will take some resources, especially on large wikis'':
 
<syntaxhighlight lang="bash">
 
<syntaxhighlight lang="bash">
 
  WIKI_PATH=/var/www/testwiki2/
 
  WIKI_PATH=/var/www/testwiki2/
Line 185: Line 246:
 
  sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipParse --conf LocalSettings.php
 
  sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipParse --conf LocalSettings.php
 
</syntaxhighlight>
 
</syntaxhighlight>
Note that this can take some time.  For large wikis read “[[#Bootstrapping large wikis|Bootstrapping large wikis]]” below.
+
Note that this can take some time.  For large wikis (more than about 10,000 pages) read “[[#Bootstrapping large wikis|Bootstrapping large wikis]]” below.
  
 
(6) Once that is complete add this to <code>LocalSettings.php</code> to funnel queries to ElasticSearch:
 
(6) Once that is complete add this to <code>LocalSettings.php</code> to funnel queries to ElasticSearch:
Line 193: Line 254:
 
</syntaxhighlight>
 
</syntaxhighlight>
  
=== Bootstrapping or Indexing large wikis {{anchor|Bootstrapping large wikis}} ===
+
Note: removing  <code>$wgSearchType = 'CirrusSearch';</code> in <code>LocalSettings.php</code> switches back to the default search.
 +
 
 +
=== Re-Indexing ElasticSearch {{anchor|re-indexing elasticsearch}}===
 +
 
 +
Some setting changes require a complete re-indexing. Read the official documentation and the README of extension:CirrusSearch first, then follow these steps. If the re-indexing is not done as the README describes, it can cause RuntimeExceptions or searches that find nothing.
 +
 
 +
(1) Repeat steps similar to Indexing
 +
<syntaxhighlight lang="php">
 +
# LocalSettings.php
 +
$wgDisableSearchUpdate = true;
 +
# $wgSearchType = 'CirrusSearch'; # commented out: this usually switches back to the default search
 +
</syntaxhighlight>
 +
To have at least the default wiki search available, during this time, make sure the search text index is up to date:
 +
cd /path/to/v-wiki
 +
sudo -u www-data php ./maintenance/rebuildtextindex.php --conf ./LocalSettings.php
 +
 
 +
(2) Now run <code>updateSearchIndexConfig.php</code> to generate your elasticsearch index configuration for a particular wiki. Checks basic things, version, creates elasticsearch instances to write to etc. but does not yet index the Wiki pages:
 +
<syntaxhighlight lang="bash">
 +
WIKI_PATH=/var/www/testwiki2/
 +
# WIKI_PATH=/var/www/v-species/s/
 +
# WIKI_PATH=/var/www/v-species/o/
 +
cd $WIKI_PATH
 +
sudo -u www-data php ./extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php  --conf LocalSettings.php
 +
</syntaxhighlight>
 +
 
 +
(3) Now remove or set to ''false'' the <code>$wgDisableSearchUpdate = true</code> from LocalSettings.php. Updates should start heading to ElasticSearch.
 +
<syntaxhighlight lang="php">
 +
# LocalSettings.php switch to false or remove it:
 +
$wgDisableSearchUpdate = false;
 +
</syntaxhighlight>
 +
 
 +
(4) Repeat bootstrapping the search index by running and indexing the actual Wiki pages. This takes a considerable amount of CPU resources. One step at a time(!) follow the process on the terminal screen:
 +
<syntaxhighlight lang="bash">
 +
WIKI_PATH=/var/www/testwiki2/
 +
cd $WIKI_PATH
 +
# 1st STEP: takes a long time (for openmedia 3 to 4 days)
 +
sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipLinks --indexOnSkip --conf LocalSettings.php
 +
  # if it got interrupted, continue from a particular page ID, e.g. --fromId 180100
 +
  sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipLinks --indexOnSkip --fromId 180100 --conf LocalSettings.php
 +
# 2nd STEP: runs faster, skip parsing the page
 +
sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipParse --conf LocalSettings.php
 +
</syntaxhighlight>
 +
 
 +
For a large wiki, break the indexing down into smaller job scripts (see [[#Bootstrapping large wikis|Bootstrapping or Indexing Large Wikis]]).
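
The two passes in step (4) can be sketched as a small helper that only prints the command line, so it can be reviewed before it is run. This is a minimal sketch: the function name <code>index_pass_cmd</code> is hypothetical, and only the flags already shown above are used.

```shell
#!/bin/sh
# Hypothetical helper (not part of CirrusSearch): print the command line for
# one pass of the two-pass index build.
# $1 = pass (1 or 2), $2 = optional page ID to resume pass 1 from
index_pass_cmd() {
  script=./extensions/CirrusSearch/maintenance/forceSearchIndex.php
  case "$1" in
    1) opts="--skipLinks --indexOnSkip${2:+ --fromId $2}" ;;  # parse pages, skip link counting
    2) opts="--skipParse" ;;                                   # count links, skip parsing
    *) return 1 ;;
  esac
  echo "sudo -u www-data php $script $opts --conf LocalSettings.php"
}

# Usage from inside $WIKI_PATH (prints only, runs nothing):
# index_pass_cmd 1          # first pass from the beginning
# index_pass_cmd 1 180100   # first pass, resumed at page ID 180100
# index_pass_cmd 2          # second pass
```

Printing instead of executing keeps the "one step at a time" rule: each pass is started by hand once the previous one has finished.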
 +
 
 +
(5) Switch on CirrusSearch again:
 +
<syntaxhighlight lang="php">
 +
# LocalSettings.php
 +
$wgDisableSearchUpdate = false;
 +
$wgSearchType = 'CirrusSearch'; # use cirrus search again
 +
</syntaxhighlight>
 +
 
 +
=== Bootstrapping or Indexing Large Wikis {{anchor|Bootstrapping large wikis}} ===
  
(modified from README of extension:CirrusSearch REL1_26)
+
(modified from the README of extension:CirrusSearch REL1_26; the scripts are untested so far, as of 2017-10-25)
  
 
Since most of the load involved in indexing is parsing the pages in php we provide a few options to split the
 
Since most of the load involved in indexing is parsing the pages in php we provide a few options to split the
Line 205: Line 318:
 
<li> Make sure you have a good job queue setup.  It'll be doing most of the work.  In fact, Cirrus won't work well on large wikis without it.
 
<li> Make sure you have a good job queue setup.  It'll be doing most of the work.  In fact, Cirrus won't work well on large wikis without it.
 
<li> Generate scripts to add all the pages without link counts to the index.
 
<li> Generate scripts to add all the pages without link counts to the index.
<li> Execute them any way you like.
+
<li> Execute them any way you like. (On biowikifarm, for large wikis, I recommend a time when there is less traffic.)
 
<li> Generate scripts to count all the links.
 
<li> Generate scripts to count all the links.
 
<li> Execute them any way you like.
 
<li> Execute them any way you like.
Line 211: Line 324:
  
 
The procedure for large wikis (perhaps more than 10,000 pages) follows the README of extension:CirrusSearch. It is not an automated procedure: execute it step by step, manually checking the logs and the speed of the process in between.
 
The procedure for large wikis (perhaps more than 10,000 pages) follows the README of extension:CirrusSearch. It is not an automated procedure: execute it step by step, manually checking the logs and the speed of the process in between.
 +
 +
<div class="mw-collapsible mw-collapsed" style="border-left:1px dotted black;padding-left:1em;">
 +
Help info of <code>php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --help --conf LocalSettings.php</code> gives (in REL1_26) …
 +
<div class="mw-collapsible-content" style="padding-left:1em;">
 +
Script specific parameters:
 +
; --buildChunks: Instead of running the script spit out commands that can be farmed out to different processes or machines to rebuild the index.  Works with fromId and toId, not from and to.  If specified as a number then chunks no larger than that size are spat out.  If specified as a number followed by the word "total" without a space between them then that many chunks will be spat out sized to cover the entire wiki.
 +
; --deletes: If this is set then just index deletes, not updates or creates.
 +
; --from: Start date of reindex in YYYY-mm-ddTHH:mm:ssZ (exc. Defaults to 0 epoch.
 +
; --fromId: Start indexing at a specific page_id.  Not useful with <code>--deletes</code>.
 +
; --indexOnSkip: When skipping either parsing or links send the document as an index.  This replaces the contents of the index for that entry with the entry built from a skipped process.  Without this if the entry does not exist then it will be skipped entirely.  Only set this when running the first pass of building the index.  Otherwise, don’t tempt fate by indexing half complete documents.
 +
; --limit: Maximum number of pages to process before exiting the script. Default to unlimited.
 +
; --maxJobs: If there are more than this many index jobs in the queue then pause before adding more.  This is only checked every 3 seconds. Not meaningful without <code>--queue</code>.
 +
; --namespace: Only index pages in this given namespace
 +
; --pauseForJobs: If paused adding jobs then wait for there to be less than this many before starting again.  Defaults to the value specified for <code>--maxJobs</code>.  Not meaningful without <code>--queue</code>.
 +
; --queue: Rather than perform the indexes in process add them to the job queue.  Ignored for delete.
 +
; --skipLinks: Skip looking for links to the page (counting and finding redirects).  Use this with <code>--indexOnSkip</code> for the first half of the two phase index build.
 +
; --skipParse: Skip parsing the page.  This is really only good for running the second half of the two phase index build.  If this is specified then the default batch size is actually 50.
 +
; --to: Stop date of reindex in YYYY-mm-ddTHH:mm:ssZ.  Defaults to now.
 +
; --toId: Stop indexing at a specific page_id.  Not useful with <code>--deletes</code> or <code>--from</code> or <code>--to</code>.
 +
</div>
 +
</div>
 +
 +
Step 1:
 +
* generate 1st scripts for a wiki bootstrapping procedure
 +
* decide on a reasonable number of partial jobs and set <code>--maxJobs</code> and <code>--buildChunks</code> accordingly, in the range of 5,000 to 10,000 (larger values caused the queue to stall on biowikifarm)
  
 
<syntaxhighlight lang="bash">
 
<syntaxhighlight lang="bash">
# Step 1: generate 1st scripts for a wiki bootstrapping procedure
 
 
export N_SCRIPT_FILES=10 # or whatever number you want
 
export N_SCRIPT_FILES=10 # or whatever number you want
 
wiki="openmedia"
 
wiki="openmedia"
Line 264: Line 401:
  
 
Step 4:
 
Step 4:
Same as step 2 but for the new scripts. These scripts put more load on Elasticsearch so you might want to run
+
Same as step 2 but for the new scripts. These scripts put more load on Elasticsearch so you might want to run
 
them just one at a time if you don't have a huge Elasticsearch cluster or you want to make sure not to cause load
 
them just one at a time if you don't have a huge Elasticsearch cluster or you want to make sure not to cause load
 
spikes.
 
spikes.
  
If you don't have a good job queue you can try the above but lower the buildChunks parameter significantly and
+
If you don't have a good job queue you can try the above but lower the <code>--buildChunks</code> parameter significantly and
remove the --queue parameter.
+
remove the <code>--queue</code> parameter.
 +
 
 +
== Problem Solving and verification of function/health ==
 +
 
 +
ElasticSearch is managed via an HTTP API on localhost (port 9200 in the commands below). For this, consult the documentation of the “cluster API” (massive, detailed reports) and the “cat API” (brief statistics):
 +
; Version 1.7 : https://www.elastic.co/guide/en/elasticsearch/reference/1.7/cluster.html<br/>https://www.elastic.co/guide/en/elasticsearch/reference/1.7/cat.html
 +
; Version 2.4 : https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cluster.html<br/>https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat.html
 +
 
 +
General things to check:
 +
* is the system service running? (see <code>sudo service elasticsearch status</code>)
 +
* what is reported in the logs? (see /var/log/elasticsearch/)
 +
* what is the state of the data and of the elasticsearch cluster? (see the commands below)
 +
* you may try to [[#re-indexing elasticsearch|re-index the data]] with the maintenance scripts of extension:CirrusSearch
 +
 
 +
If nothing can be done to fix the data you have to
 +
# delete the ES data indices and
 +
# recreate and re-index all the data from scratch again (note: for openmedia it takes 3 to 4 days)
 +
 
 +
To test/check the health of all available indices from the command line run:
 +
<syntaxhighlight lang="bash">
 +
# list all the indices
 +
# **indices** command provides a cross-section of each index. This information spans nodes
 +
  curl -XGET 'http://localhost:9200/_cat/indices?v'
 +
</syntaxhighlight>
 +
 
 +
Check data cluster first with various commands:
 +
<syntaxhighlight lang="bash">
 +
# show the **cluster health** and status (green: all shards allocated, yellow: all primary shards allocated but some replicas not, red: some primary shards not allocated)
 +
  curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
 +
# show the **indices and status**
 +
  curl -XGET 'http://localhost:9200/_cat/indices?v'
 +
# show the **cluster state** (very detailed: nodes, metadata, routing table)
 +
  curl -XGET 'http://localhost:9200/_cluster/state?pretty=true'
 +
# _cat/recovery a view of index shard recoveries, both on-going and previously completed
 +
  curl -XGET 'http://localhost:9200/_cat/recovery?v'
 +
# show **pending tasks**
 +
  curl -XGET 'http://localhost:9200/_cat/pending_tasks?v'
 +
# **nodes**
 +
  curl -XGET 'http://localhost:9200/_nodes?pretty=true' # massive detailed node infos
 +
  curl -XGET 'http://localhost:9200/_nodes/stats?pretty=true'  # massive detailed node statistical infos
 +
  curl -XGET 'http://localhost:9200/_cat/nodes?v' # _cat/nodes command shows the cluster topology
 +
# **allocation** provides a snapshot of how shards have located around the cluster and the state of disk usage
 +
  curl -XGET 'http://localhost:9200/_cat/allocation?v'
 +
# **shards** command is the detailed view of what nodes contain which shards
 +
  curl -XGET 'http://localhost:9200/_cat/shards?v'
 +
</syntaxhighlight>
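
For scripted checks, the status word can be pulled out of the cluster health response. The following is a minimal sketch: the function name <code>health_status</code> is hypothetical, and the JSON is parsed with <code>sed</code> only to avoid extra dependencies. Note that on a single-node setup a yellow status can simply mean that replica shards have no second node to go to.

```shell
#!/bin/sh
# Hypothetical helper: extract the "status" value (green, yellow or red)
# from cluster-health JSON read on stdin. Handles compact and pretty output.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Usage on the server (commented out so the sketch needs no running cluster):
# status=$(curl -s 'http://localhost:9200/_cluster/health' | health_status)
# [ "$status" = "red" ] && echo "cluster is red, check /var/log/elasticsearch/" >&2
```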
 +
 
 +
If nothing helps delete individual indices:
 +
<syntaxhighlight lang="bash">
 +
# **delete** indices of e.g. onwiki
 +
  curl -XDELETE 'localhost:9200/onwiki_content?pretty'
 +
  curl -XDELETE 'localhost:9200/onwiki_content_first?pretty'
 +
</syntaxhighlight>
 +
… and do the [[#re-indexing elasticsearch|re-indexing procedure]]
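
Before deleting, it helps to list exactly which indices belong to the wiki in question. This is a minimal sketch, assuming the index names start with the wiki id followed by an underscore, as in the <code>onwiki_content</code> example above; the function name <code>indices_for_wiki</code> is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: from a list of index names (one per line on stdin)
# keep only those that start with "<wiki id>_".
# $1 = wiki id used as index prefix, e.g. onwiki
indices_for_wiki() {
  grep -E "^${1}_"
}

# Usage on the server: the h=index parameter of the cat API prints
# only the index-name column.
# curl -s 'http://localhost:9200/_cat/indices?h=index' | indices_for_wiki onwiki
```

Review the printed names and pass each one to the <code>curl -XDELETE</code> call by hand; deleting is not reversible.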
 +
 
 +
=== Handling ElasticSearch Outages ===
 +
 
 +
See README of extension CirrusSearch.
 +
 
 +
== Upgrading ==
 +
 
 +
See README of extension CirrusSearch.
  
[[Category:Extensions]]
+
[[Category: MediaWiki Extensions]]
 +
[[Category:Trouble shooting]]

Latest revision as of 01:30, 16 January 2018

Requirements

{| class="wikitable"
|-
! Requirement !! Extension:CirrusSearch<br />Extension:Elastica !! ElasticSearch
|-
| java, composer || < REL1_28 || 1.x
|-
| java, composer || > REL1_28 || 2.x
|}

Extension:CirrusSearch can handle only certain versions of elasticsearch, check the extension’s documentation and README for that. Do not install other versions of elasticsearch.

Details:

Installation ElasticSearch

Installing elasticsearch from a package manager repository is not recommended here, because MediaWiki Extension:CirrusSearch depends on specific versions of it and an upgrade may cause the search not to work properly.
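
A minimal sketch of a guard against a wrong or accidentally upgraded version, assuming a Debian-style system where the package is named <code>elasticsearch</code>. The helper names <code>es_major</code> and <code>check_es_major</code> are hypothetical and not part of any package.

```shell
#!/bin/sh
# Hypothetical helpers: compare the installed Elasticsearch major version
# with the one the installed CirrusSearch branch expects.

# Print the major version of a version string, e.g. "2.4.5" -> "2".
es_major() {
  printf '%s\n' "${1%%.*}"
}

# Succeed only if version string $1 has the expected major version $2.
check_es_major() {
  [ "$(es_major "$1")" = "$2" ]
}

# Usage on the server (commented out so the sketch has no side effects):
# installed=$(dpkg-query -W -f='${Version}' elasticsearch)
# check_es_major "$installed" 2 || echo "unexpected version: $installed" >&2
# sudo apt-mark hold elasticsearch   # keeps apt-get upgrade from replacing it
```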

ElasticSearch Version 1.7

 cd ~
 wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.6.deb
 sudo dpkg --install elasticsearch-1.7.6.deb
 # Unpacking elasticsearch (from elasticsearch-1.7.6.deb) ...
 # Setting up elasticsearch (1.7.6) ...

Add the plugin elasticsearch-mapper-attachments for searching files. You need a plugin version that is compatible with the installed elasticsearch version; read the documentation at https://github.com/elastic/elasticsearch-mapper-attachments

 sudo /usr/share/elasticsearch/bin/plugin install elasticsearch/elasticsearch-mapper-attachments/2.7.1

ElasticSearch Version 2.4

Go to https://www.elastic.co/de/downloads/past-releases, select product and version of 2.4-something

 cd ~
 wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.4.5/elasticsearch-2.4.5.deb
 # https://www.elastic.co/guide/en/elasticsearch/reference/2.4/release-notes-2.4.5.html
 sudo dpkg --install elasticsearch-2.4.5.deb
 # Unpacking elasticsearch (from elasticsearch-2.4.5.deb) ...
 # Creating elasticsearch group... OK
 # Creating elasticsearch user... OK
 # Setting up elasticsearch (2.4.5) ...

Plugins see https://www.elastic.co/guide/en/elasticsearch/plugins/2.4/index.html. Add elasticsearch plugin for file search:

 sudo /usr/share/elasticsearch/bin/plugin install mapper-attachments # deprecated in elasticsearch 5+
 # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 # @     WARNING: plugin requires additional permissions     @
 # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 # * java.lang.RuntimePermission getClassLoader
 # * java.lang.reflect.ReflectPermission suppressAccessChecks
 # * java.security.SecurityPermission createAccessControlContext
 # * java.security.SecurityPermission insertProvider
 # * java.security.SecurityPermission insertProvider.BC
 # * java.security.SecurityPermission putProviderProperty.BC
 # See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
 # for descriptions of what these permissions allow and the associated risks.
 # 
 # Continue with installation? [y/N] y
 # Installed mapper-attachments into /usr/share/elasticsearch/plugins/mapper-attachments

Note GH: upgradable Elasticsearch version 6x (2017)

To obtain an upgradable version using apt-get, use:

wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-6.x.list
sudo apt-get update; sudo apt-get upgrade; sudo apt-get install elasticsearch;

However, this will currently install elasticsearch 6.x, which requires java 8 and may not be compatible with older mediawiki versions. This was tested in 2017-12, but ultimately not installed.

TODOS ElasticSearch

PDF text indexing works with mapper-attachments 1.7 (and Extension: PdfHandler) but not for DOC files. Solution could be to use copy_to mapping and

// LocalSettings.php
$wgHooks['CirrusSearchMappingConfig'][] = function ( array &$config, $mappingConfigBuilder ) {
  // ... add mapping here
  $config['page']['properties']['file_attachment'] = [
    'type' => 'attachment',
    "fields" => [
      "content" => [
        "type" => "string",
        "copy_to" => ["all", "file_text"],
      ]
    ]
  ];
};

But this copy_to does not work somehow with 1.7.

In mapper-attachment 2.4 it is more complicated and not yet solved but needs updating the PHP extension code:

Modifications for first install

Check installed files of the package:

 dpkg-query -L elasticsearch # check installed files of the package
 # /etc/init.d/elasticsearch               - system service, also configuration and system variables
 # /usr/lib/sysctl.d/elasticsearch.conf    - contains system service configuration
 # /etc/elasticsearch/                     - contains elasticsearch configuration see elasticsearch.yml
 # /usr/share/elasticsearch/bin/plugin     - binary for plugin management
 # /usr/share/elasticsearch/NOTICE.txt     - documentation
 # /usr/share/elasticsearch/README.textile - documentation
 # /var/lib/elasticsearch                  - default data directory (set in /etc/init.d/elasticsearch)

Modify data directory (set in /etc/init.d/elasticsearch):

 if ! [[ -d /mnt/storage/var-lib-elasticsearch ]]; then 
   sudo mkdir /mnt/storage/var-lib-elasticsearch;
   sudo mv /var/lib/elasticsearch  /var/lib/elasticsearch-bak
   cd /var/lib # make elasticsearch store data to storage device, i.e. symlink it
   sudo ln -s --force /mnt/storage/var-lib-elasticsearch elasticsearch
   sudo chown elasticsearch:elasticsearch -R /mnt/storage/var-lib-elasticsearch
   sudo chown elasticsearch:elasticsearch -R elasticsearch # set symlink
   # sudo rm --interactive /var/lib/elasticsearch-bak
 fi
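
After moving the data directory, a quick check that the symlink resolves to the storage device can save a confusing first start. This is a minimal sketch: the function name <code>check_data_dir</code> is hypothetical, and <code>readlink -f</code> is assumed to be the GNU coreutils version.

```shell
#!/bin/sh
# Hypothetical check: succeed only if path $1 resolves (through symlinks)
# to the same directory as $2.
check_data_dir() {
  resolved=$(readlink -f "$1") || return 1
  expected=$(readlink -f "$2") || return 1
  [ "$resolved" = "$expected" ]
}

# Usage on the server (commented out so the sketch has no side effects):
# check_data_dir /var/lib/elasticsearch /mnt/storage/var-lib-elasticsearch \
#   && echo "data directory is on the storage device"
```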


ElasticSearch Configurations

Read and consult:

In general (ES_HEAP_SIZE)
https://www.elastic.co/blog/a-heap-of-trouble
ElasticSearch Version 1.7 
ElasticSearch Version 2.4 

Set some configuration variables

 sudo vi /etc/init.d/elasticsearch # set ES_HEAP_SIZE (how much is not yet clear, see the blog post above)
 sudo vi /etc/elasticsearch/elasticsearch.yml
 # set cluster.name: biowikifarm-prod
 #   affects also name of log file in /var/log/elasticsearch/, e.g. biowikifarm-prod.log
 # set node.name: ${HOSTNAME}
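
To confirm what was actually set, the uncommented name settings can be printed from the configuration file. This is a minimal sketch; the function name <code>show_es_names</code> is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the active (uncommented) cluster.name and
# node.name lines of an Elasticsearch configuration file.
# $1 = path to elasticsearch.yml
show_es_names() {
  grep -E '^[[:space:]]*(cluster|node)\.name[[:space:]]*:' "$1"
}

# Usage on the server:
# show_es_names /etc/elasticsearch/elasticsearch.yml
```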

Check for System Service

SysV init vs systemd: ElasticSearch is not started automatically after installation. How to start and stop ElasticSearch depends on whether your system uses SysV init or systemd (used by newer distributions). You can tell which is being used by running this command:

ps --pid 1 # init
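
The output of that command can be turned into a decision in a script. This is a minimal sketch: the function name <code>init_kind</code> is hypothetical, and a PID 1 named <code>init</code> is treated as SysV init, although on some older distributions it may be Upstart.

```shell
#!/bin/sh
# Hypothetical helper: map the command name of PID 1 to the init system.
# $1 = command name as printed by: ps --pid 1 -o comm=
init_kind() {
  case "$1" in
    systemd) echo systemd ;;
    init)    echo sysvinit ;;   # assumption: could also be Upstart
    *)       echo unknown ;;
  esac
}

# Usage on the server:
# init_kind "$(ps --pid 1 -o comm=)"
```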

Running ElasticSearch with SysV init: Use the update-rc.d command to configure ElasticSearch to start automatically when the system boots up:

 # update-rc.d name-of-service defaults Start-Sequence-Number  Kill-Sequence-Number
 sudo update-rc.d elasticsearch defaults 95 10

ElasticSearch can be started and stopped using the service command:

sudo -i service elasticsearch status
sudo -i service elasticsearch start
sudo -i service elasticsearch stop

If ElasticSearch fails to start for any reason, it will print the reason for failure to STDOUT. Log files can be found by default in /var/log/elasticsearch/
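
Since the service may need a moment before it answers, a script that continues with indexing should wait for the HTTP port first. This is a minimal sketch: the function name <code>wait_for_es</code> and the retry numbers are hypothetical, and port 9200 is the one used in the curl commands elsewhere on this page.

```shell
#!/bin/sh
# Hypothetical helper: poll the HTTP port until Elasticsearch answers
# or the attempts run out.
# $1 = base URL, $2 = number of attempts, $3 = seconds between attempts (default 2)
wait_for_es() {
  url=$1
  tries=$2
  pause=${3:-2}
  while [ "$tries" -gt 0 ]; do
    if curl --silent --fail --output /dev/null "$url"; then
      return 0                      # got an HTTP success response
    fi
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then
      sleep "$pause"                # wait before the next attempt
    fi
  done
  return 1                          # never answered
}

# Usage on the server:
# sudo -i service elasticsearch start && wait_for_es http://localhost:9200 30 \
#   || echo "not reachable, see /var/log/elasticsearch/" >&2
```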

Set up MediaWiki Extensions and Configurations

Read

Install or upgrade extensions via bash script create_or_upgrade_shared-wiki-extensions.sh to the version you need.

Then set up Extension:Elastica once:

 cd /usr/share/mediawiki26/extensions-rich-features/Elastica
 sudo -u www-data /usr/local/bin/composer.phar install --no-dev
 
 cd /usr/share/mediawiki26/extensions-simple-features/Elastica
 sudo -u www-data /usr/local/bin/composer.phar install --no-dev

Indexing Pages

(0) Include extensions first:

require_once( "$IP/extensions/Elastica/Elastica.php" );
require_once( "$IP/extensions/CirrusSearch/CirrusSearch.php" );

(1) Set $wgDisableSearchUpdate:

$wgDisableSearchUpdate = true;

(2) Set any other $wgCirrusSearch variables that you want to change from their defaults.

(3) Set up ElasticSearch properly and make sure it is running.

(4) Run updateSearchIndexConfig.php to generate the CirrusSearch index configuration and prepare indexing for a particular wiki. The script checks basic settings and versions and creates the ElasticSearch indices to write to, but it does not yet index any wiki pages. Run this maintenance script again every time you change settings relevant to ElasticSearch (e.g. the mapping).

The detailed help info of php ./extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --help --conf LocalSettings.php gives (in REL1_26) …

Script specific parameters:

--baseName
What basename to use for all indexes, defaults to wiki id
--debugCheckConfig
Print the configuration as it is checked to help debug unexpected configuration mismatches.
--indexIdentifier
Set the identifier of the index to work on. You'll need this if you have an index in production serving queries and you have to alter some portion of its configuration that cannot safely be done without rebuilding it. Once you specify a new indexIdentifier for this wiki you'll have to run this script with the same identifier each time. Defaults to 'current' which infers the currently in use identifier. You can also use 'now' to set the identifier to the current time in seconds which should give you a unique identifier.
--justAllocation
Just validate the shard allocation settings. Use when you need to apply new cache warmers but want to be sure that you won't apply any other changes at an inopportune time.
--justCacheWarmers
Just validate that the cache warmers are correct and perform no additional checking. Use when you need to apply new cache warmers but want to be sure that you won't apply any other changes at an inopportune time.
--reindexAcceptableCountDeviation
How much can the reindexed copy of an index is allowed to deviate from the current copy without triggering a reindex failure. Defaults to 5%.
--reindexAndRemoveOk
If the alias is held by another index then reindex all documents from that index (via the alias) to this one, swing the alias to this index, and then remove other index. Updates performed while this operation is in progress will be queued up in the job queue. Defaults to false.
--reindexChunkSize
Documents per shard to reindex in a batch. Note when changing the number of shards that the old shard size is used, not the new one. If you see many errors submitting documents in bulk but the automatic retry as singles works then lower this number. Defaults to 100.
--reindexProcesses
Number of processes to use in reindex. Not supported on Windows. Defaults to 1 on Windows and 5 otherwise.
--reindexRetryAttempts
Number of times to back off and retry per failure. Note that failures are not common but if Elasticsearch is in the process of moving a shard this can time out. This will retry the attempt after some backoff rather than failing the whole reindex process. Defaults to 5.
--startOver
Blow away the identified index and rebuild it with no data.
 WIKI_PATH=/var/www/testwiki2/
 # WIKI_PATH=/var/www/v-species/s/
 # WIKI_PATH=/var/www/v-species/o/
 cd $WIKI_PATH
 sudo -u www-data php ./extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php  --conf LocalSettings.php

(5) Now remove $wgDisableSearchUpdate = true; from LocalSettings.php. Updates should start heading to ElasticSearch.

# LocalSettings.php switch to false or remove $wgDisableSearchUpdate
$wgDisableSearchUpdate = false;

Next, bootstrap the search index by indexing the actual wiki pages in two passes. Note that this takes considerable resources, especially on large wikis:

 WIKI_PATH=/var/www/testwiki2/
 cd $WIKI_PATH
 sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipLinks --indexOnSkip --conf LocalSettings.php # takes a long time
 sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipParse --conf LocalSettings.php

Note that this can take some time. For large wikis (more than about 10,000 pages) read “Bootstrapping or Indexing Large Wikis” below.

(6) Once that is complete add this to LocalSettings.php to funnel queries to ElasticSearch:

# LocalSettings.php switch Searchtype
 $wgSearchType = 'CirrusSearch';

Note: removing $wgSearchType = 'CirrusSearch'; in LocalSettings.php switches back to the default search.
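
While working through steps (1) to (6) it is easy to lose track of whether search updates are still disabled. The sketch below checks a settings file for an active (not commented out) `$wgDisableSearchUpdate = true;` line. `search_update_disabled` is a hypothetical helper, and it assumes the setting is written on a single line as shown on this page.

```shell
# Hypothetical helper: succeed if the given settings file contains an
# uncommented line that sets $wgDisableSearchUpdate to true.
search_update_disabled() {
  grep -Eq '^[[:space:]]*\$wgDisableSearchUpdate[[:space:]]*=[[:space:]]*true[[:space:]]*;' "$1"
}
```

A possible use before switching `$wgSearchType`: `search_update_disabled LocalSettings.php && echo "updates are still disabled"`.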

Re-Indexing ElasticSearch

Some setting changes require a complete reindexing. Read the official documentation and recommendations first, then follow these steps. If the re-indexing is not done as the README describes, it can cause RuntimeExceptions or the search may find nothing.

(1) Repeat steps similar to Indexing

# LocalSettings.php
$wgDisableSearchUpdate = true;
# $wgSearchType = 'CirrusSearch'; # commenting this line out switches back to the default search

To have at least the default wiki search available, during this time, make sure the search text index is up to date:

cd /path/to/v-wiki
sudo -u www-data php ./maintenance/rebuildtextindex.php --conf ./LocalSettings.php

(2) Now run updateSearchIndexConfig.php to generate the ElasticSearch index configuration for a particular wiki. The script checks basic settings and versions and creates the ElasticSearch indices to write to, but it does not yet index the wiki pages:

 WIKI_PATH=/var/www/testwiki2/
 # WIKI_PATH=/var/www/v-species/s/
 # WIKI_PATH=/var/www/v-species/o/
 cd $WIKI_PATH
 sudo -u www-data php ./extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php  --conf LocalSettings.php

(3) Now remove $wgDisableSearchUpdate = true; from LocalSettings.php or set it to false. Updates should start heading to ElasticSearch.

# LocalSettings.php switch to false or remove it:
$wgDisableSearchUpdate = false;

(4) Repeat bootstrapping the search index by indexing the actual wiki pages. This takes a considerable amount of CPU resources. Run one step at a time(!) and follow the progress on the terminal screen:

 WIKI_PATH=/var/www/testwiki2/
 cd $WIKI_PATH
 # 1st STEP: takes a long time (openmedia: 3 to 4 days)
 sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipLinks --indexOnSkip --conf LocalSettings.php
   # if it got interrupted, continue from a particular page ID, e.g. --fromId 180100
   sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipLinks --indexOnSkip --fromId 180100 --conf LocalSettings.php
 # 2nd STEP: runs faster, skip parsing the page
 sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipParse --conf LocalSettings.php

If you have a large wiki you can break the indexing down into smaller job scripts (see Bootstrapping or Indexing Large Wikis).

(5) Switch on CirrusSearch again:

# LocalSettings.php
$wgDisableSearchUpdate = false;
$wgSearchType = 'CirrusSearch'; # use cirrus search again

Bootstrapping or Indexing Large Wikis

(modified from README of extension:CirrusSearch REL1_26 but scripts are untested so far 20171025)

Since most of the load involved in indexing is parsing the pages in php we provide a few options to split the process into multiple processes. Don't worry too much about the database during this process. It can generally handle more indexing processes than you are likely to be able to spawn.

General strategy:

  1. Make sure you have a good job queue setup. It'll be doing most of the work. In fact, Cirrus won't work well on large wikis without it.
  2. Generate scripts to add all the pages without link counts to the index.
  3. Execute them any way you like. (On biowikifarm, for large wikis I recommend a time when there is less traffic.)
  4. Generate scripts to count all the links.
  5. Execute them any way you like.

The procedure for large wikis (more than perhaps 10,000 pages) follows the README of extension:CirrusSearch. It is not an automated procedure: execute it step by step manually, checking the logs and the speed of the process in between.

Help info of php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --help --conf LocalSettings.php gives (in REL1_26) …

Script specific parameters:

--buildChunks
Instead of running the script spit out commands that can be farmed out to different processes or machines to rebuild the index. Works with fromId and toId, not from and to. If specified as a number then chunks no larger than that size are spat out. If specified as a number followed by the word "total" without a space between them then that many chunks will be spat out sized to cover the entire wiki.
--deletes
If this is set then just index deletes, not updates or creates.
--from
Start date of reindex in YYYY-mm-ddTHH:mm:ssZ (exc. Defaults to 0 epoch.
--fromId
Start indexing at a specific page_id. Not useful with --deletes.
--indexOnSkip
When skipping either parsing or links send the document as an index. This replaces the contents of the index for that entry with the entry built from a skipped process. Without this if the entry does not exist then it will be skipped entirely. Only set this when running the first pass of building the index. Otherwise, don’t tempt fate by indexing half complete documents.
--limit
Maximum number of pages to process before exiting the script. Default to unlimited.
--maxJobs
If there are more than this many index jobs in the queue then pause before adding more. This is only checked every 3 seconds. Not meaningful without --queue.
--namespace
Only index pages in this given namespace
--pauseForJobs
If paused adding jobs then wait for there to be less than this many before starting again. Defaults to the value specified for --maxJobs. Not meaningful without --queue.
--queue
Rather than perform the indexes in process add them to the job queue. Ignored for delete.
--skipLinks
Skip looking for links to the page (counting and finding redirects). Use this with --indexOnSkip for the first half of the two phase index build.
--skipParse
Skip parsing the page. This is really only good for running the second half of the two phase index build. If this is specified then the default batch size is actually 50.
--to
Stop date of reindex in YYYY-mm-ddTHH:mm:ssZ. Defaults to now.
--toId
Stop indexing at a specific page_id. Not useful with --deletes or --from or --to.

Step 1:

  • generate 1st scripts for a wiki bootstrapping procedure
  • decide on a reasonable number of part-jobs and set --maxJobs and --buildChunks accordingly, in the range of 5,000 to 10,000 (larger values caused the queue to halt midway on biowikifarm)
export N_SCRIPT_FILES=10 # or whatever number you want
wiki="openmedia"
WIKI_PATH=/var/www/v-species/o
if [[ -d  $WIKI_PATH ]]; then
  cd $WIKI_PATH
  if [[ -d cirrus_scripts ]]; then sudo rm -rf cirrus_scripts; fi;
  mkdir cirrus_scripts
  if ! [[ -d cirrus_log ]]; then mkdir cirrus_log ; fi;
  pushd cirrus_scripts
  # generate smaller page sets to work through, try different numbers depending on your experience
  sudo -u www-data php $WIKI_PATH/extensions/CirrusSearch/maintenance/forceSearchIndex.php --queue --maxJobs 5000 --pauseForJobs 1000 \
      --skipLinks --indexOnSkip --buildChunks 5000  --conf $WIKI_PATH/LocalSettings.php |
      sed -e 's/$/ | tee -a cirrus_log\/'$wiki'.parse.log/' |
      split --number r/$N_SCRIPT_FILES
      # did output N_SCRIPT_FILES of FILE to xaa, xab
  # randomly sort commands within a script FILE to xaa, xab
  for script in x*; do sort --random-sort $script > $script.sh && rm $script; done
  popd
  sudo chown www-data:www-data -R cirrus_scripts
  sudo chown www-data:www-data -R cirrus_log
fi
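
To see what the `split --number r/N` and `sort --random-sort` part of the script does without touching a wiki, the same pipeline can be tried on dummy lines. This is only a sketch, assuming GNU coreutils. `split_demo` and the `echo chunk` placeholder commands are made up for illustration and stand in for the real forceSearchIndex.php command lines.

```shell
# Hypothetical demo: distribute LINES dummy commands round-robin over N files
# (xaa, xab, ...) in DIR, then shuffle each file into a .sh script.
split_demo() {
  local dir="$1" n="$2" lines="$3"
  (
    cd "$dir" || exit 1
    seq 1 "$lines" | sed -e 's/^/echo chunk /' | split --number "r/$n"
    for script in x*; do
      sort --random-sort "$script" > "$script.sh" && rm "$script"
    done
  )
}
```

For example, `d=$(mktemp -d); split_demo "$d" 3 10; wc -l "$d"/*.sh` shows three scripts that together hold all ten lines. The shuffling only changes the order inside each file.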

Step 2: check and run generated scripts

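  1. Check that the generated scripts in $WIKI_PATH/cirrus_scripts look OK.
  2. Run all the scripts that step 1 generated. It is best to run them one by one in screen (or a similar tool), from the directory above cirrus_scripts. The generated files have no shebang line and are not marked executable, so call them through bash:
    cd $WIKI_PATH && sudo -u www-data bash ./cirrus_scripts/xaa.sh
    
    and check the logs in cirrus_log.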
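
Running the generated scripts one after another can be sketched as a small loop that stops at the first failing script, so the logs can be inspected before continuing. `run_scripts_in_order` is a hypothetical helper and, like the scripts above, untested on biowikifarm.

```shell
# Hypothetical helper: run every *.sh file in a directory in alphabetical order
# and stop as soon as one of them exits with a non-zero status.
run_scripts_in_order() {
  local dir="$1" script
  for script in "$dir"/*.sh; do
    [ -e "$script" ] || continue          # directory without scripts: nothing to do
    echo "running $script"
    bash "$script" || { echo "failed: $script" >&2; return 1; }
  done
}
```

On the server this would be started from the wiki directory, e.g. `cd $WIKI_PATH && sudo -u www-data bash -c "$(declare -f run_scripts_in_order); run_scripts_in_order ./cirrus_scripts"`, preferably inside screen.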

Step 3: generate 2nd part of bootstrapping

export N_SCRIPT_FILES=10 # or whatever number you want
wiki="openmedia"
WIKI_PATH=/var/www/v-species/o
if [[ -d  $WIKI_PATH ]]; then
  cd $WIKI_PATH
  pushd cirrus_scripts
  rm *.sh # remove old scripts
  # generate smaller page sets to work through, try different numbers depending on your experience
  sudo -u www-data php $WIKI_PATH/extensions/CirrusSearch/maintenance/forceSearchIndex.php --queue --maxJobs 5000 --pauseForJobs 1000 \
      --skipParse --buildChunks 5000  --conf $WIKI_PATH/LocalSettings.php |
      sed -e 's/$/ | tee -a cirrus_log\/'$wiki'.links.log/' |
      split --number r/$N_SCRIPT_FILES
      # did output N_SCRIPT_FILES of FILE to xaa, xab
  # randomly sort commands within a script FILE to xaa, xab
  for script in x*; do sort --random-sort $script > $script.sh && rm $script; done
  popd
  sudo chown www-data:www-data -R cirrus_scripts
fi

Step 4: Same as step 2 but for the new scripts. These scripts put more load on Elasticsearch so you might want to run them just one at a time if you don't have a huge Elasticsearch cluster or you want to make sure not to cause load spikes.

If you don't have a good job queue you can try the above but lower the --buildChunks parameter significantly and remove the --queue parameter.

Problem Solving and verification of function/health

ElasticSearch is managed via an HTTP API, here on http://localhost:9200. To investigate problems, consult the documentation of the “cluster API” (massive, detailed reports) and the “cat API” (brief statistics):

Version 1.7 
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/cluster.html
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/cat.html
Version 2.4 
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cluster.html
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat.html

General things to check:

  • is the system service running? (see sudo service elasticsearch status)
  • what is reported in the logs? (see /var/log/elasticsearch/)
  • what is the state of the data and of the ElasticSearch cluster? (see the curl commands below)
  • you may try to re-index the data with the maintenance scripts of extension:CirrusSearch

If nothing can be done to fix the data you have to

  1. delete the ES data indices and
  2. recreate and re-index all the data from scratch again (note: for openmedia it takes 3 to 4 days)

To test/check the health of all available indices from the command line run:

# list all the indices
# **indices** command provides a cross-section of each index. This information spans nodes
  curl -XGET 'http://localhost:9200/_cat/indices?v'

Check data cluster first with various commands:

# show the **cluster health** and status (green: all shards allocated, yellow: all primary shards allocated but some replicas are not, red: some primary shards are not allocated)
  curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
# show the **indices and status**
  curl -XGET 'http://localhost:9200/_cat/indices?v'
# show the **cluster state** (detailed metadata, routing table and nodes)
  curl -XGET 'http://localhost:9200/_cluster/state?pretty=true'
# _cat/recovery a view of index shard recoveries, both on-going and previously completed
  curl -XGET 'http://localhost:9200/_cat/recovery?v'
# show **pending tasks**
  curl -XGET 'http://localhost:9200/_cat/pending_tasks?v'
# **nodes**
  curl -XGET 'http://localhost:9200/_nodes?pretty=true' # massive detailed node infos
  curl -XGET 'http://localhost:9200/_nodes/stats?pretty=true'  # massive detailed node statistical infos
  curl -XGET 'http://localhost:9200/_cat/nodes?v' # _cat/nodes command shows the cluster topology
# **allocation** provides a snapshot of how shards have located around the cluster and the state of disk usage
  curl -XGET 'http://localhost:9200/_cat/allocation?v'
# **shards** command is the detailed view of what nodes contain which shards
  curl -XGET 'http://localhost:9200/_cat/shards?v'
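
For a quick scripted check, the status colour can be pulled out of the `_cluster/health` response. This is a minimal sketch that relies only on the `"status"` field of that JSON response. `health_status` is a hypothetical helper and uses sed rather than a real JSON parser.

```shell
# Hypothetical helper: read a _cluster/health JSON response on stdin and
# print the value of its "status" field (green, yellow or red).
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}
```

Usage: `curl -s 'http://localhost:9200/_cluster/health?pretty=true' | health_status`. An empty result means no status field was found, for example because ElasticSearch is not running.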

If nothing helps, delete the individual indices of the affected wiki:

# **delete** indices of e.g. onwiki
  curl -XDELETE 'localhost:9200/onwiki_content?pretty'
  curl -XDELETE 'localhost:9200/onwiki_content_first?pretty'

… and do the re-indexing procedure
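
Before deleting anything it helps to list exactly which indices belong to a wiki. The sketch below filters a plain list of index names, such as the cat API returns with the `h=index` column selection, by wiki prefix. `filter_wiki_indices` is a hypothetical helper, and the index names in the usage note are only examples.

```shell
# Hypothetical helper: read index names (one per line) on stdin and print
# those that start with "<wiki>_", without surrounding whitespace.
filter_wiki_indices() {
  local wiki="$1"
  awk -v prefix="${wiki}_" 'index($1, prefix) == 1 { print $1 }'
}
```

Usage: `curl -s 'http://localhost:9200/_cat/indices?h=index' | filter_wiki_indices onwiki` prints names such as onwiki_content_first, which can then be reviewed and deleted one by one with the curl -XDELETE commands above.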

Handling ElasticSearch Outages

See README of extension CirrusSearch.

Upgrading

See README of extension CirrusSearch.