Elasticsearch

From Biowikifarm Metawiki
Revision as of 15:59, 24 July 2017 by Andreas Plank (Talk | contribs) (+Handling ElasticSearch Outages, Upgrading)

Requirements

Requirement      Extension:CirrusSearch, Extension:Elastica   ElasticSearch
java, composer   < REL1_28                                    1.x
java, composer   > REL1_28                                    2.x
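
The version pairing in the requirements above can be written down as a small lookup. This is only a sketch: `es_series_for_branch` is a hypothetical helper name, and REL1_28 itself is reported as unknown because the requirements only list branches below and above it.

```shell
# Map a MediaWiki extension branch (e.g. REL1_26) to the ElasticSearch
# series listed in the requirements. Hypothetical helper, not part of
# MediaWiki or CirrusSearch.
es_series_for_branch() {
  minor=${1#REL1_}                 # "REL1_26" -> "26"
  case $minor in
    ''|*[!0-9]*) echo "unknown"; return 1 ;;   # not of the form REL1_<number>
  esac
  if [ "$minor" -lt 28 ]; then
    echo "1.x"
  elif [ "$minor" -gt 28 ]; then
    echo "2.x"
  else
    # REL1_28 itself is not covered by the requirements; check its README
    echo "unknown"
    return 1
  fi
}
```

For example, `es_series_for_branch REL1_26` prints `1.x`.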

Details:


Installation ElasticSearch

Installing Elasticsearch from a package repository with automatic upgrades is not recommended here, because MediaWiki Extension:CirrusSearch depends on specific Elasticsearch versions and an upgrade may break the search. Download and install a fixed version instead.

ElasticSearch Version 1.7

 cd ~
 wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.6.deb
 sudo dpkg --install elasticsearch-1.7.6.deb
 # Unpacking elasticsearch (from elasticsearch-1.7.6.deb) ...
 # Setting up elasticsearch (1.7.6) ...

Add the plugin elasticsearch-mapper-attachments for searching files. The plugin version must be compatible with the installed Elasticsearch version; see the documentation at https://github.com/elastic/elasticsearch-mapper-attachments

 sudo /usr/share/elasticsearch/bin/plugin install elasticsearch/elasticsearch-mapper-attachments/2.7.1

ElasticSearch Version 2.4

Go to https://www.elastic.co/de/downloads/past-releases and select the product Elasticsearch and a 2.4.x version.

 cd ~
 wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.4.5/elasticsearch-2.4.5.deb
 # https://www.elastic.co/guide/en/elasticsearch/reference/2.4/release-notes-2.4.5.html
 sudo dpkg --install elasticsearch-2.4.5.deb
 # Unpacking elasticsearch (from elasticsearch-2.4.5.deb) ...
 # Creating elasticsearch group... OK
 # Creating elasticsearch user... OK
 # Setting up elasticsearch (2.4.5) ...

For plugins, see https://www.elastic.co/guide/en/elasticsearch/plugins/2.4/index.html. Add the Elasticsearch plugin for file search:

 sudo /usr/share/elasticsearch/bin/plugin install mapper-attachments # deprecated in elasticsearch 5+
 # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 # @     WARNING: plugin requires additional permissions     @
 # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 # * java.lang.RuntimePermission getClassLoader
 # * java.lang.reflect.ReflectPermission suppressAccessChecks
 # * java.security.SecurityPermission createAccessControlContext
 # * java.security.SecurityPermission insertProvider
 # * java.security.SecurityPermission insertProvider.BC
 # * java.security.SecurityPermission putProviderProperty.BC
 # See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
 # for descriptions of what these permissions allow and the associated risks.
 # 
 # Continue with installation? [y/N] y
 # Installed mapper-attachments into /usr/share/elasticsearch/plugins/mapper-attachments
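
Because CirrusSearch depends on a specific Elasticsearch series, it may help to verify the installed version after the installation. The following is a minimal sketch; `in_series` is a hypothetical helper name, and the `dpkg-query` call is shown only as a usage comment.

```shell
# in_series VERSION SERIES: succeed if VERSION (e.g. 2.4.5) belongs to
# SERIES (e.g. 2.4). Hypothetical helper for illustration.
in_series() {
  case $1 in
    "$2"|"$2".*) return 0 ;;
    *)           return 1 ;;
  esac
}

# Usage on the server (not run here):
#   installed=$(dpkg-query -W -f='${Version}' elasticsearch)
#   in_series "$installed" 2.4 || echo "unexpected Elasticsearch version: $installed"
```

The pattern requires a dot after the series, so 2.40.1 is not mistaken for a 2.4 release.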

TODOS ElasticSearch

Modifications for first install

Check installed files of the package:

 dpkg-query -L elasticsearch # check installed files of the package
 # /etc/init.d/elasticsearch               - system service, also configuration and system variables
 # /usr/lib/sysctl.d/elasticsearch.conf    - kernel parameters applied via sysctl
 # /etc/elasticsearch/                     - contains elasticsearch configuration see elasticsearch.yml
 # /usr/share/elasticsearch/bin/plugin     - binary for plugin management
 # /usr/share/elasticsearch/NOTICE.txt     - documentation
 # /usr/share/elasticsearch/README.textile - documentation
 # /var/lib/elasticsearch                  - default data directory (set in /etc/init.d/elasticsearch)

Modify data directory (set in /etc/init.d/elasticsearch):

 # if the service is already running, stop it first: sudo service elasticsearch stop
 if ! [[ -d /mnt/storage/var-lib-elasticsearch ]]; then
   sudo mkdir /mnt/storage/var-lib-elasticsearch
   # keep the original data directory (still empty on a first install) as a backup
   sudo mv /var/lib/elasticsearch /var/lib/elasticsearch-bak
   # make elasticsearch store its data on the storage device, i.e. symlink it
   sudo ln -s /mnt/storage/var-lib-elasticsearch /var/lib/elasticsearch
   sudo chown -R elasticsearch:elasticsearch /mnt/storage/var-lib-elasticsearch
   # set the owner of the symlink itself
   sudo chown --no-dereference elasticsearch:elasticsearch /var/lib/elasticsearch
   # sudo rm --recursive --interactive /var/lib/elasticsearch-bak
 fi
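
After moving the data directory it can be worth confirming that the symlink really resolves to the storage device. A minimal sketch, assuming GNU `readlink`; `points_to` is a hypothetical helper name.

```shell
# points_to LINK TARGET: succeed if LINK is a symbolic link that resolves
# to the same location as TARGET. Hypothetical helper for illustration.
points_to() {
  [ -L "$1" ] && [ "$(readlink -f "$1")" = "$(readlink -f "$2")" ]
}

# Usage on the server (not run here):
#   points_to /var/lib/elasticsearch /mnt/storage/var-lib-elasticsearch \
#     || echo "data directory is not symlinked to the storage device"
```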


ElasticSearch Configurations

Read and consult:

In general (ES_HEAP_SIZE)
  https://www.elastic.co/blog/a-heap-of-trouble
ElasticSearch Version 1.7
  https://www.elastic.co/guide/en/elasticsearch/reference/1.7/setup-configuration.html
  https://www.elastic.co/guide/en/elasticsearch/reference/1.7/setup-service.html
ElasticSearch Version 2.4
  https://www.elastic.co/guide/en/elasticsearch/reference/2.4/setup-configuration.html
  https://www.elastic.co/guide/en/elasticsearch/reference/2.4/setup-service.html

Set some configuration variables

 sudo vi /etc/init.d/elasticsearch # set ES_HEAP_SIZE; the right value is still undecided, see the heap article linked above
 sudo vi /etc/elasticsearch/elasticsearch.yml
 # set cluster.name: biowikifarm-prod
 #   affects also name of log file in /var/log/elasticsearch/, e.g. biowikifarm-prod.log
 # set node.name: ${HOSTNAME}
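
Since the right ES_HEAP_SIZE is left open above, the following sketch computes a starting point from a common rule of thumb in the Elasticsearch documentation: at most half of the RAM, and well below 32 GB. The rule is not from this page, the 31000 MB cap is an assumed safety margin, and `suggest_heap_mb` is a hypothetical helper name. On a server that also runs the wiki and the database, a smaller value is probably appropriate.

```shell
# suggest_heap_mb TOTAL_RAM_MB: print half of the RAM in MB, capped at
# 31000 MB (assumed margin below the 32 GB limit). Hypothetical helper.
suggest_heap_mb() {
  half=$(( $1 / 2 ))
  cap=31000
  if [ "$half" -gt "$cap" ]; then
    echo "$cap"
  else
    echo "$half"
  fi
}

# Usage on the server (not run here):
#   total_mb=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
#   echo "ES_HEAP_SIZE=$(suggest_heap_mb "$total_mb")m"
```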

Check for System Service

SysV init vs systemd: ElasticSearch is not started automatically after installation. How to start and stop ElasticSearch depends on whether your system uses SysV init or systemd (used by newer distributions). You can tell which is being used by running this command:

 ps --pid 1 # the CMD column shows init (SysV init) or systemd
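
The same check can be scripted. This sketch classifies the command name of PID 1; `classify_init` is a hypothetical helper name. Note that some Ubuntu releases run Upstart under the name init as well, so the result "sysv" is an approximation.

```shell
# classify_init NAME: map the command name of PID 1 to the init system.
# Hypothetical helper for illustration.
classify_init() {
  case $1 in
    systemd) echo "systemd" ;;
    init)    echo "sysv" ;;      # may also be Upstart on some Ubuntu releases
    *)       echo "unknown" ;;
  esac
}

# Usage on the server (not run here):
#   classify_init "$(ps --pid 1 -o comm=)"
```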

Running ElasticSearch with SysV init: Use the update-rc.d command to configure ElasticSearch to start automatically when the system boots up:

 # update-rc.d name-of-service defaults Start-Sequence-Number  Kill-Sequence-Number
 sudo update-rc.d elasticsearch defaults 95 10

ElasticSearch can be started and stopped using the service command:

 sudo -i service elasticsearch status
 sudo -i service elasticsearch start
 sudo -i service elasticsearch stop

If ElasticSearch fails to start for any reason, it will print the reason for failure to STDOUT. Log files can be found by default in /var/log/elasticsearch/
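
Once the service is started, the cluster health API shows whether Elasticsearch actually answers. The sketch below extracts the status field with sed only; `health_status` is a hypothetical helper name, and the address http://localhost:9200 is the Elasticsearch default, assumed to be unchanged here.

```shell
# health_status JSON: print the value of the "status" field of a cluster
# health response (green, yellow or red). Uses sed only, no JSON parser.
# Hypothetical helper for illustration.
health_status() {
  printf '%s\n' "$1" |
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' |
    head -n 1
}

# Usage on the server (not run here):
#   health_status "$(curl --silent http://localhost:9200/_cluster/health)"
```

On a single node, yellow is a common result, because replica shards cannot be assigned to a second node.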

Set up MediaWiki Extensions and Configurations

Read

Install or upgrade extensions via bash script create_or_upgrade_shared-wiki-extensions.sh to the version you need.

Then set up Extension:Elastica once:

 cd /usr/share/mediawiki26/extensions-rich-features/Elastica
 sudo -u www-data /usr/local/bin/composer.phar install --no-dev
 
 cd /usr/share/mediawiki26/extensions-simple-features/Elastica
 sudo -u www-data /usr/local/bin/composer.phar install --no-dev

(1) Add this to LocalSettings.php:

 require_once( "$IP/extensions/Elastica/Elastica.php" );
 require_once( "$IP/extensions/CirrusSearch/CirrusSearch.php" );
 $wgDisableSearchUpdate = true;

(2) There are other $wgCirrusSearch variables that you might want to change from their defaults. Read /extensions/CirrusSearch/CirrusSearch.php

(3) Set up Elasticsearch properly and make sure it is running.

(4) Now run this script to generate the Elasticsearch index for a particular wiki. It checks basic things such as the version and creates the Elasticsearch indexes to write to, but does not yet index the wiki pages:

 WIKI_PATH=/var/www/testwiki2/
 # WIKI_PATH=/var/www/v-species/s/
 # WIKI_PATH=/var/www/v-species/o/
 cd $WIKI_PATH
 sudo -u www-data php ./extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php  --conf LocalSettings.php

(5) Now remove $wgDisableSearchUpdate = true from LocalSettings.php. Updates should start heading to ElasticSearch.

 # LocalSettings.php: switch to false or remove it
 $wgDisableSearchUpdate = false;
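
Before starting the indexing runs, it can be useful to confirm that the setting is really no longer active. A minimal sketch; `search_updates_disabled` is a hypothetical helper name, and it only recognises the plain one-line form of the assignment.

```shell
# search_updates_disabled FILE: succeed if FILE contains an active
# (uncommented) line that sets $wgDisableSearchUpdate to true.
# Hypothetical helper for illustration.
search_updates_disabled() {
  grep -Eq '^[[:space:]]*\$wgDisableSearchUpdate[[:space:]]*=[[:space:]]*true[[:space:]]*;' "$1"
}

# Usage on the server (not run here):
#   if search_updates_disabled /var/www/testwiki2/LocalSettings.php; then
#     echo "search updates are still disabled"
#   fi
```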

Next, bootstrap the search index by indexing the actual wiki pages. Note that this takes considerable resources, especially on large wikis:

 WIKI_PATH=/var/www/testwiki2/
 cd $WIKI_PATH
 sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipLinks --indexOnSkip --conf LocalSettings.php # takes a long time
 sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipParse --conf LocalSettings.php

Note that this can take some time. For large wikis read “Bootstrapping large wikis” below.

(6) Once that is complete add this to LocalSettings.php to funnel queries to ElasticSearch:

 # LocalSettings.php: switch the search type
 $wgSearchType = 'CirrusSearch';

Bootstrapping or Indexing large wikis

(modified from README of extension:CirrusSearch REL1_26)

Since most of the load involved in indexing is parsing the pages in PHP, we provide a few options to split the process into multiple processes. Don't worry too much about the database during this process. It can generally handle more indexing processes than you are likely to be able to spawn.

General strategy:

  1. Make sure you have a good job queue setup. It'll be doing most of the work. In fact, Cirrus won't work well on large wikis without it.
  2. Generate scripts to add all the pages without link counts to the index.
  3. Execute them any way you like.
  4. Generate scripts to count all the links.
  5. Execute them any way you like.

The procedure for large wikis (with more than perhaps 10,000 pages) follows the README of Extension:CirrusSearch. It is not an automated procedure: execute it step by step and check the logs and the speed of the process in between.

Step 1: generate the first set of scripts for bootstrapping a wiki

 export N_SCRIPT_FILES=10 # or whatever number you want
 wiki="openmedia"
 WIKI_PATH=/var/www/v-species/o
 if [[ -d "$WIKI_PATH" ]]; then
   cd "$WIKI_PATH"
   if [[ -d cirrus_scripts ]]; then sudo rm -rf cirrus_scripts; fi
   mkdir cirrus_scripts
   if ! [[ -d cirrus_log ]]; then mkdir cirrus_log; fi
   pushd cirrus_scripts
   # generate smaller page sets to work through, try different numbers depending on your experience
   sudo -u www-data php "$WIKI_PATH"/extensions/CirrusSearch/maintenance/forceSearchIndex.php --queue --maxJobs 5000 --pauseForJobs 1000 \
       --skipLinks --indexOnSkip --buildChunks 5000 --conf "$WIKI_PATH"/LocalSettings.php |
       sed -e 's/$/ | tee -a cirrus_log\/'$wiki'.parse.log/' |
       split --number r/$N_SCRIPT_FILES
       # split writes N_SCRIPT_FILES files named xaa, xab, ...
   # shuffle the commands within each file and save them as xaa.sh, xab.sh, ...
   for script in x*; do sort --random-sort "$script" > "$script.sh" && rm "$script"; done
   popd
   sudo chown www-data:www-data -R cirrus_scripts
   sudo chown www-data:www-data -R cirrus_log
 fi
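
To see how split distributes the generated commands over the script files, the following sketch runs the same round-robin split on six sample lines in a temporary directory. It assumes GNU split; `demo_round_robin` and the sample lines cmd1 to cmd6 are made up for illustration.

```shell
# demo_round_robin: distribute six sample lines over three files the way
# "split --number r/3" does, then print each resulting file on one line.
# Runs in a subshell so the caller's working directory is untouched.
demo_round_robin() (
  tmp=$(mktemp -d) || exit 1
  cd "$tmp" || exit 1
  printf '%s\n' cmd1 cmd2 cmd3 cmd4 cmd5 cmd6 | split --number r/3
  for f in x*; do
    echo "$f: $(paste -s -d ' ' "$f")"
  done
  cd / && rm -rf "$tmp"
)
```

Calling `demo_round_robin` prints `xaa: cmd1 cmd4`, `xab: cmd2 cmd5` and `xac: cmd3 cmd6`, one per line: every file receives every third command, so the chunks of the wiki are spread evenly over the scripts.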

Step 2: check and run generated scripts

  1. Check that the generated scripts in $WIKI_PATH/cirrus_scripts look OK.
  2. Run all the scripts that step 1 made, preferably one by one inside screen or a similar tool, from the directory above cirrus_scripts. The generated files are not marked executable, so pass them to bash:
    cd $WIKI_PATH && sudo -u www-data bash ./cirrus_scripts/xaa.sh
    
    Then check the logs in cirrus_log.
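
Running the scripts one after another can also be wrapped in a small loop. This is a sketch only: `run_scripts` and the `RUNNER` variable are hypothetical names, and by default the scripts are run with plain sh.

```shell
# run_scripts DIR: run every x*.sh in DIR one after another, in name
# order, and stop at the first script that exits with an error.
# RUNNER (hypothetical variable) is the command used to run each script.
run_scripts() {
  for script in "$1"/x*.sh; do
    [ -e "$script" ] || { echo "no scripts found in $1" >&2; return 1; }
    echo "running $script"
    ${RUNNER:-sh} "$script" || { echo "failed: $script" >&2; return 1; }
  done
  echo "all scripts finished"
}

# Usage on the server (not run here), from the directory above cirrus_scripts:
#   cd "$WIKI_PATH" && RUNNER="sudo -u www-data bash" run_scripts cirrus_scripts
```

Each generated command ends in a pipe to tee, so the exit status of a script mostly reflects tee and not PHP. The loop therefore does not replace checking the logs.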

Step 3: generate 2nd part of bootstrapping

 export N_SCRIPT_FILES=10 # or whatever number you want
 wiki="openmedia"
 WIKI_PATH=/var/www/v-species/o
 if [[ -d "$WIKI_PATH" ]]; then
   cd "$WIKI_PATH"
   # cirrus_scripts was handed to www-data at the end of step 1; if it is
   # not writable for you, take it back first:
   # sudo chown -R "$USER" cirrus_scripts
   pushd cirrus_scripts
   rm *.sh # remove old scripts
   # generate smaller page sets to work through, try different numbers depending on your experience
   sudo -u www-data php "$WIKI_PATH"/extensions/CirrusSearch/maintenance/forceSearchIndex.php --queue --maxJobs 5000 --pauseForJobs 1000 \
       --skipParse --buildChunks 5000 --conf "$WIKI_PATH"/LocalSettings.php |
       sed -e 's/$/ | tee -a cirrus_log\/'$wiki'.parse.log/' |
       split --number r/$N_SCRIPT_FILES
       # split writes N_SCRIPT_FILES files named xaa, xab, ...
   # shuffle the commands within each file and save them as xaa.sh, xab.sh, ...
   for script in x*; do sort --random-sort "$script" > "$script.sh" && rm "$script"; done
   popd
   sudo chown www-data:www-data -R cirrus_scripts
 fi

Step 4: Same as step 2 but for the new scripts. These scripts put more load on Elasticsearch so you might want to run them just one at a time if you don't have a huge Elasticsearch cluster or you want to make sure not to cause load spikes.

If you don't have a good job queue you can try the above but lower the buildChunks parameter significantly and remove the --queue parameter.

Handling ElasticSearch Outages

See README of extension CirrusSearch.

Upgrading

See README of extension CirrusSearch.