Difference between revisions of "Elasticsearch"
Revision as of 13:56, 1 August 2017
Requirements
Requirement | Extension:CirrusSearch, Extension:Elastica | ElasticSearch |
---|---|---|
java, composer | below REL1_28 | 1.x |
java, composer | REL1_28 and later | 2.x |
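The version rule in the table can be written down as a small lookup. This is only a sketch: the function name `required_es_major` is made up for illustration, and it handles only branch names of the form `REL1_<minor>`; other branches (es2.x, master, wmf/...) are reported as unknown rather than guessed.

```shell
# Map a CirrusSearch release branch to the Elasticsearch major version
# from the requirements table. Hypothetical helper, not part of any extension.
required_es_major() {
  # $1: branch name such as REL1_26
  case $1 in
    REL1_[0-9]*)
      minor=${1#REL1_}
      # reject anything that is not purely numeric after "REL1_"
      case $minor in
        *[!0-9]*) echo "unknown"; return 0 ;;
      esac
      if [ "$minor" -lt 28 ]; then echo "1.x"; else echo "2.x"; fi
      ;;
    *) echo "unknown" ;;
  esac
}

required_es_major REL1_26
required_es_major REL1_28
```

Checking the branch before downloading a .deb avoids installing a major version that the checked-out CirrusSearch branch does not support.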
Details:
- Extension:CirrusSearch
  - requires ElasticSearch 1.x (below version REL1_28)
  - requires ElasticSearch 2.x since REL1_28, REL1_29, es2.x, es5, master, wmf/1.30.0-wmf.10, wmf/1.30.0-wmf.5, wmf/1.30.0-wmf.6, wmf/1.30.0-wmf.7, wmf/1.30.0-wmf.9
- Extension:Elastica, composer
Installation ElasticSearch
Installing elasticsearch through a package manager is not recommended here, because MediaWiki Extension:CirrusSearch depends on specific versions of it and an upgrade may cause the search to stop working properly.
ElasticSearch Version 1.7
cd ~
wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.6.deb
sudo dpkg --install elasticsearch-1.7.6.deb
# Unpacking elasticsearch (from elasticsearch-1.7.6.deb) ...
# Setting up elasticsearch (1.7.6) ...
Add the plugin elasticsearch-mapper-attachments to make file contents searchable. The plugin version must be compatible with the installed elasticsearch version; read the documentation at https://github.com/elastic/elasticsearch-mapper-attachments
sudo /usr/share/elasticsearch/bin/plugin install elasticsearch/elasticsearch-mapper-attachments/2.7.1
ElasticSearch Version 2.4
Go to https://www.elastic.co/de/downloads/past-releases and select the product and a 2.4.x version.
cd ~
wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.4.5/elasticsearch-2.4.5.deb
# https://www.elastic.co/guide/en/elasticsearch/reference/2.4/release-notes-2.4.5.html
sudo dpkg --install elasticsearch-2.4.5.deb
# Unpacking elasticsearch (from elasticsearch-2.4.5.deb) ...
# Creating elasticsearch group... OK
# Creating elasticsearch user... OK
# Setting up elasticsearch (2.4.5) ...
For plugins see https://www.elastic.co/guide/en/elasticsearch/plugins/2.4/index.html. Add the elasticsearch plugin for file search:
sudo /usr/share/elasticsearch/bin/plugin install mapper-attachments # deprecated in elasticsearch 5+
# @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# @ WARNING: plugin requires additional permissions @
# @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# * java.lang.RuntimePermission getClassLoader
# * java.lang.reflect.ReflectPermission suppressAccessChecks
# * java.security.SecurityPermission createAccessControlContext
# * java.security.SecurityPermission insertProvider
# * java.security.SecurityPermission insertProvider.BC
# * java.security.SecurityPermission putProviderProperty.BC
# See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
# for descriptions of what these permissions allow and the associated risks.
#
# Continue with installation? [y/N] y
# Installed mapper-attachments into /usr/share/elasticsearch/plugins/mapper-attachments
TODOS ElasticSearch
PDF text indexing works with mapper-attachments 1.7 (and Extension:PdfHandler), but DOC files are not indexed. A possible solution in 1.7 is to use a copy_to mapping and this hook:
// LocalSettings.php
$wgHooks['CirrusSearchMappingConfig'][] = function ( array &$config, $mappingConfigBuilder ) {
    // ... add mapping here
    $config['page']['properties']['file_attachment'] = [
        'type' => 'attachment',
        'fields' => [
            'content' => [
                'type' => 'string',
                'copy_to' => [ 'all', 'file_text' ],
            ]
        ]
    ];
};
With mapper-attachments 2.4 this is more complicated and not yet solved:
- Perhaps follow the topic on Extension talk:CirrusSearch “Search inside uploaded documents” (www.mediawiki.org).
- Topic “Integrate mapper-attachment-plugin to Extension:CirrusSearch?” (www.mediawiki.org)
Modifications for first install
Check installed files of the package:
dpkg-query -L elasticsearch # check installed files of the package
# /etc/init.d/elasticsearch - system service, also configuration and system variables
# /usr/lib/sysctl.d/elasticsearch.conf - contains system service configuration
# /etc/elasticsearch/ - contains elasticsearch configuration see elasticsearch.yml
# /usr/share/elasticsearch/bin/plugin - binary for plugin management
# /usr/share/elasticsearch/NOTICE.txt - documentation
# /usr/share/elasticsearch/README.textile - documentation
# /var/lib/elasticsearch - default data directory (set in /etc/init.d/elasticsearch)
Modify data directory (set in /etc/init.d/elasticsearch):
if ! [[ -d /mnt/storage/var-lib-elasticsearch ]]; then
  sudo mkdir /mnt/storage/var-lib-elasticsearch
  sudo mv /var/lib/elasticsearch /var/lib/elasticsearch-bak
  cd /var/lib # make elasticsearch store data to storage device, i.e. symlink it
  sudo ln -s --force /mnt/storage/var-lib-elasticsearch elasticsearch
  sudo chown elasticsearch:elasticsearch -R /mnt/storage/var-lib-elasticsearch
  sudo chown elasticsearch:elasticsearch -R elasticsearch # set symlink
  # sudo rm --interactive /var/lib/elasticsearch-bak
fi
ElasticSearch Configurations
Read and consult:
- In general (ES_HEAP_SIZE)
- https://www.elastic.co/blog/a-heap-of-trouble
- ElasticSearch Version 1.7
- ElasticSearch Version 2.4
Set some configuration variables
sudo vi /etc/init.d/elasticsearch # set ES_HEAP_SIZE (a suitable value has not been determined yet)
sudo vi /etc/elasticsearch/elasticsearch.yml
# set cluster.name: biowikifarm-prod
# affects also name of log file in /var/log/elasticsearch/, e.g. biowikifarm-prod.log
# set node.name: ${HOSTNAME}
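Since the right ES_HEAP_SIZE is left open above, here is a minimal sketch of one common rule of thumb from Elastic's general guidance: give the heap at most half of the physical RAM and stay below roughly 32 GB. Both the rule's fit for this server and the helper name `suggest_heap_mb` are assumptions; treat the result as a starting point, not a tested value.

```shell
# Suggest a heap size in MB: half of total RAM, capped at 31 GB (31744 MB).
# Hypothetical helper; the 50% / below-32-GB rule is a general guideline only.
suggest_heap_mb() {
  half=$(( $1 / 2 ))
  cap=31744
  if [ "$half" -gt "$cap" ]; then echo "$cap"; else echo "$half"; fi
}

# Example: derive the value from /proc/meminfo where available (Linux only)
if [ -r /proc/meminfo ]; then
  total_mb=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
  echo "ES_HEAP_SIZE=$(suggest_heap_mb "$total_mb")m"
fi
```

On a shared machine that also runs the web server and database, a smaller value than the one printed may be more appropriate.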
Check for System Service
SysV init vs systemd: ElasticSearch is not started automatically after installation. How to start and stop ElasticSearch depends on whether your system uses SysV init or systemd (used by newer distributions). You can tell which is being used by running this command:
ps --pid 1 # init
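The check above can be wrapped so that a script can branch on the result. This is a sketch with a hypothetical helper name (`classify_init`); it treats a PID 1 named `init` as SysV-style, which is an assumption, since other init implementations may report the same name.

```shell
# Classify the init system from the command name of PID 1.
# Hypothetical helper: "init" is taken to mean SysV-style init here.
classify_init() {
  case $1 in
    systemd) echo "systemd" ;;
    init)    echo "sysv" ;;
    *)       echo "unknown" ;;
  esac
}

# -o comm= prints only the command name, without a header line
pid1=$(ps -p 1 -o comm= 2>/dev/null | tr -d ' ')
echo "PID 1 is '${pid1}': $(classify_init "$pid1")"
```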
Running ElasticSearch with SysV init: Use the update-rc.d command to configure ElasticSearch to start automatically when the system boots up:
# update-rc.d name-of-service defaults Start-Sequence-Number Kill-Sequence-Number
sudo update-rc.d elasticsearch defaults 95 10
ElasticSearch can be started and stopped using the service command:
sudo -i service elasticsearch status
sudo -i service elasticsearch start
sudo -i service elasticsearch stop
If ElasticSearch fails to start for any reason, it will print the reason for failure to STDOUT. Log files can be found by default in /var/log/elasticsearch/
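Once the service is started, the installed version can be read from the JSON that ElasticSearch returns on its HTTP root endpoint. The sketch below assumes the default HTTP port 9200 on localhost has not been changed; the helper name `es_version_number` is made up for illustration and uses plain sed rather than a JSON parser.

```shell
# Extract the "number" field (the version) from the JSON on stdin.
# Hypothetical helper; a pattern match, not a full JSON parser.
es_version_number() {
  sed -n 's/.*"number"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage against a running node (assumes the default port 9200):
#   curl -s http://localhost:9200/ | es_version_number
```

Comparing the printed version with the one CirrusSearch expects is a quick sanity check before running any maintenance script.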
Set up MediaWiki Extensions and Configurations
Read
- https://www.mediawiki.org/wiki/Extension:CirrusSearch#Installation
- https://www.mediawiki.org/wiki/Extension:Elastica
- extensions/CirrusSearch/README
and follow those instructions carefully, to the letter.
Install or upgrade extensions via bash script create_or_upgrade_shared-wiki-extensions.sh to the version you need.
Then set up Extension:Elastica once:
cd /usr/share/mediawiki26/extensions-rich-features/Elastica
sudo -u www-data /usr/local/bin/composer.phar install --no-dev
cd /usr/share/mediawiki26/extensions-simple-features/Elastica
sudo -u www-data /usr/local/bin/composer.phar install --no-dev
Indexing Pages
(0) Include extensions first:
require_once( "$IP/extensions/Elastica/Elastica.php" );
require_once( "$IP/extensions/CirrusSearch/CirrusSearch.php" );
(1) Set $wgDisableSearchUpdate:
$wgDisableSearchUpdate = true;
(2) Set the variables you need: there are other $wgCirrusSearch variables that you might want to change from their defaults:
- read /extensions/CirrusSearch/CirrusSearch.php
- perhaps dig through MediaWiki settings (https://noc.wikimedia.org/conf/), e.g. https://noc.wikimedia.org/conf/highlight.php?file=CirrusSearch-common.php, https://noc.wikimedia.org/conf/highlight.php?file=CirrusSearch-production.php
(3) Set up elasticsearch properly and make sure it is running.
(4) Generate the CirrusSearch configuration: run updateSearchIndexConfig.php to prepare indexing for a particular wiki. It checks basic things and versions, creates the elasticsearch indices to write to, etc., but does not yet index any wiki pages.
WIKI_PATH=/var/www/testwiki2/
# WIKI_PATH=/var/www/v-species/s/
# WIKI_PATH=/var/www/v-species/o/
cd $WIKI_PATH
sudo -u www-data php ./extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --conf LocalSettings.php
(5) Now remove $wgDisableSearchUpdate = true; from LocalSettings.php. Updates should start heading to ElasticSearch.
# LocalSettings.php switch to false or remove $wgDisableSearchUpdate
$wgDisableSearchUpdate = false;
Next, bootstrap the search index by indexing the actual wiki pages. Note that this takes some resources, especially on large wikis:
WIKI_PATH=/var/www/testwiki2/
cd $WIKI_PATH
sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipLinks --indexOnSkip --conf LocalSettings.php # takes a long time
sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipParse --conf LocalSettings.php
Note that this can take some time. For large wikis (more than about 10,000 pages) read “Bootstrapping large wikis” below.
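The two indexing passes can be wrapped in one helper that stops if the first pass fails. This is a sketch only: the function name `bootstrap_index` and its dry-run default are assumptions made for illustration; the php commands themselves are the two shown above.

```shell
# Print (default) or run the two forceSearchIndex.php passes for one wiki.
# Usage: bootstrap_index WIKI_PATH [run]
# Hypothetical wrapper; without "run" it only prints the commands.
bootstrap_index() {
  wiki_path=$1
  mode=${2:-dry}
  script="$wiki_path/extensions/CirrusSearch/maintenance/forceSearchIndex.php"
  conf="$wiki_path/LocalSettings.php"
  # pass 1: parse pages without link counts; pass 2: count links
  for pass_opts in "--skipLinks --indexOnSkip" "--skipParse"; do
    if [ "$mode" = "run" ]; then
      # word splitting of $pass_opts is intended here
      sudo -u www-data php "$script" $pass_opts --conf "$conf" || return 1
    else
      echo "sudo -u www-data php $script $pass_opts --conf $conf"
    fi
  done
}

bootstrap_index /var/www/testwiki2
```

Running it first without `run` shows exactly what would be executed, which is useful when several wikis share one installation.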
(6) Once that is complete, add this to LocalSettings.php to funnel queries to ElasticSearch:
# LocalSettings.php switch Searchtype
$wgSearchType = 'CirrusSearch';
Note: removing $wgSearchType = 'CirrusSearch'; from LocalSettings.php switches back to the default search.
Re-Indexing ElasticSearch
Some setting changes require a complete reindexing. Read the official documentation and recommendations first, then follow these steps. If the re-indexing is not done as the README describes, it can cause RuntimeExceptions.
(1) Repeat steps from above
# LocalSettings.php
$wgDisableSearchUpdate = true;
# $wgSearchType = 'CirrusSearch'; # commented out: switches back to the default search
(2) Now run this script to generate the elasticsearch index for a particular wiki. It checks basic things and versions, creates the elasticsearch indices to write to, etc., but does not yet index the wiki pages:
WIKI_PATH=/var/www/testwiki2/
# WIKI_PATH=/var/www/v-species/s/
# WIKI_PATH=/var/www/v-species/o/
cd $WIKI_PATH
sudo -u www-data php ./extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --conf LocalSettings.php
(3) Now remove $wgDisableSearchUpdate = true from LocalSettings.php. Updates should start heading to ElasticSearch.
# LocalSettings.php switch to false or remove it:
$wgDisableSearchUpdate = false;
(4) Repeat bootstrapping the search index by indexing the actual wiki pages. This takes a considerable amount of CPU resources. Run one step at a time(!) and follow the process on the terminal screen:
WIKI_PATH=/var/www/testwiki2/
cd $WIKI_PATH
sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipLinks --indexOnSkip --conf LocalSettings.php # takes a long time
sudo -u www-data php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipParse --conf LocalSettings.php # faster
If you have a large wiki you can break the indexing down into smaller job scripts (see Bootstrapping or Indexing Large Wikis).
(5) Switch on CirrusSearch again:
# LocalSettings.php
$wgDisableSearchUpdate = false;
$wgSearchType = 'CirrusSearch'; # use cirrus search again
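Steps (1), (3) and (5) each flip a setting in LocalSettings.php by hand. A minimal sketch of scripting the $wgDisableSearchUpdate toggle follows; the helper name `set_disable_search_update` is hypothetical, and it assumes the assignment sits on a line of its own and is not commented out. Editing the real file needs write permission, so it may have to be run with sudo.

```shell
# Set $wgDisableSearchUpdate to true or false in a LocalSettings.php file.
# Usage: set_disable_search_update FILE true|false
# Hypothetical helper; rewrites an existing assignment or appends one.
set_disable_search_update() {
  file=$1
  value=$2
  case $value in
    true|false) ;;
    *) echo "value must be true or false" >&2; return 2 ;;
  esac
  if grep -q '^[[:space:]]*\$wgDisableSearchUpdate[[:space:]]*=' "$file"; then
    # keep the previous file as FILE.bak before rewriting the assignment
    sed -i.bak 's/^\([[:space:]]*\$wgDisableSearchUpdate[[:space:]]*=[[:space:]]*\)[A-Za-z]*;/\1'"$value"';/' "$file"
  else
    printf '$wgDisableSearchUpdate = %s;\n' "$value" >> "$file"
  fi
}

# Usage (path taken from the examples above):
#   set_disable_search_update /var/www/testwiki2/LocalSettings.php true
```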
Bootstrapping or Indexing Large Wikis
(modified from README of extension:CirrusSearch REL1_26)
Since most of the load involved in indexing is parsing the pages in PHP, we provide a few options to split the process into multiple processes. Don't worry too much about the database during this process. It can generally handle more indexing processes than you are likely to be able to spawn.
General strategy:
- Make sure you have a good job queue setup. It'll be doing most of the work. In fact, Cirrus won't work well on large wikis without it.
- Generate scripts to add all the pages without link counts to the index.
- Execute them any way you like. (on biowikifarm for large wikis I recommend a time, when there is less traffic)
- Generate scripts to count all the links.
- Execute them any way you like.
The procedure for large wikis (perhaps more than 10,000 pages) follows the README of extension:CirrusSearch. It is not an automated procedure: execute it step by step, manually checking the logs and the speed of the process in between.
Help info of php ./extensions/CirrusSearch/maintenance/forceSearchIndex.php --help --conf LocalSettings.php gives (in REL1_26) …
Script specific parameters:
- --buildChunks
- Instead of running the script spit out commands that can be farmed out to different processes or machines to rebuild the index. Works with fromId and toId, not from and to. If specified as a number then chunks no larger than that size are spat out. If specified as a number followed by the word "total" without a space between them then that many chunks will be spat out sized to cover the entire wiki.
- --deletes
- If this is set then just index deletes, not updates or creates.
- --from
- Start date of reindex in YYYY-mm-ddTHH:mm:ssZ (exc. Defaults to 0 epoch.
- --fromId
- Start indexing at a specific page_id. Not useful with --deletes.
- --indexOnSkip
- When skipping either parsing or links send the document as an index. This replaces the contents of the index for that entry with the entry built from a skipped process. Without this if the entry does not exist then it will be skipped entirely. Only set this when running the first pass of building the index. Otherwise, don’t tempt fate by indexing half complete documents.
- --limit
- Maximum number of pages to process before exiting the script. Default to unlimited.
- --maxJobs
- If there are more than this many index jobs in the queue then pause before adding more. This is only checked every 3 seconds. Not meaningful without --queue.
- --namespace
- Only index pages in this given namespace
- --pauseForJobs
- If paused adding jobs then wait for there to be less than this many before starting again. Defaults to the value specified for --maxJobs. Not meaningful without --queue.
- --queue
- Rather than perform the indexes in process add them to the job queue. Ignored for delete.
- --skipLinks
- Skip looking for links to the page (counting and finding redirects). Use this with --indexOnSkip for the first half of the two phase index build.
- --skipParse
- Skip parsing the page. This is really only good for running the second half of the two phase index build. If this is specified then the default batch size is actually 50.
- --to
- Stop date of reindex in YYYY-mm-ddTHH:mm:ssZ. Defaults to now.
- --toId
- Stop indexing at a specific page_id. Not useful with --deletes or --from or --to.
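The interplay of --maxJobs and --pauseForJobs described in the help text amounts to a two-threshold throttle. The sketch below is only one reading of that help text, not the script's actual implementation; the function name `next_state` and the state names are made up for illustration.

```shell
# Decide whether job submission should be running or paused.
# Usage: next_state STATE QUEUE_LEN MAX_JOBS [PAUSE_FOR_JOBS]
# Hypothetical model: pause above MAX_JOBS, resume below PAUSE_FOR_JOBS
# (which defaults to MAX_JOBS, as the help text says).
next_state() {
  state=$1
  queue_len=$2
  max_jobs=$3
  pause_for_jobs=${4:-$3}
  if [ "$state" = "running" ] && [ "$queue_len" -gt "$max_jobs" ]; then
    echo "paused"
  elif [ "$state" = "paused" ] && [ "$queue_len" -lt "$pause_for_jobs" ]; then
    echo "running"
  else
    echo "$state"
  fi
}

next_state running 6000 5000 1000
next_state paused 3000 5000 1000
next_state paused 999 5000 1000
```

With --maxJobs 5000 and --pauseForJobs 1000, submission stops once the queue exceeds 5000 jobs and resumes only after it has drained below 1000, which avoids rapid stop-and-go around a single threshold.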
Step 1:
- generate the first set of scripts for the wiki bootstrapping procedure
- decide on a reasonable size for the part-jobs and set --maxJobs and --buildChunks accordingly, in the range of 5,000 to 10,000 (larger values caused the queue to halt midway on biowikifarm)
export N_SCRIPT_FILES=10 # or whatever number you want
wiki="openmedia"
WIKI_PATH=/var/www/v-species/o
if [[ -d $WIKI_PATH ]]; then
  cd $WIKI_PATH
  if [[ -d cirrus_scripts ]]; then sudo rm -rf cirrus_scripts; fi
  mkdir cirrus_scripts
  if ! [[ -d cirrus_log ]]; then mkdir cirrus_log; fi
  pushd cirrus_scripts
  # generate smaller page sets to work through, try different numbers depending on your experience
  sudo -u www-data php $WIKI_PATH/extensions/CirrusSearch/maintenance/forceSearchIndex.php --queue --maxJobs 5000 --pauseForJobs 1000 \
    --skipLinks --indexOnSkip --buildChunks 5000 --conf $WIKI_PATH/LocalSettings.php |
    sed -e 's/$/ | tee -a cirrus_log\/'$wiki'.parse.log/' |
    split --number r/$N_SCRIPT_FILES
  # split has written N_SCRIPT_FILES files named xaa, xab, ...
  # randomly sort the commands within each file and rename it to xaa.sh, xab.sh, ...
  for script in x*; do sort --random-sort $script > $script.sh && rm $script; done
  popd
  sudo chown www-data:www-data -R cirrus_scripts
  sudo chown www-data:www-data -R cirrus_log
fi
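To see what the `split --number r/N` and `sort --random-sort` stages do without touching a wiki, the same pipeline can be tried on placeholder lines. This sketch assumes GNU coreutils (as on Debian); the function name `split_round_robin` is made up for illustration, and it writes its files into the current directory.

```shell
# Deal the lines on stdin round-robin into N files (xaa, xab, ...),
# then shuffle each file and rename it to *.sh, as in Step 1.
# Hypothetical helper; requires GNU split and GNU sort.
split_round_robin() {
  split --number "r/$1"
  for f in x??; do
    sort --random-sort "$f" > "$f.sh" && rm "$f"
  done
}

# Demo in a scratch directory: 10 placeholder lines dealt into 3 files
demo_dir=$(mktemp -d)
(
  cd "$demo_dir" || exit 1
  seq 1 10 | split_round_robin 3
  wc -l x*.sh
)
```

Round-robin dealing sends line 1 to xaa, line 2 to xab, line 3 to xac, line 4 to xaa again, and so on, so every script receives chunks from across the whole page_id range rather than one contiguous block.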
Step 2: check and run generated scripts
- check that the generated scripts in $WIKI_PATH/cirrus_scripts look OK
- Just run all the scripts that step 1 made. Best to run them one by one in screen or something similar, from the directory above (i.e. $WIKI_PATH), and check the logs
cd $WIKI_PATH && sudo -u www-data ./cirrus_scripts/xaa.sh
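Running the generated scripts one after another, and stopping at the first failure so the log can be inspected, could be sketched like this. The function name `run_scripts_in_order` is hypothetical; it runs each script with `sh` as the current user, so on a real wiki it would be started from $WIKI_PATH (the log path inside the scripts is relative) and as www-data.

```shell
# Run every *.sh file in a directory in name order; stop at the first failure.
# Usage: run_scripts_in_order DIR
# Hypothetical helper; not part of CirrusSearch.
run_scripts_in_order() {
  dir=$1
  for script in "$dir"/*.sh; do
    # an unmatched glob stays literal, so check that the file exists
    [ -e "$script" ] || { echo "no scripts in $dir" >&2; return 1; }
    echo "== running $script"
    sh "$script" || { echo "== $script failed, stopping" >&2; return 1; }
  done
}

# Usage (from the wiki directory):
#   cd $WIKI_PATH && run_scripts_in_order ./cirrus_scripts
```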
Step 3: generate 2nd part of bootstrapping
export N_SCRIPT_FILES=10 # or whatever number you want
wiki="openmedia"
WIKI_PATH=/var/www/v-species/o
if [[ -d $WIKI_PATH ]]; then
  cd $WIKI_PATH
  pushd cirrus_scripts
  rm *.sh # remove old scripts
  # generate smaller page sets to work through, try different numbers depending on your experience
  sudo -u www-data php $WIKI_PATH/extensions/CirrusSearch/maintenance/forceSearchIndex.php --queue --maxJobs 5000 --pauseForJobs 1000 \
    --skipParse --buildChunks 5000 --conf $WIKI_PATH/LocalSettings.php |
    sed -e 's/$/ | tee -a cirrus_log\/'$wiki'.links.log/' |
    split --number r/$N_SCRIPT_FILES
  # split has written N_SCRIPT_FILES files named xaa, xab, ...
  # randomly sort the commands within each file and rename it to xaa.sh, xab.sh, ...
  for script in x*; do sort --random-sort $script > $script.sh && rm $script; done
  popd
  sudo chown www-data:www-data -R cirrus_scripts
fi
Step 4: Same as step 2 but for the new scripts. These scripts put more load on Elasticsearch so you might want to run them just one at a time if you don't have a huge Elasticsearch cluster or you want to make sure not to cause load spikes.
If you don't have a good job queue you can try the above but lower the --buildChunks parameter significantly and remove the --queue parameter.
Handling ElasticSearch Outages
See README of extension CirrusSearch.
Upgrading
See README of extension CirrusSearch.