Benchmark Performance Test

This section uses Esrally to test Elasticsearch performance.

Esrally is a benchmarking tool for Elasticsearch. It evaluates Elasticsearch performance under different environment configurations and workloads. A workload consists of the dataset and the test scenarios used in a benchmark, and it simulates a real-world usage pattern. In Esrally, a workload is called a track. This section uses two tracks: Geonames and Wikipedia.

Geonames: a geographical dataset that contains more than 11 million core information records and millions of alias records. It contains fields such as longitude and latitude coordinates, administrative division code, time zone, and population. The information in the dataset is stored in a structured format.
Wikipedia: a dataset that contains a large amount of text content, such as articles, paragraphs, and lists. It includes both structured data, such as titles, authors, and categories, and semi-structured data, such as tables, lists, and links in articles.

Test Prerequisites

Elasticsearch must be deployed before you run the test. For details, see Deploying Elasticsearch.

Installing Esrally

Install Esrally.
```
pip3 install esrally
```
After the installation is successful, run the following command to check the version.
```
esrally --version
```
View the available tracks.
```
esrally list tracks
```
This step automatically downloads the Esrally configuration file and saves it to /root/.rally.

Configure rally.ini.

mv /root/.rally /home/elasticsearch/
cd /home/elasticsearch/.rally
vim rally.ini

Modify the configuration file as follows, changing the value of datastore.host to the actual IP address of the server.

[meta]
config.version = 17

[system]
env.name = local

[node]
root.dir = ${CONFIG_DIR}/benchmarks
src.root.dir = ${CONFIG_DIR}/benchmarks/src

[source]
remote.repo.url = https://github.com/elastic/elasticsearch.git
elasticsearch.src.subdir = elasticsearch

[benchmarks]
local.dataset.cache = ${CONFIG_DIR}/benchmarks/data

[reporting]
datastore.type = elasticsearch
#datastore.host = localhost
datastore.host = X.X.X.X
datastore.port = 9200
datastore.secure = False
datastore.user =
datastore.password =


[tracks]
default.url = https://github.com/elastic/rally-tracks

[teams]
default.url = https://github.com/elastic/rally-teams

[defaults]
preserve_benchmark_candidate = false

[distributions]
release.cache = true
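
The datastore.host change can also be scripted instead of being made by hand in vim. The following is a minimal sketch under that assumption; the function name set_datastore_host and the address 192.0.2.10 are placeholders for illustration and are not part of Esrally.

```shell
# set_datastore_host FILE ADDRESS
# Rewrites the active (uncommented) datastore.host line in a rally.ini file.
# The function name is illustrative; it is not an Esrally command.
set_datastore_host() {
    ini_file="$1"
    address="$2"
    # Only lines that start with "datastore.host" are rewritten, so the
    # commented-out "#datastore.host = localhost" line is left untouched.
    sed "s|^datastore\.host[[:space:]]*=.*|datastore.host = ${address}|" \
        "$ini_file" > "${ini_file}.tmp" && mv "${ini_file}.tmp" "$ini_file"
}

# Example (192.0.2.10 is a placeholder address):
# set_datastore_host /home/elasticsearch/.rally/rally.ini 192.0.2.10
```

The sketch writes to a temporary file and renames it, which avoids relying on the in-place option of any particular sed implementation.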

Modifying Wikipedia Track

Modify the cluster health check status.
1. Go to the corresponding directory and edit the default.json file.
```
cd /home/elasticsearch/.rally/benchmarks/tracks/default/wikipedia/operations
vim default.json
```
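2. Change the value of wait_for_status to yellow. Elasticsearch is deployed on a single server, so the cluster contains only one node. With a single node, replica shards cannot be allocated, so any index that has replicas configured keeps the cluster health status at yellow rather than green.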
```
{
  "name": "check-cluster-health",
  "operation-type": "cluster-health",
  "request-params": {
    "wait_for_status": "yellow"
  },
  "retry-until-success": true
},
```
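
Before relying on the yellow setting, it can help to confirm what the cluster actually reports. The sketch below assumes the response of the Elasticsearch _cluster/health API is piped in on standard input; the function name health_status is hypothetical.

```shell
# health_status: reads a _cluster/health JSON response on stdin and prints
# the value of its "status" field (green, yellow, or red).
health_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Example against the single-node deployment used in this section:
# curl -s 'http://localhost:9200/_cluster/health' | health_status
```

This is plain text matching rather than real JSON parsing, which is enough for a quick check but not a general-purpose parser.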

Change the number of shards.

Go to the following directory:

cd /home/elasticsearch/.rally/benchmarks/tracks/default/wikipedia

Edit the wikipedia-full-mapping.json file.
```
vim wikipedia-full-mapping.json
```

Change the default number of shards to 15.

"number_of_shards": {{number_of_shards | default(15)}},

Edit the wikipedia-minimal-mapping.json file.
```
vim wikipedia-minimal-mapping.json
```

Change the default number of shards to 15.

"index.number_of_shards": {{number_of_shards | default(15)}}

Run the following commands to commit the changes to the local track repository:

cd /home/elasticsearch/.rally/benchmarks/tracks/default
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
git add wikipedia/operations/default.json
git add wikipedia/wikipedia-full-mapping.json
git add wikipedia/wikipedia-minimal-mapping.json
git commit -m "Modify"
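
The two shard-count edits described above could also be applied with sed before the commit step, instead of editing each mapping file in vim. This is a sketch only: the function name set_default_shards is hypothetical, and it assumes the mapping files contain the "number_of_shards | default(N)" template expression shown above.

```shell
# set_default_shards FILE COUNT
# Replaces the fallback value in "number_of_shards | default(N)" with COUNT.
# The function name is illustrative; it is not part of Esrally.
set_default_shards() {
    mapping_file="$1"
    count="$2"
    # \1 keeps the "number_of_shards | default(" prefix; only the digits change.
    sed "s/\(number_of_shards[[:space:]]*|[[:space:]]*default(\)[0-9]*)/\1${count})/" \
        "$mapping_file" > "${mapping_file}.tmp" && mv "${mapping_file}.tmp" "$mapping_file"
}

# Example, run from the wikipedia track directory:
# set_default_shards wikipedia-full-mapping.json 15
# set_default_shards wikipedia-minimal-mapping.json 15
```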

Running Esrally

Online running

Execute the following command to run Esrally with the Geonames track.

numactl -C 0-15 -m 0 esrally race --pipeline=benchmark-only --track=geonames --target-hosts=localhost:9200

This command downloads the dataset online and runs Esrally.

Offline running

Run the following commands as root to create a directory and set its ownership to the elasticsearch user.

mkdir -p /home/elasticsearch/.rally/benchmarks/tracks/default
chown -R elasticsearch:elasticsearch /home/elasticsearch/.rally

Download the datasets.

You can run the following commands to download the Geonames dataset:

mkdir -p /home/elasticsearch/.rally/benchmarks/data/geonames
cd /home/elasticsearch/.rally/benchmarks/data/geonames
curl -k https://rally-tracks.elastic.co/geonames/documents-2.json.bz2 > documents-2.json.bz2
curl -k https://rally-tracks.elastic.co/geonames/documents-2-1k.json.bz2 > documents-2-1k.json.bz2

You can run the following commands to download the Wikipedia dataset:

mkdir -p /home/elasticsearch/.rally/benchmarks/data/wikipedia
cd /home/elasticsearch/.rally/benchmarks/data/wikipedia
curl -k https://rally-tracks.elastic.co/wikipedia/documents.json.bz2 > documents.json.bz2
curl -k https://rally-tracks.elastic.co/wikipedia/documents-1k.json.bz2 > documents-1k.json.bz2
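
Because the archives are large and the downloads above skip certificate verification, it may be worth checking that each file is an intact bzip2 archive before starting an offline run. The following sketch uses the bzip2 -t integrity test; the function name verify_archives is hypothetical.

```shell
# verify_archives DIR
# Runs "bzip2 -t" (integrity test, no extraction) on every .bz2 file in DIR.
# Returns non-zero if any archive is corrupt or if no archive is found.
verify_archives() {
    data_dir="$1"
    rc=0
    found=0
    for archive in "$data_dir"/*.bz2; do
        # Skip the unexpanded pattern when the directory has no .bz2 files.
        [ -e "$archive" ] || continue
        found=1
        if bzip2 -t "$archive" 2>/dev/null; then
            echo "OK      $archive"
        else
            echo "CORRUPT $archive"
            rc=1
        fi
    done
    [ "$found" -eq 1 ] || rc=1
    return "$rc"
}

# Example:
# verify_archives /home/elasticsearch/.rally/benchmarks/data/geonames
```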

Run Esrally under the elasticsearch user (the Geonames track is used as an example).
```
numactl -C 0-15 -m 0 esrally race --pipeline=benchmark-only --offline --track=geonames --target-hosts=localhost:9200
```
If another Esrally instance is still running and prevents a new run, add the --kill-running-processes option so that the running instance is terminated before the new run starts:
```
numactl -C 0-15 -m 0 esrally race --pipeline=benchmark-only --offline --track=geonames --target-hosts=localhost:9200 --kill-running-processes
```
Start Elasticsearch before running Esrally, and run both under the elasticsearch user (do not run them under the root user).
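
The online and offline commands in this section differ only in the --offline option and the track name. As a convenience, a small helper can print the command line for a given track so that it can be reviewed before it is run. This is a sketch; the function name race_command is hypothetical, and the CPU range and target host are copied from the commands above.

```shell
# race_command TRACK [online|offline]
# Prints the esrally command used in this section for the given track.
# It only prints the command; it does not run the benchmark.
race_command() {
    track="$1"
    mode="${2:-online}"
    offline_flag=""
    if [ "$mode" = "offline" ]; then
        offline_flag=" --offline"
    fi
    echo "numactl -C 0-15 -m 0 esrally race --pipeline=benchmark-only${offline_flag} --track=${track} --target-hosts=localhost:9200"
}

# Example:
# race_command wikipedia offline
```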

Running Result Example

Figure 1 shows the run result of the Geonames track.

Figure 1 Run result of the Geonames track

Figure 2 shows the run result of the Wikipedia track.

Figure 2 Run result of the Wikipedia track

Key Metrics

**Table 1** Key metrics

| Metric | Description |
| --- | --- |
| Cumulative indexing time | Total time spent indexing all documents. It indicates the efficiency of indexing operations. |
| Cumulative merge time | Total time spent merging segment files. It indicates the efficiency of segment merge operations. |
| Cumulative refresh time | Total time spent on refresh operations, which write the in-memory indexing buffer to new segments so that recently indexed documents become searchable. It affects data visibility latency and is an important metric for evaluating near-real-time search performance. |
| Cumulative flush time | Total time spent flushing in-memory data to drives. It affects data durability and reliability and is a key metric for assessing system stability. |
| Cumulative merge throttle time | Total time during which segment merges were throttled. It reflects the resource pressure caused by merge operations and affects overall system performance. |
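
To compare these metrics across runs, the summary can be written to a file and filtered. The sketch below assumes the race was run with Rally's --report-format=csv and --report-file options and that the first CSV column holds the metric name; the function name cumulative_metrics and the report path are hypothetical.

```shell
# cumulative_metrics REPORT_CSV
# Prints the rows of a CSV summary report whose metric name starts with
# "Cumulative". Assumes the first comma-separated column is the metric name.
cumulative_metrics() {
    awk -F',' '$1 ~ /^Cumulative /' "$1"
}

# Example (the report path is a placeholder):
# cumulative_metrics /home/elasticsearch/geonames-report.csv
```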
