Elastic Devops Cheatsheet

Cheatsheet for Elastic Devops


Introduction

This is a quick cheat sheet based on Sematext's Elastic Devops Cheatsheet.

DevOps cheatsheet

Allocation

Allocation awareness

Avoids putting two copies of the same shard on nodes with the same attribute (e.g. rack, availability zone). For example:

node.attr.availability_zone: us-east1  # in elasticsearch.yml

Awareness is enabled at the cluster level:

curl -XPUT localhost:9200/_cluster/settings?pretty -d '{
  "persistent" : {
    "cluster.routing.allocation.awareness.attributes" : "availability_zone"
  }
}'

Allocation filtering

Shards of an index can prefer/avoid nodes with certain attributes. Good for having hot/cold tiers:

node.attr.temperature: hot  # in elasticsearch.yml

At index creation, you can assign shards to the hot nodes:

curl -XPUT localhost:9200/logs01 -d '{
  "settings": {
    "index.routing.allocation.include.tag": "hot"
  }
}'

Later on, you can change this value to cold to move the shards to nodes having temperature set to cold.
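
For example, a sketch of such an update, assuming the attribute is named temperature as in the elasticsearch.yml line above:

curl -XPUT localhost:9200/logs01/_settings -d '{
  "index.routing.allocation.include.temperature": "cold"
}'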

Delayed allocation

Avoids the domino effect of relocation when a node is restarted or temporarily unavailable:

curl -XPUT localhost:9200/$INDEX/_settings -d '{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}'

Caches

Query cache

Defaults to 10% of heap:

indices.queries.cache.size: 7%  # in elasticsearch.yml

By default, queries running in the filter context will be cached if they run repeatedly, and only on larger segments. You can override this and cache everything in elasticsearch.yml:

index.queries.cache.everything: true

Request cache

Caches results of aggregations on indices that haven’t changed. Defaults to 1% of heap:

indices.requests.cache.size: 2%

Indexing buffer

A node-level buffer for indexing, before a flush will commit to disk. Defaults to 10% of heap:

indices.memory.index_buffer_size: 5%

Page recycler

Big arrays used by aggregations are put here so they can be reused. Defaults to 10% of heap:

cache.recycler.page.limit.heap: 5%

Field data

The only way to do sorting/aggregations on text fields. Avoid it if possible. If not, limit it through per-request circuit breakers:

indices.breaker.fielddata.limit: 10%

And by limiting the overall size:

indices.fielddata.cache.size: 20%
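
If you really need it, field data has to be enabled explicitly in the mapping of the text field. A minimal sketch, assuming Elasticsearch 5.x and illustrative index/type/field names (logs01, log, message):

curl -XPUT localhost:9200/logs01/_mapping/log -d '{
  "properties": {
    "message": {
      "type": "text",
      "fielddata": true
    }
  }
}'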

Merges

Force merge

Might be worth merging indices that don’t change into a handful of big segments:

curl -XPOST localhost:9200/$INDEX/_forcemerge?max_num_segments=5

A new compression level only applies to segments written afterwards, so set the codec before force merging:

curl -XPUT localhost:9200/$INDEX/_settings -d '{
  "index.codec": "best_compression"
}'

Thread pools

size = number of parallel requests, and queue_size = number of waiting requests:

thread_pool.search.size: 8
thread_pool.search.queue_size: 5000
thread_pool.bulk.size: 12
thread_pool.bulk.queue_size: 500

Merge policy

There are multiple knobs here. Most importantly:

  • Segments per tier. Defaults to 10. Higher values allow for more segments, giving better indexing throughput at the expense of search latency, disk space, memory and open file handles.
  • Max merge at once. Defaults to 10. Lower values lower the impact of merging, but will make the process slower (which can potentially throttle indexing).
  • Max merged segment. Defaults to 5GB. Lower values result in fewer merges of large segments, but require more merges of small segments, trading spikes for overall load.

curl -XPUT localhost:9200/$INDEX/_settings -d '{
  "index.merge.policy": {
    "segments_per_tier": 50,
    "max_merge_at_once": 50,
    "max_merged_segment": "1gb"
  }
}'

Shrink index

Shrink an index into a new one with fewer shards (the new number of shards must be a factor of the current number):

curl -XPOST localhost:9200/logs01/_shrink/logs01_shrinked  -d '{
  "settings": {
    "index.number_of_shards": 1
  }
}'
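
Before shrinking, a copy of every shard has to sit on the same node and the index has to be blocked for writes. A sketch of that preparation step (NODE_NAME is a placeholder):

curl -XPUT localhost:9200/logs01/_settings -d '{
  "settings": {
    "index.routing.allocation.require._name": "NODE_NAME",
    "index.blocks.write": true
  }
}'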

Troubleshooting: get info

Cat health

curl localhost:9200/_cat/health?v

v is for “verbose”, shows column headers. Gives number of nodes, shards (started, initializing, relocating) and cluster color:

  • green: all primaries and replicas are up
  • yellow: all primaries are up, but not all replicas
  • red: not all primaries are up

Cat nodes

curl localhost:9200/_cat/nodes?v

Shows figures like load and heap usage of nodes. The help parameter lists all available columns, which you can then select via the h parameter to get other metrics.
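
For example, a sketch selecting a few common columns (column names may vary slightly between versions):

curl 'localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu'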

_cat/allocation

How many shards are on each node and how much disk space they take (vs free space).
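It takes the usual cat parameters (v for headers):

curl localhost:9200/_cat/allocation?v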

_cat/indices

How big is each index; how many shards and replicas it has.
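You can also filter the output, e.g. to show only the yellow indices via the health parameter:

curl 'localhost:9200/_cat/indices?v&health=yellow'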

_cat/shards

How big is each shard and on which node it is. Shows whether a shard is STARTED, UNASSIGNED, INITIALIZING or RELOCATING. You can easily grep through those values when you have many shards.
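
For example, a quick way to list only the shards that aren't started (just a grep over the output, as mentioned above):

curl localhost:9200/_cat/shards | grep -v STARTED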

_cat/segments

How big each segment in each shard is (including memory usage). You can filter by index, for example:

curl localhost:9200/_cat/segments/$INDEX?v

If you look at the files, you’ll see different extensions. Most importantly (in terms of memory and storage):

  • .cfs, .cfe: These are compound segments
  • .fdt: Stored fields (like _source)
  • .tim: Term dictionary, used when searching in indexed fields
  • .doc: Frequency of each term in each document (for scoring)
  • .pos: Positional information (for phrase searches)
  • .pay: Payloads, most notably character offsets (for the Postings-based highlighters)
  • .nvd, .nvm: Field lengths (a.k.a. norms, used for scoring)
  • .dvd, .dvm: Doc values (used for sorting and aggregations)
  • .tv?: Term vectors (used for the term-vector-based highlighters)
  • .dii, .dim: Point values (for geo fields as well as numerics)

More information can be found in the Lucene file formats documentation (the link is for Lucene 6.x, which Elasticsearch 5.x uses; for other Elasticsearch versions you may need to change the version in the URL): https://lucene.apache.org/core/6_0_0/core/org/apache/lucene/codecs/lucene60/package-summary.html

_cat/pending_tasks

In-progress operations in your cluster. You'd typically catch long-running ones (e.g. snapshot, force merge) or the ones that get queued up when the cluster is in trouble and the master gets overloaded (e.g. lots of mapping/cluster state updates).

_cat/thread_pool

How many threads are active (working on searches, bulk indexing and so on). You can also see how many requests are enqueued (queue) compared to queue_size and how many were rejected (usually because the queue was full).
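
For example, a sketch that narrows the thread pool output to the most useful columns (column names as in 5.x; they may differ in other versions):

curl 'localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected'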

_cat/fielddata

How much heap field data (the in memory equivalent of doc values) takes. Per field, per node.
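
Same usage pattern as the other cat APIs; you can also restrict it to specific fields by appending them to the URL (FIELD_NAME is a placeholder):

curl localhost:9200/_cat/fielddata/FIELD_NAME?v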

Nodes stats

curl localhost:9200/_nodes/stats?pretty

Gives back statistics for all nodes in the cluster. You can also filter nodes, e.g. _nodes/_local/stats returns stats just for the current node. Relevant metrics include:

  • How much time was spent in queries, fetches, indexing, merging, etc.
  • How much memory current segments take, broken down by type (e.g. term dictionary, doc values), which is a good indicator of the live set
  • Current and maximum heap usage per pool, which is a good indicator of memory pressure
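
For example, a sketch that limits the output to the local node and to the jvm and indices metric groups:

curl 'localhost:9200/_nodes/_local/stats/jvm,indices?pretty'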

Nodes hot threads

curl localhost:9200/_nodes/hot_threads

Tells you what's keeping Elasticsearch busy. Add type=wait or type=block to see what's keeping it from being busy. You can also filter nodes, like _nodes/_local/hot_threads.
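
For example (the threads parameter controls how many busy threads per node are shown; 5 here is just an illustrative value):

curl 'localhost:9200/_nodes/_local/hot_threads?type=wait&threads=5'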

OS stats

top, iotop, dstat, iostat help figure out what the bottleneck is. Usually:

  • Aggregations are CPU-intensive and memory-intensive. The latter may translate into high GC activity (check the logs for long GC events)
  • Full-text search (without aggregations) is sensitive to IO latency
  • Indexing (especially merging) is CPU-intensive and IO-throughput-intensive
  • Snapshots, replication and relocation are network- and disk-intensive
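
A few illustrative invocations (tool availability depends on your distribution; iostat typically comes from the sysstat package):

# overall CPU, memory and load
top
# per-device IO utilization and latency, refreshed every second
iostat -x 1
# which processes are doing the IO
iotop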

Cluster allocation explain

Shows the decisions that prevent a particular shard from being allocated on each node:

curl localhost:9200/_cluster/allocation/explain?pretty -d'{
  "index": "INDEX_NAME",
  "shard": 0,
  "primary": true
}'

It also accepts a node name in the body, to show the explanation only for that node.

Indices shard stores

curl localhost:9200/$INDEX/_shard_stores?pretty

Returns the last exception that occurred while opening shards of this index.

Troubleshooting actions

Total shards per node

How many shards an index can have on each node (good for force-balancing the cluster):

curl -XPUT localhost:9200/$INDEX/_settings -d '{
  "index.routing.allocation.total_shards_per_node": 2
}'

Disk allocation thresholds

Prevents nodes from running out of disk.

Low watermark: when to stop allocating new shards. High watermark: when to relocate existing shards.

curl -XPUT localhost:9200/_cluster/settings -d '{
    "persistent" : {
        "cluster.routing.allocation.disk.watermark.low" : "70%",
        "cluster.routing.allocation.disk.watermark.high" : "85%"
    }
}'

Shard reroute (allocate, move and cancel)

Allows you to manually allocate a shard, cancel a shard recovery/relocation, or move a shard:

curl -XPOST localhost:9200/_cluster/reroute -d '{
    "commands" : [ {
        "move" :
            {
              "index" : "INDEX_NAME", "shard" : SHARD_NUMBER,
              "from_node" : "SOURCE_NODE", "to_node" : "DESTINATION_NODE"
            }
        }
    ]
}'
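
A cancel command follows the same shape (placeholders as above; cancelling a primary's recovery additionally needs allow_primary set to true):

curl -XPOST localhost:9200/_cluster/reroute -d '{
    "commands" : [ {
        "cancel" :
            {
              "index" : "INDEX_NAME", "shard" : SHARD_NUMBER,
              "node" : "NODE_NAME"
            }
        }
    ]
}'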

Concurrent replications, relocations and bandwidth

How many concurrent shard recoveries (replications or relocations) each node can handle:

curl -XPUT localhost:9200/_cluster/settings?pretty -d '{
  "persistent" : {
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}'

How many shards can move around, cluster-wide:

curl -XPUT localhost:9200/_cluster/settings?pretty -d '{
  "persistent" : {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2
  }
}'

How much bandwidth can recovery/rebalancing take:

curl -XPUT localhost:9200/_cluster/settings?pretty -d '{
  "persistent" : {
    "indices.recovery.max_bytes_per_sec": "20mb"
  }
}'

Transaction log settings

Trade durability for performance (fewer IOPS):

curl -XPUT localhost:9200/$INDEX/_settings -d '{
  "index.translog.durability": "async"
}'
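
You can also make translog-triggered flushes less frequent by raising the flush threshold (the 1gb value here is just an example):

curl -XPUT localhost:9200/$INDEX/_settings -d '{
  "index.translog.flush_threshold_size": "1gb"
}'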

GC tuning

If survivor space is mostly full, you can increase it by lowering -XX:SurvivorRatio in jvm.options (default is 8 on most platforms).

If the whole young generation (survivor + eden) is mostly full, you can increase it via -XX:NewSize.

On large heaps (>30GB, usually you’d want to stay under 30GB to get compressed pointers, but 60-90GB may be needed on some big boxes), using G1 instead of CMS should help. To do that, replace:

-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

With:

-XX:+UseG1GC

Clear caches

Quick way to free some heap:

curl -XPOST localhost:9200/_cache/clear
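
You can also clear only a specific cache, or only for a specific index (parameter names as in 5.x):

curl -XPOST "localhost:9200/$INDEX/_cache/clear?fielddata=true"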