Diagnose
- There is one Elasticsearch (ES) cluster running on VMs with attached volumes. The nodes and their roles are as below:
Node | Role
es5 | data
es6 | data
es7 | data
es8 | master
es9 | master
es10 | master
- Each of the 3 data nodes has one Cinder volume mounted at "/data1" as one of its data folders.
- Because of a network issue last Friday, all 3 volumes became "Read-Only File System", so the data nodes could no longer write new documents.
- This stopped the ES cluster from performing shard operations (allocation and replication), so the whole cluster went red because some indices had shard issues.
- We then restarted the Elasticsearch service on es5 and saw the exception below:
Caused by: java.io.FileNotFoundException: /data1/instance00/stackstash-elasticsearch/nodes/0/indices/alert_08052014/3/index/_8h5j.fdx (Read-only file system)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
at java.io.FileOutputStream.<init>(FileOutputStream.java:165)
at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:384)
at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:277)
at org.apache.lucene.store.FileSwitchDirectory.createOutput(FileSwit
- The attached volume was still in "Read-Only" mode.
- Rebooting the data node resolved the "Read-Only" issue.
- However, the Elasticsearch service could not start on es5, failing with "failed to read local state". Some state files may have been corrupted while the file system was read-only.
- The same thing happened on es6 and es7.
- At this point only the 3 master nodes were working normally; none of the 3 data nodes could start the Elasticsearch service.
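The read-only state described above can be checked for directly before restarting anything. The following is a minimal sketch, assuming the data nodes run Linux and expose the mount table at /proc/mounts; the helper name is_readonly_mount is hypothetical and not part of the original procedure.

```shell
# is_readonly_mount MOUNTPOINT
# Reads mount-table text in /proc/mounts format on stdin and exits 0 when
# MOUNTPOINT is listed with the "ro" option, 1 otherwise.
is_readonly_mount() {
  awk -v mp="$1" '
    $2 == mp {
      # Field 4 holds the comma-separated mount options.
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++) if (opts[i] == "ro") ro = 1
    }
    END { exit(ro ? 0 : 1) }
  '
}

# Usage on a data node (assumption: Linux with /proc/mounts):
#   is_readonly_mount /data1 < /proc/mounts && echo "/data1 is read-only"
```

Running this on each data node before and after the reboot would show whether the volume has actually returned to read-write mode.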
How to recover
- The replica count for each index was set to 2, which means a copy of all the data still existed on the remaining 2 data nodes.
- Shut down the data node with the corrupt Lucene shards, which is es5.
- Start the Elasticsearch service on es6 and es7 with shard allocation disabled (the setting can be applied from any master node):
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.enable": "none"}}'
- The steps above brought up the data nodes, and everything looked fine. Re-enable shard allocation as below; the cluster took a while to turn green:
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.enable": "all"}}'
- Change the replica count for all indices to 1 (originally 2) in order to speed up recovery and rebalancing.
- Wait for shards to rebalance (the cluster is still red because there are unassigned primaries).
- Purge the data file system on the corrupt data node.
- Restart the corrupt data node (it is fine now) and allow rebalancing to start.
- Identify the remaining primaries that ES is not initialising automatically.
- Manually assign the remaining primaries to the newly recovered data node:
# Force-allocate every UNASSIGNED shard to es5.
# Note: allow_primary can lose data if no copy of the shard survives.
for index in $(curl -s -XGET http://localhost:9200/_cat/shards | grep UNASSIGNED | awk '{print $1}' | sort -u); do
  for shard in $(curl -s -XGET http://localhost:9200/_cat/shards/$index | grep UNASSIGNED | awk '{print $2}' | sort -u); do
    curl -s -XPOST 'localhost:9200/_cluster/reroute' -d "{
      \"commands\" : [ {
        \"allocate\" : {
          \"index\" : \"$index\",
          \"shard\" : $shard,
          \"node\" : \"es5\",
          \"allow_primary\" : true
        }
      } ]
    }"
    sleep 5
  done
done
- Wait for cluster to turn green.
- Change the replica count for all indices back to 2 (cluster goes yellow).
- Wait for cluster to turn green.
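Two of the steps above, identifying the primaries that are still unassigned and waiting for the cluster to turn green, can be sketched as small helpers. This is only a sketch: the function names are hypothetical, and it assumes the default `_cat/shards` column order (index, shard, prirep, state, ...) and a `_cluster/health` response containing a single "status" field.

```shell
# unassigned_primaries
# Reads `_cat/shards` output on stdin and prints "index shard" for each
# primary shard (prirep column is "p") whose state is UNASSIGNED.
unassigned_primaries() {
  awk '$3 == "p" && $4 == "UNASSIGNED" { print $1, $2 }' | sort -u
}

# cluster_status
# Reads a `_cluster/health` JSON response on stdin and prints the value of
# its "status" field (green, yellow or red).
cluster_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Usage against the cluster (not run here):
#   curl -s localhost:9200/_cat/shards | unassigned_primaries
#   until [ "$(curl -s localhost:9200/_cluster/health | cluster_status)" = green ]; do
#     sleep 10
#   done
```

Filtering on the prirep column limits the list to primaries, whereas a plain grep for UNASSIGNED also matches unassigned replicas.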
What we learned
- At some point, move away from attached volumes, since local disk is plentiful on most VM flavors, and rely instead on replicating shards across data nodes.
- During recovery, set the replica count to 1 to speed up rebalancing, then set it back once the cluster has recovered.
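The write-up does not show the command used to change the replica count. One way to do it is through the index settings update API, sketched below; the helper name replica_settings_body is hypothetical, and the curl calls assume the cluster is reachable on localhost:9200 as in the commands above.

```shell
# replica_settings_body COUNT
# Prints the JSON body for an index settings update that sets the replica
# count; rejects anything that is not a non-negative integer.
replica_settings_body() {
  case "$1" in
    ''|*[!0-9]*)
      echo "replica count must be a non-negative integer" >&2
      return 1
      ;;
  esac
  printf '{"index":{"number_of_replicas":%s}}\n' "$1"
}

# Usage (not run here; PUT /_settings without an index name applies to all indices):
#   curl -XPUT localhost:9200/_settings -d "$(replica_settings_body 1)"   # during recovery
#   curl -XPUT localhost:9200/_settings -d "$(replica_settings_body 2)"   # once the cluster is green
```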
Reference
- Shard allocation disable/enable