Background
While working with a test index, I accidentally deleted the node.lock file (from my Solr days I had the impression this file could be deleted freely... fortunately it was only a test index). As a result the index health went red, and the shard could not recover after a restart.
Error symptoms
Checkpoint file translog-641.ckp already exists but has corrupted content expected: Checkpoint{offset=67491679, numOps=69873, generation=641, minSeqNo=43662273, maxSeqNo=43732145, globalCheckpoint=43731473, minTranslogGeneration=632} but got: Checkpoint{offset=62260626, numOps=64757, generation=641, minSeqNo=43662273, maxSeqNo=43727029, globalCheckpoint=43727027, minTranslogGeneration=632}
at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:252) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.translog.Translog.<init>(Translog.java:179) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:434) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:185) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:157) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:2160) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:2142) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1349) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1304) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:420) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.shard.StoreRecovery$$Lambda$2672/1070679391.run(Unknown Source) ~[?:?]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:301) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1575) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$5(IndexShard.java:2028) ~[elasticsearch-6.3.2.jar:6.3.2]
The log keeps looping on this stack trace, so let's analyze it. The translog checkpoint is corrupted: the checkpoint content Elasticsearch expected for generation 641 does not match what it actually read back (different offset, numOps, maxSeqNo), so shard recovery fails and the index health check goes red. The worst-case fallback is simply rebuilding the index, but I wanted a less destructive remedy, and to accumulate some troubleshooting experience along the way.
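To see how far the two views of the translog diverged, the two Checkpoint strings in the error message can be compared directly. A quick sketch (the parsing helper is my own, not part of Elasticsearch; the strings are copied from the log above):

```python
import re

# The "expected" and "got" Checkpoint blobs from the error message above.
expected = ("Checkpoint{offset=67491679, numOps=69873, generation=641, "
            "minSeqNo=43662273, maxSeqNo=43732145, globalCheckpoint=43731473, "
            "minTranslogGeneration=632}")
actual = ("Checkpoint{offset=62260626, numOps=64757, generation=641, "
          "minSeqNo=43662273, maxSeqNo=43727029, globalCheckpoint=43727027, "
          "minTranslogGeneration=632}")

def parse_checkpoint(s):
    """Parse a Checkpoint{key=value, ...} string into a dict of ints."""
    return {k: int(v) for k, v in re.findall(r"(\w+)=(-?\d+)", s)}

exp, act = parse_checkpoint(expected), parse_checkpoint(actual)
print("operation count gap:", exp["numOps"] - act["numOps"])
print("byte offset gap:", exp["offset"] - act["offset"])
print("maxSeqNo gap:", exp["maxSeqNo"] - act["maxSeqNo"])
```

In this case the two checkpoints for the same generation disagree by a few thousand operations and several megabytes of translog data, which gives a rough sense of how much would be sacrificed by discarding the translog.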
Root cause analysis
The translog records each shard's write operations along with offsets and related metadata. My misoperation was deleting node.lock and then restarting. My inference: once the lock's protection was lost, the index itself was no longer actually receiving writes, but the translog kept being appended to without any such restriction, so its checkpoint diverged from the index. Lucene-based products like Solr and Elasticsearch all follow a similar pattern: writes are first appended to a transaction log, and are only later durably committed into the index; on recovery, the log is replayed to restore uncommitted operations. So we can accept this trade-off: give up the operations that exist only in the translog, and protect the index itself. That means discarding the corrupted translog, but this must not be done with a plain rm; use the elasticsearch-translog tool that ships with Elasticsearch.
Command
./elasticsearch-translog truncate -d /data1/elasticsearch/elasticsearch-6.3.2-test2/data/nodes/0/indices/_mkAKnyeT5uzANk9IBTw0Q/0/translog/
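Note that elasticsearch-translog should only be run while the node is stopped, and it asks for confirmation before discarding data. Before truncating, it can be worth sanity-checking what generation files actually sit in the translog directory (the translog-&lt;N&gt;.tlog / .ckp pairs). A small standalone helper for that, entirely my own and purely illustrative:

```python
import os
import re

def list_translog_generations(translog_dir):
    """Return the sorted generation numbers that have a translog-<N>.tlog file."""
    gens = []
    for name in os.listdir(translog_dir):
        m = re.fullmatch(r"translog-(\d+)\.tlog", name)
        if m:
            gens.append(int(m.group(1)))
    return sorted(gens)
```

Comparing the highest generation on disk against the generation named in the error (641 here) confirms you are pointing the truncate command at the right shard directory. After truncation, restart the node and let the shard recover; the operations that lived only in the translog are gone for good.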