Elasticsearch Snapshot Failing in AWS, Preventing Upgrade

Problem Description

My incremental snapshots in Elasticsearch are now failing. I didn't touch anything and nothing seems to have changed, so I can't figure out what is wrong.

I checked my snapshots with GET _cat/snapshots/cs-automated?v&s=id and then looked up the details of a failed one:

GET _snapshot/cs-automated/adssd....

This showed the following stack trace:

java.nio.file.NoSuchFileException: Blob object [YI-....] not found: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 21...; S3 Extended Request ID: zh1C6C0eRy....)
    at org.elasticsearch.repositories.s3.S3RetryingInputStream.openStream(S3RetryingInputStream.java:92)
    at org.elasticsearch.repositories.s3.S3RetryingInputStream.<init>(S3RetryingInputStream.java:72)
    at org.elasticsearch.repositories.s3.S3BlobContainer.readBlob(S3BlobContainer.java:100)
    at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:147)
    at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:133)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:2381)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1851)
    at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:505)
    at org.elasticsearch.snapshots.SnapshotShardsService.access$600(SnapshotShardsService.java:114)
    at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:386)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractPrioritizedRunnable.doRun(ThreadContext.java:763)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

I don't know how to resolve this, and I can no longer upgrade my index. I checked this page: Resolve snapshot error in .. but I am still struggling. I've tried deleting a whole bunch of indices, and I may try restoring an old snapshot. I also deleted some .opendis.. indices used for tracking ILM, as well as a .lock index, but nothing is helping. Very annoying.

As requested in the comments:

GET /_cat/repositories?v
id           type
cs-automated   s3

GET /_cat/snapshots/cs-automated produces heaps of snapshots, all of which have the status PARTIAL:

2020-09-08t01-12-44.ea93d140-7dba-4dcc-98b5-180e7b9efbfa PARTIAL 1599527564 01:12:44 1599527577 01:12:57 13.4s  84 177 52 229
2021-02-04t08-55-22.8691e3aa-4127-483d-8400-ce89bbbc7ea4 PARTIAL 1612428922 08:55:22 1612428957 08:55:57   35s 208 793 31 824
2021-02-04t09-55-16.53444082-a47b-4739-8ff9-f51ec038cda9 PARTIAL 1612432516 09:55:16 1612432552 09:55:52 35.6s 208 793 31 824
2021-02-04t10-55-30.6bf0472f-5a6c-4ecf-94ba-a1cf345ee5b9 PARTIAL 1612436130 10:55:30 1612436167 10:56:07 37.6s 208 793 31 824
2021-02-04t11-......
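The rows above were printed without ?v, so they carry no header. A minimal Python sketch for reading them, assuming the Elasticsearch 7.x column order id, status, start_epoch, start_time, end_epoch, end_time, duration, indices, successful_shards, failed_shards, total_shards (confirm it with ?v on your own cluster). Under that reading the first row means 52 of 229 shards failed. parse_cat_snapshots, partial_report and SnapshotRow are hypothetical helpers, not part of any Elasticsearch client.

```python
from dataclasses import dataclass

# Assumed column order of `GET _cat/snapshots/<repo>` (Elasticsearch 7.x,
# no `repository` column); check it with `?v` before relying on it.
COLUMNS = [
    "id", "status", "start_epoch", "start_time", "end_epoch", "end_time",
    "duration", "indices", "successful_shards", "failed_shards", "total_shards",
]


@dataclass
class SnapshotRow:
    id: str
    status: str
    indices: int
    successful_shards: int
    failed_shards: int
    total_shards: int


def parse_cat_snapshots(text: str) -> list[SnapshotRow]:
    """Turn plain `_cat/snapshots` output into rows, skipping headers and cut-off lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) or fields[0] == "id":
            continue  # header from `?v`, or an elided row such as "2021-02-04t11-......"
        record = dict(zip(COLUMNS, fields))
        rows.append(SnapshotRow(
            id=record["id"],
            status=record["status"],
            indices=int(record["indices"]),
            successful_shards=int(record["successful_shards"]),
            failed_shards=int(record["failed_shards"]),
            total_shards=int(record["total_shards"]),
        ))
    return rows


def partial_report(rows: list[SnapshotRow]) -> list[str]:
    """One line per PARTIAL snapshot with its failed/total shard counts."""
    return [
        f"{row.id}: {row.failed_shards}/{row.total_shards} shards failed"
        for row in rows
        if row.status == "PARTIAL"
    ]
```

Fed the four complete rows from the question, partial_report lists all four, with 52/229 failed shards for the September 2020 snapshot and 31/824 for each February 2021 one.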

Recommended Answer

The snapshots end in the PARTIAL state because the file YI-.... is missing from the S3 repository, which is a clear case of repository corruption:

java.nio.file.NoSuchFileException: Blob object [YI-....] not found: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 21...; S3 Extended Request ID: zh1C6C0eRy....)
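To see which indices and shards a missing-blob error like this affects, the details returned by GET _snapshot/cs-automated/<snapshot> can be scanned programmatically. A minimal Python sketch, assuming the 7.x response shape in which each entry of snapshots carries a failures array whose items include index, shard_id and reason, and assuming the reason text contains the exception name. Both helper names are hypothetical.

```python
from collections import defaultdict


def failed_shards_by_index(snapshot_response: dict) -> dict[str, list[int]]:
    """Map each index to the sorted shard ids that failed, across all snapshots."""
    result: dict[str, list[int]] = defaultdict(list)
    for snap in snapshot_response.get("snapshots", []):
        for failure in snap.get("failures", []):
            result[failure["index"]].append(failure["shard_id"])
    return {index: sorted(shards) for index, shards in result.items()}


def missing_blob_failures(snapshot_response: dict) -> list[dict]:
    """Failures whose reason mentions NoSuchFileException, as in the error above."""
    return [
        failure
        for snap in snapshot_response.get("snapshots", [])
        for failure in snap.get("failures", [])
        if "NoSuchFileException" in failure.get("reason", "")
    ]
```

Passing the parsed JSON of a failed snapshot to failed_shards_by_index gives the index names to look at before any cleanup.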

This kind of repository corruption has been observed when the cluster is heavily loaded (JVM memory pressure above 80% or CPU utilization above 80%) and a few nodes drop out of the cluster.
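One rough way to check for these load conditions is to read per-node heap and CPU from GET _cat/nodes?h=name,heap.percent,cpu and flag anything above 80%. A minimal Python sketch; overloaded_nodes is a hypothetical helper, and heap.percent is overall heap usage, which is not necessarily the same figure as the JVMMemoryPressure metric AWS shows in CloudWatch, so treat it only as an approximation.

```python
def overloaded_nodes(cat_nodes_text: str, heap_limit: int = 80, cpu_limit: int = 80) -> list[str]:
    """Names of nodes whose heap.percent or cpu exceeds the limits.

    Expects `GET _cat/nodes?h=name,heap.percent,cpu` output, one node per line
    as "<name> <heap.percent> <cpu>"; a `?v` header line is skipped.
    """
    flagged = []
    for line in cat_nodes_text.splitlines():
        fields = line.split()
        if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # header row or malformed line
        name, heap, cpu = fields[0], int(fields[1]), int(fields[2])
        if heap > heap_limit or cpu > cpu_limit:
            flagged.append(name)
    return flagged
```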

One way to fix the issue is to delete all the snapshots that refer to the index referenced by "YI-....". This cleans up that index's snapshot files in S3, so the next snapshot you take starts afresh.
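The selection step can be sketched in Python, assuming the response of GET _snapshot/cs-automated/_all, in which each snapshot lists its indices. The answer does not say how to map the blob "YI-...." to an index name, so the index passed in is a hypothetical placeholder you must determine yourself (for example from the failures of a PARTIAL snapshot). The sketch only builds DELETE request lines and sends nothing, so they can be reviewed first.

```python
def snapshots_containing(snapshot_response: dict, index: str) -> list[str]:
    """Names of snapshots whose `indices` list includes the given index."""
    return [
        snap["snapshot"]
        for snap in snapshot_response.get("snapshots", [])
        if index in snap.get("indices", [])
    ]


def delete_requests(repository: str, snapshot_names: list[str]) -> list[str]:
    """Build (but do not send) one DELETE request line per snapshot."""
    return [f"DELETE _snapshot/{repository}/{name}" for name in snapshot_names]
```

For example, delete_requests("cs-automated", snapshots_containing(response, "my-index")) yields one request line per affected snapshot, where "my-index" is a hypothetical index name.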

To be on the safe side, I would recommend contacting AWS Support to fix this type of repository corruption.

For reference, a similar Elasticsearch issue was fixed in Elasticsearch 7.8 and above: https://github.com/elastic/elasticsearch/issues/57198
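To check whether a cluster runs a version at or above 7.8, the version.number field returned by GET / can be compared against that threshold. A small Python sketch; includes_fix is a hypothetical helper and the 7.8 threshold comes from the answer above. It compares integer tuples because a plain string comparison would wrongly place "7.10" before "7.8".

```python
def includes_fix(version: str, fixed_in: tuple[int, int] = (7, 8)) -> bool:
    """True if an Elasticsearch version string such as "7.4.2" is at or above `fixed_in`."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= fixed_in
```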
