Performance issues using Elasticsearch as a time window storage


Problem description


We are using Elasticsearch almost as a cache, storing documents found in a time window. We continuously insert a lot of documents of different sizes, and then we search in ES using text queries combined with a date filter so that the current thread does not get documents it has already seen. Something like this:

"((word1 AND word 2) OR (word3 AND word4)) AND insertedDate > 1389000"

We keep the data in Elasticsearch for 30 minutes, using the TTL feature. Today we have at least 3 machines inserting new documents in bulk requests every minute, one request per machine, and searching with queries like the one above practically continuously.

We are having a lot of trouble indexing and retrieving these documents; we are not getting good throughput for documents being indexed and returned by ES. We can't even get 200 documents indexed per second.

We believe the problem lies in the simultaneous queries, inserts, and TTL deletes. We don't need to keep old data in Elastic; we just need a small time window of documents indexed in Elastic at a given time. What should we do to improve our performance?

Thanks in advance

Machine type:

  • An Amazon EC2 medium instance (3.7 GB of RAM)

Additional information:

EDIT

Sorry about the long delay to give you guys some feedback. Things were kind of hectic here at our company, and I chose to wait for calmer times to give a more detailed account of how we solved our issue. We still have to do some benchmarks to measure the actual improvements, but the point is that we solved the issue :)

First of all, I believe the indexing performance issues were caused by a usage error on our part. As I said before, we used Elasticsearch as a sort of cache, to look for documents inside a 30-minute time window. We looked for documents in Elasticsearch whose content matched some query and whose insert date was within some range. Elastic would then return us the full document JSON (which had a whole lot of data besides the indexed content). Our configuration had Elastic indexing the document JSON field by mistake (besides the content and insertDate fields), which we believe was the main cause of the indexing performance issues.

However, we also did a number of modifications, as suggested by the answers here, which we believe also improved the performance:

  • We now do not use the TTL feature, and instead use two "rolling indexes" under a common alias. When an index gets old, we create a new one, assign the alias to it, and delete the old one.

  • Our application does a huge number of queries per second. We believe this hits Elastic hard and degrades indexing performance (since we only use one node for Elasticsearch). We were using 10 shards for the node, which caused each query we fired at Elastic to be translated into 10 queries, one per shard. Since we can discard the data in Elastic at any moment (so changing the number of shards is not a problem for us), we simply changed the number of shards to 1, greatly reducing the number of queries hitting our Elastic node (see the sketch after this list).

  • We had 9 mappings in our index, and each query would be fired to a specific mapping. Of those 9 mappings, about 90% of the documents inserted went to two of those mappings. We created a separate rolling index for each of those mappings, and left the other 7 in the same index.

  • Not really a modification, but we installed SPM (Scalable Performance Monitoring) from Sematext, which allowed us to closely monitor Elasticsearch and track important metrics, such as the number of queries fired -> sematext.com/spm/index.html
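To illustrate the shard and mapping changes above, here is a minimal sketch, with all index names hypothetical, of creating the single-shard rolling indexes - one per heavy mapping plus a shared one for the remaining seven:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A single shard means each query fans out to one shard instead of 10;
# replicas add nothing on a one-node cluster.
settings = {"settings": {"number_of_shards": 1, "number_of_replicas": 0}}

for name in ("docs-heavy1-201309120130",   # hypothetical heavy mapping #1
             "docs-heavy2-201309120130",   # hypothetical heavy mapping #2
             "docs-misc-201309120130"):    # the other 7 mappings share this index
    es.indices.create(index=name, body=settings)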

Our usage numbers are relatively small. We have about 100 documents/second arriving that have to be indexed, with peaks of 400 documents/second. As for searches, we have about 1500 searches per minute (15000 before changing the number of shards). Before those modifications we were hitting those performance issues, but not anymore.

Solution

TTL to time-series based indexes

You should consider using time-series-based indexes rather than the TTL feature. Given that you only care about the most recent 30-minute window of documents, create a new index every 30 minutes using a date/time-based naming convention, i.e. docs-201309120000, docs-201309120030, docs-201309120100, docs-201309120130, etc. (Note the 30-minute increments in the naming convention.)
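As a small worked example of that convention (assuming UTC timestamps and the docs- prefix), the target index name can be computed by rounding the timestamp down to the nearest 30-minute boundary:

from datetime import datetime

def bucket_name(ts):
    # Round down to the :00 or :30 mark, matching docs-YYYYMMDDHHMM.
    minute = 0 if ts.minute < 30 else 30
    return ts.strftime("docs-%Y%m%d%H") + "%02d" % minute

print(bucket_name(datetime(2013, 9, 12, 1, 42)))  # -> docs-201309120130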

Using Elasticsearch's index aliasing feature (http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases/), you can alias docs to the most recently created index so that when you are bulk indexing, you always use the alias docs, but the documents will get written to docs-201309120130, for example.
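A hedged sketch of that rotation with the Python client (the REST equivalent is the _aliases endpoint linked above; index names are examples):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Move the "docs" write alias to the newest index; both actions are applied
# atomically in a single request, so writers never observe a missing alias.
es.indices.update_aliases(body={"actions": [
    {"remove": {"index": "docs-201309120100", "alias": "docs"}},
    {"add":    {"index": "docs-201309120130", "alias": "docs"}},
]})

# Windows older than the two most recent ones can simply be dropped.
es.indices.delete(index="docs-201309120000")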

When querying, you would filter on a datetime field to ensure only the most recent 30 minutes of documents are returned, and you'd need to query against the 2 most recently created indexes to ensure you get your full 30 minutes of documents - you could create another alias here to point to the two indexes, or just query against the two index names directly.
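Query-side, that could look like the following sketch (comma-separated index names hit both windows; this assumes insertedDate is mapped as a date type so the now-30m shorthand applies - the full-text query from the question would sit next to the range clause inside a bool query):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hit the two newest window indexes, keeping only the sliding 30 minutes.
res = es.search(
    index="docs-201309120100,docs-201309120130",
    body={"query": {"range": {"insertedDate": {"gte": "now-30m"}}}},
)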

With this model, you don't have the overhead of TTL usage, and you can just delete the old, unused indexes from over an hour in the past.

There are other ways to improve bulk indexing and querying speed as well, but I think removal of TTL is going to be the biggest win - plus, your indexes only have a limited amount of data to filter/query against, which should provide a nice speed boost.

Elasticsearch settings (e.g. memory, etc.)

Here are some settings that I commonly adjust on servers running ES - http://pastebin.com/mNUGQCLY - note that it's only for a 1GB VPS, so you'll need to adjust them.

Node roles

Looking into master vs data vs 'client' ES node types might help you as well - http://www.elasticsearch.org/guide/reference/modules/node/

Indexing settings

When doing bulk inserts, consider modifying the values of both index.refresh_interval and index.merge.policy.merge_factor - I see that you've modified refresh_interval to 5s, but consider setting it to -1 before the bulk indexing operation, and then back to your desired interval afterwards. Or, consider just doing a manual _refresh API call after your bulk operation is done, particularly if you're only doing bulk inserts every minute - it's a controlled environment in that case.
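A sketch of that pattern (again elasticsearch-py style; the docs index name and the sample batch are placeholders):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

docs = [{"content": "word1 word2", "insertedDate": 1389000}]   # placeholder batch
actions = ({"_index": "docs", "_source": d} for d in docs)

# Disable refreshes for the duration of the bulk load...
es.indices.put_settings(index="docs", body={"index": {"refresh_interval": "-1"}})
helpers.bulk(es, actions)
# ...then refresh once manually and restore the usual interval.
es.indices.refresh(index="docs")
es.indices.put_settings(index="docs", body={"index": {"refresh_interval": "5s"}})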

With index.merge.policy.merge_factor, setting it to a higher value reduces the amount of segment merging ES does in the background; setting it back to its default after the bulk operation restores normal behaviour. A setting of 30 is commonly recommended for bulk inserts, and the default value is 10.
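And the corresponding sketch for the merge factor, reusing the es client from the snippet above (note that this setting belongs to the log-structured merge policies of that Elasticsearch generation and was removed in later versions, so treat it as era-specific):

# Fewer background merges while bulk loading...
es.indices.put_settings(index="docs",
                        body={"index": {"merge.policy.merge_factor": 30}})
# ... run the bulk insert here ...
# ...then return to the default once done.
es.indices.put_settings(index="docs",
                        body={"index": {"merge.policy.merge_factor": 10}})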
