ElasticSearch setup for a large cluster with heavy aggregations


Problem description

Context and current state

We are migrating our cluster from Cassandra to a full ElasticSearch cluster. We are indexing documents at an average of ~250-300 docs per second. In ElasticSearch 1.2.0 this represents ~8 GB per day.

{
 "generic":
    {
      "id": "twi471943355505459200",
      "type": "twitter",
      "title": "RT @YukBerhijabb: The Life is Choice - https://m.facebook.com/story.php?story_fbid=637864496306297&id=100002482564531&refid=17",
      "content": "RT @YukBerhijabb: The Life is Choice - https://m.facebook.com/story.php?story_fbid=637864496306297&id=100002482564531&refid=17",
      "source": "<a href=\"https://twitter.com/download/android\" rel=\"nofollow\">Twitter for  Android</a>",
      "geo": null,
      "link": "http://twitter.com/rosi_sifah/status/471943355505459200",
      "lang": "en",
      "created_at": 1401355038000,
      "author": {
        "username": "rosi_sifah",
        "name": "Rosifah",
        "id": 537798506,
        "avatar": "http://pbs.twimg.com/profile_images/458917673456238592/Im22zoIV_normal.jpeg",
        "link": "http://twitter.com/rosi_sifah"
      }
    },
 "twitter": {
   // a tweet JSON
 }
}

Our users save their requests in our SQL database, and when they ask for their dashboard we would like to query our ES cluster with that saved query (retrieved from the database) and do some aggregation on top of it using the new ES aggregation framework.

Each dashboard is displayed for an explicit, user-selected date range, so we always use

"range": {
 "generic.created_at": {
   "from": 1401000000000,
   "to": 1401029019706
  }
}

along with the ES query.

We specified _routing that way:

"_routing":{
 "required":true,
 "path":"generic.id"
},

and the _id with:

"_id": {
  "index": "not_analyzed",
  "store": "false",
  "path": "generic.id"
}

Over approximately 5 days we've stored 67 million documents (about 40 GB) inside one index. We've since learned about the good practice of splitting the index by day, so now our indices are split by day ([index-name]-[YYYY-MM-DD]).

Currently each index has 5 shards and 1 replica. We have a cluster composed of 3 machines, each with 8 cores, 16 GB of RAM and 8 TB of HDD. We plan to use another machine as a gateway (8 cores, 16 GB of RAM, 1 TB of HDD).

We've left the ES configuration at its defaults, apart from the cluster configuration.

Questions

  1. For each document we want to index, we say explicitly which index to use. Currently we use the date of the current day. Should we use the date of the document instead, in order to prevent hot spots? Because currently it means that documents from various days (as specified in their created_at) can live in the same index as the current day's documents.
  2. Are 5 shards enough (or too many) for 21,600,000 documents per day?
  3. If we want all our aggregation queries to be processed in less than 1 second, how many replicas should we set up?
  4. Should we change our routing? We don't know ahead of time which documents will be processed by the aggregation for each request we make to the cluster (since the query is user-defined).
  5. What kind of hardware (how many machines, what configuration) should we put inside this cluster to support 6 months of documents?

[Update]

Here are some example queries:

A word cloud

GET idx-2014-05-01/stream/_search?search_type=count
{
 "query":{
   "bool": {
     "must": [{
       "query_string" : {
         "query" : "(generic.lang:fr OR generic.lang:en) AND (generic.content:javascript)"
        }},{
        "range": {
          "generic.created_at": {
            "from": 1401000000000,
            "to": 1401029019706
          }
        }}
     ]
   }
 },
  "aggs":{
    "words":{
      "terms":{
        "field": "generic.content",
        "size": 40
      }
    }
  }
}

A histogram

GET idx-2014-05-01/stream/_search?search_type=count
{
 "query":{
   "bool": {
     "must": [{
       "query_string" : {
         "query" : "generic.content:apple"
        }},{
        "range": {
          "generic.created_at": {
            "from": 1401000000000,
            "to": 1401029019706
          }
        }}
     ]
   }
 },
  "aggs":{
    "volume":{
      "date_histogram":{
        "field": "generic.created_at",
        "interval":"minute"
      }
    }
  }
}

Most used language

GET idx-2014-05-01/stream/_search?search_type=count
{
 "query":{
   "bool": {
     "must": [{
       "query_string" : {
         "query" : "(generic.lang:fr OR generic.lang:en) AND (generic.content:javascript)"
        }},{
        "range": {
          "generic.created_at": {
            "from": 1401000000000,
            "to": 1401029019706
          }
        }}
     ]
   }
 },
  "aggs":{
    "top_source":{
      "terms":{
        "field": "generic.lang"
      }
    }
  }
}

Solution

Let me preface all of my answers/comments with the advice to try to, as much as possible, test these scenarios yourself. While Elasticsearch is very scalable, there are many tradeoffs that are heavily impacted by document size and type, ingestion and query volume, hardware and OS. While there are many wrong answers there is rarely one right answer.

I'm basing this response on a couple of active clusters we have, with (currently) about half a million active documents in them, plus some recent benchmarking we performed at about 4X your volume (around 80M documents ingested per day during the benchmark).

1) First off, you are not creating much of a hot spot with 3 nodes when you have even a single index with 5 shards and 1 replica per shard. Elasticsearch will separate each replica from its primary onto a different node, and in general will try to balance out the load of shards. By default, Elasticsearch will hash on the ID to pick the shard to index into (which then gets copied to the replica). Even with routing, you will only have a hot spot issue if you have single IDs that create large numbers of documents per day (which is the span of your index). Even then, it would not be a problem unless these IDs produce a significant percentage of the overall volume AND there are so few of them that you could get clumping on just 1 or 2 of the shards.

Only you can determine that based on your usage patterns - I'd suggest both some analysis of your existing data set to look for overly large concentrations and an analysis of your likely queries.

A bigger question I have is the nature of your queries. You aren't showing the full query or the full schema (I see "generic.id" referenced but not in the document schema, and your query shows pulling up every single document within a time range - is that correct?). Custom routing for indexing is most useful when your queries are bound by an exact match on the field used for routing. So, if I had an index with everyone's documents in it and my query pattern was to retrieve only a single user's documents in a single query, then custom routing by user ID would be very useful to improve query performance and reduce overall cluster load.
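
For illustration only (the index name and ID value are taken from the sample document above, not from your real traffic), this is roughly the kind of 1.x search that actually benefits from custom routing - the routing value is known up front, so only one shard is hit:

GET idx-2014-05-01/stream/_search?routing=twi471943355505459200
{
  "query": {
    "term": { "generic.id": "twi471943355505459200" }
  }
}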

A second thing to consider is the overall balance of ingestion vs. queries. You are ingesting over 20M documents a day - how many queries are you executing per day? If that number is <<< the ingestion rate, you may want to think through whether you really need a custom route. Also, if query performance is already good or great, you may not want to add the additional complexity.

Finally, on indexing by ingestion date vs. created_at: we've struggled with that one too, as we have some lag in receiving new documents. For now we've gone with storing by ingestion date, as it's easier to manage and it's not a big issue to query multiple indexes at a time, particularly if you auto-create aliases for 1 week, 2 weeks, 1 month, 2 months, etc. A bigger issue is what the distribution is - if you have documents that come in weeks or months late, perhaps you want to change to indexing by created_at, but that will require keeping those indexes online and open for quite some time.
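
To make the "query multiple indexes at a time" point concrete: a request can simply list several daily indices, use a wildcard, or hit one of the rolling aliases mentioned above (the idx-last-2-weeks alias name here is hypothetical, the daily names follow the pattern from the question):

GET idx-2014-05-01,idx-2014-05-02/stream/_search?search_type=count
GET idx-2014-05-*/stream/_search?search_type=count
GET idx-last-2-weeks/stream/_search?search_type=count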

We currently use several document indexes per day, basically in "--" format. Practically this currently means 5 indexes per day. This allows us to be more selective about moving data in and out of the cluster. Not a recommendation for you, just something we've learned is useful to us.

2) Here's the great thing about ES - since you create a new index each day, you can adjust the number of shards per index as time goes on. While you cannot change it for an existing index, you are creating a new one every day and you can base your decision on real production analytics. You certainly want to watch the numbers and be prepared to increase the shard count as/if your ingestion per day increases. It's not the simplest tradeoff - each one of those shards is a Lucene instance which potentially has multiple files open. More shards per index is not free, as the count multiplies over time. Given your use case of 6 months, that's over 1800 shards open across 3 nodes (182 days x 5 primaries and 5 replicas per day). There are likely multiple files open per shard. We've found some level of overhead and impact on resource usage on our nodes as the total shard count in the cluster increased into these ranges. Your mileage may vary, but I'd be careful about increasing the number of shards per index when you are planning on keeping 182 indexes (6 months) at a time - that's quite a multiplier. I would definitely benchmark ahead of time if you do make any changes to the default shard count.
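
One low-risk way to change the shard count only for future daily indices is an index template matching the daily naming pattern; existing indices keep their current settings. A sketch, assuming the idx-YYYY-MM-DD naming from the question (the template name and the shard/replica numbers are placeholders to adjust after benchmarking):

PUT _template/daily-idx
{
  "template": "idx-*",
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}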

3) There's no way anyone else can predict query performance ahead of time for you. It's based on overall cluster load, query complexity, query frequency, hardware, etc. It's very specific to your environment. You are going to have to benchmark this. Personally, given that you've already loaded data, I'd use ES snapshot and restore to bring this data up in a test environment. Try it with the default of 1 replica and see how it goes. Adding replica shards is great for data redundancy and can help spread queries across the cluster, but it comes at a rather steep price - a 50% increase in storage, plus each additional replica shard will bring additional ingestion cost to the node it runs on. It's great if you need the redundancy and can spare the capacity, not so great if you lack sufficient query volume to really take advantage of it.
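
A rough outline of that snapshot/restore round trip (the repository name, filesystem location and snapshot name are placeholders; a shared-filesystem repository is just one of the available repository types):

# register a repository on the production cluster
PUT _snapshot/my_backup
{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups" }
}

# snapshot the daily indices
PUT _snapshot/my_backup/snapshot_1?wait_for_completion=true
{
  "indices": "idx-*"
}

# on the test cluster, pointing at the same repository, restore it
POST _snapshot/my_backup/snapshot_1/_restore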

4) Your question is incomplete (it ends with "we never") so I can't answer it directly - however, a bigger question is why you are using custom routing to begin with. Sure, it can have great performance benefits, but it's only useful if you can segment off a set of documents by the field you use to route. It's not entirely clear from your example data and partial query whether that's the case. Personally, I'd test it without any custom routing, then try the same with it and see if it has a significant impact.

5) Another question that will require some work on your part. You need to be tracking (at a minimum) JVM heap usage, overall memory and CPU usage, disk usage and disk IO activity over time. We do, with thresholds set to alert well ahead of seeing issues so that we can add new members to the cluster early. Keep in mind that when you add a node to a cluster, ES is going to try to re-balance the cluster. Running production with only 3 nodes and a large initial document set can cause issues if you lose a node to problems (heap exhaustion, JVM error, hardware failure, network failure, etc.). ES is going to go yellow and stay there for quite some time while it reshuffles.
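
Most of those numbers are already exposed by the stock APIs, so even before wiring up a full monitoring stack you can poll something like the following (the metric list mirrors the items above):

GET _cluster/health
GET _cat/nodes?v
GET _nodes/stats/jvm,os,fs,indices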

Personally, for large document counts and high ingestion I'd start adding nodes earlier. With more nodes in place it's less of an issue if you take a node out for maintenance. Concerning your existing configuration, how did you get to 8 TB of HDD per node? Given an ingestion of 8 GB a day that seems like overkill for 6 months of data. I strongly suspect that, given the volume of data and the number of indexes/shards, you will want to move to more nodes, which will further reduce your per-node storage requirement.

I'd definitely want to benchmark the maximum number of documents per node by running loops of high-volume ingestion and normal query frequency on a cluster with just 1 or 2 nodes and seeing where it fails (whether in performance, heap exhaustion or some other issue). I'd then plan to keep the number of documents per node well below that number.

All that said, I'd go out on a limb and say I doubt you'll be all that happy with 4 billion plus documents on 3 16 GB nodes. Even if it worked (again: test, test, test), losing one node is going to be a really big event. Personally I like smaller nodes, but lots of them.

Other thoughts - we initially benchmarked on 3 Amazon EC2 m1.xlarge instances (4 cores, 15 GB of memory), which worked fine over several days of ingestion at 80M documents a day with a larger average document size than you appear to have. The biggest issue was the number of indexes and shards open (we were creating a couple of hundred new indexes per day, with maybe a couple thousand more shards per day, and this was creating issues). We've since added a couple of new nodes that have 30 GB of memory and 8 cores and then added another 80M documents to test it out. Our current production approach is to prefer more moderately sized nodes as opposed to a few large ones.

UPDATE:

Regarding the benchmarking hardware, it was, as stated above, benchmarked on 3 Amazon EC2 m1.xlarge virtual instances running Ubuntu 12.04 LTS and ES 1.1.0. We ran at about 80M documents a day (pulling data out of a MongoDB database we had previously used). Each instance had 1 TB of storage via Amazon EBS, with provisioned IOPS of, I believe, 1000. We ran for about 4-5 days. We appear to have been a bit CPU-constrained at 80M a day and believe that more nodes would have increased our ingestion rate. As the benchmark ran and the number of indexes and shards increased, we saw increasing memory pressure. We created a large number of indexes and shards (roughly 4-5 indexes per 1M documents, or about 400 indexes per day, with 5 primary shards and 1 replica shard per index).

Regarding the index aliases, we're creating, via a cron entry, rolling index aliases for 1 week back, 2 weeks back, etc., so that our application can just hit a known index alias and always run against a set time frame back from today. We're using the index aliases REST API to create and delete them:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html
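
For reference, the kind of request such a cron job ends up issuing - a single atomic call that moves an alias from an older daily index to a newer one (alias and index names here are illustrative, not our actual naming):

POST /_aliases
{
  "actions": [
    { "remove": { "index": "idx-2014-05-15", "alias": "idx-last-2-weeks" } },
    { "add":    { "index": "idx-2014-05-22", "alias": "idx-last-2-weeks" } }
  ]
}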
