When do you start additional Elasticsearch nodes?


Problem description



I'm in the middle of attempting to replace a Solr setup with Elasticsearch. This is a new setup, which has not yet seen production, so I have lots of room to fiddle with things and get them working well.

I have very, very large amounts of data. I'm indexing some live data and holding onto it for 7 days (by using the _ttl field). I do not store any data in the index (and disabled the _source field). I expect my index to stabilize around 20 billion rows. I will be putting this data into 2-3 named indexes. Search performance so far with up to a few billion rows is totally acceptable, but indexing performance is an issue.
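For reference, this is roughly the index-creation call such a setup implies; a minimal sketch using the Python requests library against an assumed local node, with made-up index and type names. Note that the _ttl field only exists in Elasticsearch versions of that era (it was removed in 5.0):

    import requests

    ES = "http://localhost:9200"  # assumed local node address

    # Create an index whose documents expire after 7 days and keep no
    # _source, mirroring the setup described above (_ttl is pre-5.0 only).
    resp = requests.put(ES + "/events", json={
        "mappings": {
            "event": {  # hypothetical type name
                "_ttl": {"enabled": True, "default": "7d"},
                "_source": {"enabled": False},
            }
        }
    })
    print(resp.json())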

I am a bit confused about how ES uses shards internally. I have created two ES nodes, each with a separate data directory, each with 8 indexes and 1 replica. When I look at the cluster status, I only see one shard and one replica for each node. Doesn't each node keep multiple indexes running internally? (Checking the on-disk storage location shows that there is definitely only one Lucene index present). -- Resolved, as my index setting was not picked up properly from the config. Creating the index using the API and specifying the number of shards and replicas has now produced exactly what I would've expected to see.

Also, I tried running multiple copies of the same ES node (from the same configuration), and it recognizes that there is already a copy running and creates its own working area. These new instances of nodes also seem to only have one index on-disk. -- Now that each node is actually using multiple indices, a single node with many indices is more than sufficient to throttle the entire system, so this is a non-issue.

When do you start additional Elasticsearch nodes, for maximum indexing performance? Should I have many nodes each running with 1 index 1 replica, or fewer nodes with tons of indexes? Is there something I'm missing with my configuration in order to have single nodes doing more work?

Also: Is there any metric for knowing when an HTTP-only node is overloaded? Right now I have one node devoted to HTTP only, but aside from CPU usage, I can't tell if it's doing OK or not. When is it time to start additional HTTP nodes and split up your indexing software to point to the various nodes?

Solution

Let's clarify the terminology a little first:

  • Node: a running Elasticsearch instance (a Java process). Usually every node runs on its own machine.
  • Cluster: one or more nodes with the same cluster name.
  • Index: more or less like a database.
  • Type: more or less like a database table.
  • Shard: effectively a Lucene index. Every index is composed of one or more shards. A shard can be a primary shard (or simply a shard) or a replica.

When you create an index you can specify the number of shards and number of replicas per shard. The default is 5 primary shards and 1 replica per shard. The shards are automatically evenly distributed over the cluster. A replica shard will never be allocated on the same machine where the related primary shard is.
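As an illustration, here is a hedged sketch of that create-index call through the REST API, again using the requests library against an assumed local node; the index name and shard counts are made up:

    import requests

    ES = "http://localhost:9200"  # assumed local node address

    # Create "logs-2012" with 8 primary shards and 1 replica per shard,
    # instead of the default 5 primaries / 1 replica per shard.
    resp = requests.put(ES + "/logs-2012", json={
        "settings": {
            "number_of_shards": 8,
            "number_of_replicas": 1,
        }
    })
    print(resp.json())  # older versions answer {"ok": true, "acknowledged": true}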

What you see in the cluster status is weird; I'd suggest checking your index settings using the get settings API. It looks like you configured only one shard, but in any case you should see more shards if you have more than one index. If you need more help, you can post the output that you get from Elasticsearch.
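That check could look like this, still assuming a local node and the hypothetical index from the earlier sketch:

    import requests

    ES = "http://localhost:9200"  # assumed local node address

    # The response should report index.number_of_shards and
    # index.number_of_replicas for every matching index.
    print(requests.get(ES + "/logs-2012/_settings").json())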

How many shards and replicas you use really depends on your data, the way you access it, and the number of available nodes/servers. The best practice is to overallocate shards a little, so that they can be redistributed if you add more nodes to your cluster, since you can't (for now) change the number of shards once the index has been created. That said, you can always change the number of shards later if you are willing to do a complete reindex of your data, as sketched below.
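A rough sketch of that reindex path: create a replacement index with the new shard count, then replay the documents through the bulk API. Since _source is disabled in this setup, the documents have to come from the original feed (a stand-in generator below) rather than from Elasticsearch itself; all names are hypothetical:

    import json
    import requests

    ES = "http://localhost:9200"  # assumed local node address

    # 1. Create the replacement index with the new shard count.
    requests.put(ES + "/logs-2012-v2", json={
        "settings": {"number_of_shards": 16, "number_of_replicas": 1},
    })

    def original_feed():
        # Stand-in for the real data source being re-read.
        yield {"message": "example event"}

    # 2. Replay the documents through the bulk API (an action line
    #    followed by a source line, newline-delimited).
    lines = []
    for i, doc in enumerate(original_feed()):
        lines.append(json.dumps(
            {"index": {"_index": "logs-2012-v2", "_type": "event", "_id": str(i)}}))
        lines.append(json.dumps(doc))
    resp = requests.post(ES + "/_bulk",
                         data="\n".join(lines) + "\n",
                         headers={"Content-Type": "application/x-ndjson"})
    print(resp.json())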

Every additional shard comes with a cost, since each shard is effectively a Lucene instance. The maximum number of shards you can have per machine really depends on the hardware available and on your data. It's good to know that having 100 indexes with one shard each is really the same as having one index with 100 shards, since you'd have 100 Lucene instances in both cases.

Of course, at query time, if you want to query a single Elasticsearch index composed of 100 shards, Elasticsearch needs to query them all in order to get proper results (unless you used specific routing for your documents, in which case you can query only a specific shard). This comes with a performance cost.
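To make the routing escape hatch concrete, a sketch with hypothetical names; the routing query parameter itself is a standard Elasticsearch feature:

    import requests

    ES = "http://localhost:9200"  # assumed local node address

    # Index a document with an explicit routing value so it lands on one
    # deterministic shard of the index.
    requests.put(ES + "/logs-2012/event/1",
                 params={"routing": "user42"},  # hypothetical routing key
                 json={"user": "user42", "message": "login"})

    # Search with the same routing value: only that one shard is queried
    # instead of all shards of the index.
    resp = requests.post(ES + "/logs-2012/_search",
                         params={"routing": "user42"},
                         json={"query": {"term": {"user": "user42"}}})
    print(resp.json()["hits"]["total"])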

You can easily check the state of your cluster and nodes using the Cluster Nodes Info API, through which you can check a lot of useful information: everything you need in order to know whether your nodes are running smoothly or not. Even easier, there are a couple of plugins that present that information through a nice user interface (and internally use the Elasticsearch APIs anyway): paramedic and bigdesk.
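A minimal way to poll the same information those plugins read, assuming a local node; both endpoints are standard cluster APIs, though exact response fields vary by version:

    import requests

    ES = "http://localhost:9200"  # assumed local node address

    # Cluster-level view: "green" means all primaries and replicas are
    # allocated, "yellow" means some replicas are unassigned.
    health = requests.get(ES + "/_cluster/health").json()
    print(health["status"], health["number_of_nodes"])

    # Per-node statistics (the data behind bigdesk): JVM heap, thread
    # pools, and similar signals for spotting an overloaded node.
    stats = requests.get(ES + "/_nodes/stats").json()
    for node in stats["nodes"].values():
        print(node["name"], node["jvm"]["mem"])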
