spark creating too many partitions
Question
I have a 3-node Cassandra cluster with 1 seed node, and 1 Spark master with 3 slave nodes, each with 8 GB RAM and 2 cores. Here is the input to my Spark jobs:
spark.cassandra.input.split.size_in_mb 67108864
When I run with this configuration set, I see that around 768 partitions are created for roughly 89.1 MB of data (about 1,706,765 records). I am not able to understand why so many partitions are created. I am using Cassandra Spark connector version 1.4, so the bug regarding input split size is already fixed.
There are only 11 unique partition keys. My partition key has an appname, which is always "test", and a random number, which is always from 0-10, so there are only 11 distinct partitions.
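For reference, a partition key like the one described would look something like this in CQL (a hypothetical schema; the table and column names are not from the question):

```sql
-- Hypothetical table: the partition key is (appname, randomnumber),
-- so with appname = 'test' and randomnumber in 0-10 there are
-- only 11 distinct partitions.
CREATE TABLE my_ks.events (
    appname      text,
    randomnumber int,
    event_id     timeuuid,
    payload      text,
    PRIMARY KEY ((appname, randomnumber), event_id)
);
```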
Why are so many partitions created, and how does Spark decide how many partitions to create?
Answer
The Cassandra connector does not use defaultParallelism. It checks a system table in C* (post 2.1.5) for an estimate of how many MB of data are in the given table. This amount is read and divided by the input split size to determine the number of splits to make.
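The arithmetic the answer describes can be sketched as follows (a simplified illustration, not the connector's actual code; the real connector also merges and clamps splits):

```python
import math

def estimated_split_count(table_size_mb, split_size_in_mb):
    """Divide the table size estimate (as read from C*'s
    system.size_estimates table) by the configured split size
    to get the number of Spark partitions."""
    return max(1, math.ceil(table_size_mb / split_size_in_mb))

# With the default 64 MB split size, a ~89 MB table yields only 2 splits:
print(estimated_split_count(89.1, 64))    # -> 2

# A wildly inflated size estimate would explain the 768 partitions
# seen in the question (hypothetical figure, for illustration only):
print(estimated_split_count(49152, 64))   # -> 768
```

This is why a stale or inaccurate size estimate in C* can produce far more partitions than the actual data volume warrants.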
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#what-does-inputsplitsize_in_mb-use-to-determine-size
If you are on C* < 2.1.5, you will need to manually set the partitioning via a ReadConf.
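A minimal sketch of what that could look like in the connector's Scala API (keyspace, table name, and split count are placeholders; assumes the ReadConf case class with a splitCount field, as in connector 1.x):

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

// Force a fixed number of splits instead of relying on the
// size estimate, which is unavailable on C* < 2.1.5.
val rdd = sc.cassandraTable("my_ks", "my_table")
  .withReadConf(ReadConf(splitCount = Some(8)))
```

With splitCount set explicitly, the connector skips the size-based calculation entirely.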