spark creating too many partitions


Problem description

I have a 3-node Cassandra cluster with 1 seed node, plus 1 Spark master and 3 slave nodes, each with 8 GB RAM and 2 cores. Here is the input to my Spark jobs:

spark.cassandra.input.split.size_in_mb 67108864

When I run with this configuration set, I see that around 768 partitions are created for roughly 89.1 MB of data (about 1,706,765 records). I am not able to understand why so many partitions are created. I am using Cassandra Spark connector version 1.4, so the bug regarding input split size is already fixed.

There are only 11 unique partition keys. My partition key consists of an appname, which is always "test", and a random number, which is always from 0-10, so there are only 11 distinct partition keys.

Why are there so many partitions, and how does Spark decide how many partitions to create?

Recommended answer

The Cassandra connector does not use defaultParallelism. It checks a system table in C* (post 2.1.5) for an estimate of how many MB of data are in the given table. This amount is read and divided by the input split size to determine the number of splits to make.

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#what-does-inputsplitsize_in_mb-use-to-determine-size
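The division described above can be sketched as follows. This is a simplified illustration, not the connector's actual code; the 89.1 MB figure comes from the question, the 64 MB split size is the connector's documented default, and the exact rounding inside the connector may differ:

```scala
object SplitEstimate extends App {
  // Size estimate the connector would read from C*'s system.size_estimates
  // table (post 2.1.5); here we plug in the figure from the question.
  val estimatedTableSizeMB = 89.1
  // Default value of spark.cassandra.input.split.size_in_mb.
  val inputSplitSizeMB = 64.0

  // Divide the estimated table size by the split size, rounding up,
  // with a floor of one split.
  val numSplits = math.max(1, math.ceil(estimatedTableSizeMB / inputSplitSizeMB).toInt)

  println(numSplits) // 2 splits for these numbers
}
```

Note that the question sets `spark.cassandra.input.split.size_in_mb` to 67108864, which is 64 MB expressed in bytes; since the parameter is interpreted in megabytes, a mismatch like this can easily produce an unexpected split count.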

If you are on C* < 2.1.5, you will need to manually set the partitioning via a ReadConf.
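A minimal sketch of setting the partitioning manually, assuming the connector 1.4 `ReadConf` API with its `splitCount` field; the keyspace, table name, and split count below are placeholders, and this fragment needs a running Spark context and Cassandra cluster, so it is shown as a configuration sketch only:

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

// Force a fixed number of Spark partitions instead of relying on the
// size estimate (useful on C* < 2.1.5, where no estimate is available).
// "my_keyspace", "my_table", and 24 are hypothetical values.
val rdd = sc.cassandraTable("my_keyspace", "my_table")
  .withReadConf(ReadConf(splitCount = Some(24)))
```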

