Repartitioning of dataframe in spark does not work
Problem description
I have a Cassandra database with a large number of records, ~4 million. I have 3 slave machines and one driver. I want to load this data into Spark memory and process it. When I do the following, it reads all the data into one slave machine (300 MB out of 6 GB) while the memory of all the other slave machines is unused. I did a repartition of the dataframe into 3 partitions, but the data still sits on one machine. Because of this it takes a long time to process the data, since every job is executed on a single machine. This is what I am doing:
val tabledf = _sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "events", "keyspace" -> "sams"))
  .load()
tabledf.registerTempTable("tempdf")
_sqlContext.cacheTable("tempdf")
val rdd = _sqlContext.sql(query)
val partitionedRdd = rdd.repartition(3)
val count = partitionedRdd.count.toInt
When I do some operations on partitionedRdd, they are executed on only one machine, since all the data is present on a single machine.
UPDATE: I am using --conf spark.cassandra.input.split.size_in_mb=32 in the configuration, but still all my data is loaded into one executor.
Update: I am using Spark version 1.4 and Spark Cassandra Connector version 1.4.
Recommended answer
If the query only accesses a single C* partition key, you will only get a single task, because we don't have a way (yet) of automatically reading a single Cassandra partition in parallel. If you are accessing multiple C* partitions, then try further shrinking the input split size in MB.
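A minimal sketch of how the split size could be set programmatically with the Spark 1.4 APIs, assuming the query spans multiple Cassandra partition keys (the table and keyspace names are taken from the question; the app name is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumes the Spark Cassandra Connector 1.4 is on the classpath.
val conf = new SparkConf()
  .setAppName("cassandra-repartition-sketch")
  // A smaller split size means more Spark partitions per table scan.
  // This only helps if the query touches multiple C* partition keys.
  .set("spark.cassandra.input.split.size_in_mb", "32")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val tabledf = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "events", "keyspace" -> "sams"))
  .load()

// Check how many partitions the scan actually produced, before any
// repartition call: if this prints 1, the data came from one C* partition.
println(s"partitions after load: ${tabledf.rdd.partitions.length}")
```

If the scan already yields only one partition, repartition(3) will shuffle its rows across executors, but the initial read itself still runs as a single task.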