Repartitioning of a DataFrame in Spark does not work

Question

I have a Cassandra database with a large number of records (~4 million). I have 3 slave machines and one driver. I want to load this data into Spark memory and process it. When I do the following, all the data is read into one slave machine (300 MB out of 6 GB) and the memory on all the other slave machines goes unused. I repartitioned the DataFrame into 3 partitions, but the data still sits on one machine. Because of this, processing takes a long time, since every job is executed on one machine. This is what I am doing:

// Load the Cassandra table sams.events as a DataFrame
val tabledf = _sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "events", "keyspace" -> "sams"))
  .load()
tabledf.registerTempTable("tempdf")
_sqlContext.cacheTable("tempdf")
// `query` is a SQL string defined elsewhere; the result is a DataFrame despite the name
val rdd = _sqlContext.sql(query)
val partitionedRdd = rdd.repartition(3)
val count = partitionedRdd.count.toInt

When I run operations on partitionedRdd, they execute on only one machine, since all the data is present on that machine.
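
One way to check how the rows are actually spread is to count records per partition. The following is a minimal sketch, assuming Spark 1.4 is on the classpath; the object name `PartitionReport` and its two helpers are hypothetical names introduced here, not part of the original code. It reports partitions only, not which executor holds each one; the Spark UI remains the place to see executor placement.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Hypothetical helper object for illustration only.
object PartitionReport {
  // Returns one (partition index, record count) pair per partition.
  def recordsPerPartition[T](rdd: RDD[T]): Array[(Int, Int)] =
    rdd.mapPartitionsWithIndex { (index, rows) =>
      Iterator((index, rows.size))
    }.collect()

  // Prints the partition count and the size of each partition of a DataFrame.
  def report(df: DataFrame): Unit = {
    val rdd = df.rdd
    println(s"partitions: ${rdd.partitions.length}")
    recordsPerPartition(rdd).foreach { case (index, count) =>
      println(s"partition $index -> $count records")
    }
  }
}
```

Calling `PartitionReport.report(rdd)` and `PartitionReport.report(partitionedRdd)` would show whether the query result arrives as a single partition and whether `repartition(3)` changes that.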

UPDATE: I am using --conf spark.cassandra.input.split.size_in_mb=32 in the configuration, but all my data is still loaded into one executor.

UPDATE: I am using Spark version 1.4 and the released Spark Cassandra Connector version 1.4.

Recommended answer

If "Query" only accesses a single C* partition key, you will only get a single task, because we don't have a way (yet) of automatically reading a single Cassandra partition in parallel. If you are accessing multiple C* partitions, then try further shrinking the input split size in MB (spark.cassandra.input.split.size_in_mb).
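
As a rough illustration of why a smaller split size can produce more tasks, here is a minimal sketch, assuming Spark is on the classpath. The object `SplitSizeSketch` and both of its functions are hypothetical names for this example. The division is only an approximation under the assumption that splits are cut at about the configured size; the connector relies on its own size estimates, so real partition counts may differ, and a query restricted to one C* partition key still yields one task.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper object for illustration only.
object SplitSizeSketch {
  // Approximate number of input splits, assuming each split holds about
  // `splitSizeMb` of data. This is NOT the connector's exact logic.
  def estimatedSplits(dataSizeMb: Int, splitSizeMb: Int): Int = {
    require(dataSizeMb >= 0 && splitSizeMb > 0, "sizes must be non-negative / positive")
    math.max(1, math.ceil(dataSizeMb.toDouble / splitSizeMb).toInt)
  }

  // Builds a SparkConf carrying the split-size property used in the question.
  def confWithSplitSize(splitSizeMb: Int): SparkConf =
    new SparkConf(false)
      .set("spark.cassandra.input.split.size_in_mb", splitSizeMb.toString)
}
```

Under that assumption, `SplitSizeSketch.estimatedSplits(300, 32)` and `SplitSizeSketch.estimatedSplits(300, 8)` give a feel for how the split count grows as the setting shrinks. The 300 MB figure is the in-memory size reported in the question, which need not match the size Cassandra estimates.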
