Repartitioning of dataframe in spark does not work
Problem description
I have a Cassandra database with a large number of records, ~4 million. I have 3 slave machines and one driver. I want to load this data into Spark memory and process it. When I do the following, it reads all the data into one slave machine (300 MB out of 6 GB) while the memory of all the other slave machines is unused. I did a repartition of the dataframe into 3 partitions, but the data still sits on one machine. Because of this it takes a long time to process the data, since every job is executed on a single machine. This is what I am doing:
val tabledf = _sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "events", "keyspace" -> "sams"))
  .load()
tabledf.registerTempTable("tempdf")
_sqlContext.cacheTable("tempdf")
val rdd = _sqlContext.sql(query)
val partitionedRdd = rdd.repartition(3)
val count = partitionedRdd.count.toInt
When I do some operations on partitionedRdd, they are executed on only one machine, since all the data is present on a single machine.
UPDATE: I am using --conf spark.cassandra.input.split.size_in_mb=32 in the configuration, but still all my data is loaded into one executor.
Update: I am using Spark version 1.4 and Spark Cassandra Connector version 1.4.
Recommended answer
If the query only accesses a single C* partition key, you will only get a single task, because we don't have a way (yet) of automatically reading a single Cassandra partition in parallel. If you are accessing multiple C* partitions, then try further shrinking the input split size in MB.
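A minimal sketch of how the split size could be set programmatically with the Spark 1.4 APIs, assuming the query spans multiple Cassandra partition keys (the table and keyspace names are taken from the question; the app name is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumes the Spark Cassandra Connector 1.4 is on the classpath.
val conf = new SparkConf()
  .setAppName("cassandra-repartition-sketch")
  // A smaller split size means more Spark partitions per table scan.
  // This only helps if the query touches multiple C* partition keys.
  .set("spark.cassandra.input.split.size_in_mb", "32")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val tabledf = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "events", "keyspace" -> "sams"))
  .load()

// Check how many partitions the scan actually produced, before any
// repartition call: if this prints 1, the data came from one C* partition.
println(s"partitions after load: ${tabledf.rdd.partitions.length}")
```

If the scan already yields only one partition, repartition(3) will shuffle its rows across executors, but the initial read itself still runs as a single task.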