spark dataframe conversion to rdd takes a long time


Problem description


I'm reading a JSON file of a social network into Spark. From this I get a DataFrame, which I explode to get pairs. This process works perfectly. Later I want to convert the result to an RDD (for use with GraphX), but the RDD creation takes a very long time.

import org.apache.spark.graphx.VertexId
import org.apache.spark.sql.functions.{explode, lit}
import spark.implicits._

val social_network = spark.read.json(my/path) // 200MB
val exploded_network = social_network.
    withColumn("follower", explode($"followers")).
    withColumn("id_follower", ($"follower").cast("long")).
    withColumn("id_account", ($"account").cast("long")).
    withColumn("relationship", lit(1)).
    select("id_follower", "id_account", "relationship")
val E1 = exploded_network.as[(VertexId, VertexId, Int)]
val E2 = E1.rdd

To check how the process runs, I count the rows at each step:

scala> exploded_network.count
res0: Long = 18205814 // 3 seconds

scala> E1.count
res1: Long = 18205814 // 3 seconds

scala> E2.count // 5.4 minutes
res2: Long = 18205814

Why is the RDD conversion taking 100x longer?

Solution

In Spark, a DataFrame is a distributed collection of data organized into named columns (a tabular format). It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations. Because of its tabular format it carries metadata, which allows Spark to run a number of optimizations in the background. The DataFrame API uses Spark's advanced optimizations, such as the Tungsten execution engine and the Catalyst optimizer, to process data more efficiently.
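As a quick way to see these optimizers at work, you can print a DataFrame's query plans with explain (a minimal sketch, reusing the exploded_network DataFrame from the question):

// Prints the parsed, analyzed, optimized, and physical plans that Catalyst
// produces; the physical plan shows the operators (with whole-stage code
// generation via Tungsten) that Spark will actually execute.
exploded_network.explain(true)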

RDDs, on the other hand, do not infer the schema of a given data set and require the user to provide one. RDDs also cannot take advantage of Spark's optimizers such as the Catalyst optimizer and the Tungsten execution engine (mentioned above).
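For example, going from an RDD of Rows back to a DataFrame requires you to spell out the schema yourself (a minimal sketch; rowRdd is a hypothetical RDD[Row] with the same three columns as in the question):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}

// Nothing is inferred for an RDD: the schema must be provided explicitly.
val schema = StructType(Seq(
  StructField("id_follower", LongType),
  StructField("id_account", LongType),
  StructField("relationship", IntegerType)
))
val df = spark.createDataFrame(rowRdd, schema) // rowRdd: RDD[Row]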

So DataFrames have much better performance than RDDs. In your case, if you have to use an RDD instead of a DataFrame, I would recommend caching the DataFrame before converting it to an RDD. That should improve the performance of the RDD.

val E1 = exploded_network.cache() // mark the DataFrame for caching (materialized by the next action)
val E2 = E1.rdd
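A possible follow-up (a sketch, not part of the original answer): since cache() is lazy, you can run one action to materialize the cache, then build the GraphX graph from the cached data. Edge and Graph.fromEdges are standard GraphX APIs; the default vertex attribute 0 is a placeholder chosen for this sketch.

import org.apache.spark.graphx.{Edge, Graph}

E1.count() // an action materializes the cache, so the conversion below reads from memory

// Build GraphX edges from the (id_follower, id_account, relationship) tuples.
val edges = E2.map { case (src, dst, rel) => Edge(src, dst, rel) }
val graph = Graph.fromEdges(edges, defaultValue = 0)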

Hope this helps.

