spark dataframe conversion to rdd takes a long time


Problem Description


I'm reading a JSON file of a social network into Spark. From it I get a DataFrame, which I explode to get follower/account pairs. This process works perfectly. Later I want to convert the result to an RDD (for use with GraphX), but the RDD creation takes a very long time.

val social_network = spark.read.json("my/path") // 200 MB
val exploded_network = social_network
  .withColumn("follower", explode($"followers"))
  .withColumn("id_follower", $"follower".cast("long"))
  .withColumn("id_account", $"account".cast("long"))
  .withColumn("relationship", lit(1))
  .select("id_follower", "id_account", "relationship")
val E1 = exploded_network.as[(VertexId, VertexId, Int)]
val E2 = E1.rdd


To check how the process runs, I count at each step:

scala> exploded_network.count
res0: Long = 18205814 // 3 seconds

scala> E1.count
res1: Long = 18205814 // 3 seconds

scala> E2.count // 5.4 minutes
res2: Long = 18205814


Why does the RDD conversion take 100x longer?

Recommended Answer


In Spark, a DataFrame is a distributed collection of data organized into named columns (a tabular format). It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations. Because of its tabular format, it carries schema metadata, which allows Spark to run a number of optimizations in the background. The DataFrame API uses Spark's advanced optimizations, such as the Tungsten execution engine and the Catalyst optimizer, to process data more efficiently.
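You can see Catalyst and Tungsten at work by asking Spark for the query plan. A minimal, self-contained sketch (the tiny inline data set is a hypothetical stand-in for the question's JSON file):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("plan-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in rows with the same shape as the question's data
val df = Seq(("1", Seq("2", "3")), ("2", Seq("3"))).toDF("account", "followers")

val pairs = df
  .withColumn("follower", explode($"followers"))
  .select($"follower".cast("long").as("id_follower"),
          $"account".cast("long").as("id_account"),
          lit(1).as("relationship"))

// explain(true) prints the parsed, analyzed, optimized (Catalyst) and
// physical plans; the WholeStageCodegen stages come from Tungsten.
pairs.explain(true)
```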


RDDs, by contrast, do not infer the schema of a given data set; the user has to provide one. RDDs also cannot take advantage of Spark's optimizers such as the Catalyst optimizer and the Tungsten execution engine (as mentioned above).
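A minimal sketch of the difference, using the `exploded_network` DataFrame from the question: the same filter runs inside the optimized plan on the DataFrame side, but as an opaque Scala closure over generic `Row` objects on the RDD side.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// DataFrame: Catalyst sees the predicate and can prune/push it down,
// and rows stay in Tungsten's compact binary format.
val viaDf = exploded_network.filter($"relationship" === 1)

// RDD: .rdd deserializes every row into a JVM object; the filter is an
// opaque closure Spark cannot optimize, and the schema metadata is gone.
val viaRdd = exploded_network.rdd.filter(r => r.getAs[Int]("relationship") == 1)
```

That per-row deserialization from Tungsten's internal format into JVM objects is the main cost you are paying in the `E2.count` step.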


So DataFrames perform much better than RDDs. In your case, if you have to use an RDD instead of a DataFrame, I would recommend caching the DataFrame before converting it to an RDD. That should improve the RDD's performance.

val E1 = exploded_network.cache()
val E2 = E1.rdd
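Putting it together for GraphX, a sketch that keeps the typed Dataset from the question so the RDD elements are tuples rather than generic `Row`s (the `defaultValue = 0` vertex attribute is an assumption for illustration):

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import spark.implicits._

// Cache the typed Dataset so the expensive decode happens only once
val cachedEdges = exploded_network.as[(VertexId, VertexId, Int)].cache()

// Map each (src, dst, relationship) tuple into a GraphX Edge
val edgeRdd = cachedEdges.rdd.map { case (src, dst, rel) => Edge(src, dst, rel) }

// defaultValue = 0 is a hypothetical attribute for vertices with no data
val graph = Graph.fromEdges(edgeRdd, defaultValue = 0)
graph.edges.count() // first action materializes the cache
```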

Hope this helps.

