Difference between loading a csv file into RDD and Dataframe in spark


Problem Description

I am not sure if this specific question has been asked before. It could be a duplicate, but I was not able to find a use case matching it.

As we know, we can load a csv file directly into a dataframe, or we can load it into an RDD first and convert that RDD into a dataframe later.

RDD = sc.textFile("pathlocation")

We can apply map, filter, and other operations on this RDD and then convert it into a dataframe, for example:
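A minimal sketch of that route (the delimiter, column count, and column names here are illustrative assumptions, not from the question):

rdd = sc.textFile("pathlocation")

# Split each raw line on commas and drop malformed rows.
parsed = (rdd
          .map(lambda line: line.split(","))
          .filter(lambda fields: len(fields) == 3))

# toDF needs column names; these three are hypothetical.
df = parsed.toDF(["id", "name", "amount"])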

We can also create a dataframe by reading a csv file directly:

Dataframe = spark.read.format("csv").schema(schema).option("header","false").load("pathlocation")

My question is: what are the use cases where we have to load a file using an RDD first and then convert it into a dataframe?

I just know that textFile reads data line by line. In which scenarios would we choose the RDD approach over the dataframe?

Recommended Answer

DataFrames/Datasets offer a huge performance improvement over RDDs because of two powerful features:

  1. Custom Memory Management (aka Project Tungsten): data is stored in off-heap memory in a binary format. This saves a lot of memory space, and there is no garbage-collection overhead involved. Because the schema of the data is known in advance and stored efficiently in binary format, expensive Java serialization is also avoided (see the sketch after this list).

  2. Optimized Execution Plans (aka Catalyst Optimizer): query plans are created for execution using the Spark Catalyst optimizer. After an optimized execution plan has been prepared through several steps, the final execution happens internally on RDDs only, but that is completely hidden from the users.
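To make both points concrete, here is a minimal sketch (the schema fields and the filter value are illustrative assumptions, not from the question): declaring the schema up front lets Tungsten lay the data out in its binary format without inference, and explain() prints the plan Catalyst builds before the final RDD-level execution.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema, declared in advance so Spark never has to infer it.
schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("description", StringType(), True),
])

df = spark.read.format("csv").schema(schema).option("header", "false").load("pathlocation")

# Catalyst builds and optimises the query plan; explain(True) prints
# the parsed, analysed, optimised and physical plans.
df.filter(df.customer_id == 17850).explain(True)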

In general, you should never use RDDs unless you want to handle the low-level optimizations/serializations yourself.

Custom Partitioner implementation in PySpark, with RDDs:

import random

def partitionFunc(key):
    # Send two specific keys to partition 0 and spread the rest
    # randomly across partitions 1 and 2.
    if key == 17850 or key == 12583:
        return 0
    else:
        return random.randint(1, 2)

# You can call the partitioner as below:
keyedRDD = rdd.keyBy(lambda row: row[6])    # key each row by its 7th column
(keyedRDD
    .partitionBy(3, partitionFunc)          # 3 partitions via the custom function
    .map(lambda x: x[0])                    # keep only the keys
    .glom()                                 # gather each partition into a list
    .map(lambda x: len(set(x)))             # distinct keys per partition
    .take(5))
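Key-level control over which partition a record lands in is exactly the kind of low-level knob the DataFrame API does not expose, which is why it is done at the RDD level here.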

