Difference between loading a CSV file into RDD and DataFrame in Spark


Question

I am not sure if this specific question has been asked before. It could be a duplicate, but I was not able to find an existing use case matching this.

As we know, we can load a CSV file directly into a DataFrame, or load it into an RDD first and then convert that RDD into a DataFrame later.

RDD = sc.textFile("pathlocation")

We can apply map, filter, and other operations on this RDD and then convert it into a DataFrame, as in the sketch below.
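As a minimal sketch of that route (the three-column layout, the column names id/name/amount, and an existing SparkSession named spark are illustrative assumptions, not part of the original question):

from pyspark.sql import Row

# Each element of the RDD is one raw line of the CSV file.
rdd = sc.textFile("pathlocation")

# Parse each line by hand, drop malformed rows, and build Row objects.
parsed = (rdd
    .map(lambda line: line.split(","))
    .filter(lambda fields: len(fields) == 3)   # keep only well-formed rows
    .map(lambda f: Row(id=int(f[0]), name=f[1], amount=float(f[2]))))

# Convert the RDD of Rows into a DataFrame.
df = spark.createDataFrame(parsed)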

We can also create a DataFrame by reading a CSV file directly:

Dataframe = spark.read.format("csv").schema(schema).option("header","false").load("pathlocation")
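For completeness, here is a minimal sketch of what the schema object referenced above might look like; the column names and types are hypothetical, since the question does not define them:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Hypothetical schema for a three-column, headerless CSV.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True),
])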

My question is: what are the use cases where we have to load a file using an RDD first and then convert it into a DataFrame?

I just know that textFile reads data line by line. What are the scenarios where we should choose the RDD approach over a DataFrame?

Answer

DataFrames / Datasets offer a huge performance improvement over RDDs because of two powerful features:

  1. Custom memory management (aka Project Tungsten): data is stored off-heap in a binary format, which saves a lot of memory space and avoids garbage-collection overhead. Because the schema of the data is known in advance and stored efficiently in binary form, expensive Java serialization is also avoided.

  2. Optimized execution plans (aka Catalyst Optimizer): query plans for execution are created by the Spark Catalyst optimizer. After the optimized execution plan has gone through several preparation steps, the final execution still happens internally on RDDs, but that is completely hidden from the user. You can inspect such a plan yourself, as the sketch below shows.
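A minimal way to look at what Catalyst produced is DataFrame.explain(); the sketch below assumes the Dataframe variable from the question and filters on a hypothetical id column defined in the schema:

# extended=True prints the parsed, analyzed, and optimized logical plans,
# followed by the physical plan Catalyst finally selects.
Dataframe.filter(Dataframe["id"] > 0).explain(True)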

In general, you should never use RDDs unless you want to handle the low-level optimizations / serialization yourself.

Custom partitioner implementation in PySpark, with RDDs:

import random

def partitionFunc(key):
    # Route two specific keys to their own partition (0), and spread
    # every other key randomly across partitions 1 and 2.
    if key == 17850 or key == 12583:
        return 0
    else:
        return random.randint(1, 2)

# You can call the partitioner as below:
keyedRDD = rdd.keyBy(lambda row: row[6])    # key each row by its 7th field
(keyedRDD
    .partitionBy(3, partitionFunc)          # 3 partitions, custom routing
    .map(lambda x: x[0])                    # keep only the key
    .glom()                                 # gather each partition into a list
    .map(lambda keys: len(set(keys)))       # count distinct keys per partition
    .take(5))

