Difference between sc.textFile and spark.read.text in Spark
Question
I am trying to read a simple text file into a Spark RDD, and I see that there are two ways of doing so:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

textRDD1 = sc.textFile("hobbit.txt")           # RDD API
textRDD2 = spark.read.text("hobbit.txt").rdd   # DataFrame API, then convert to RDD
Then I look into the data and see that the two RDDs are structured differently:
textRDD1.take(5)
['The king beneath the mountain',
'The king of carven stone',
'The lord of silver fountain',
'Shall come unto his own',
'His throne shall be upholden']
textRDD2.take(5)
[Row(value='The king beneath the mountain'),
Row(value='The king of carven stone'),
Row(value='The lord of silver fountain'),
Row(value='Shall come unto his own'),
Row(value='His throne shall be upholden')]
Based on this, all subsequent processing has to be changed to account for the presence of the Row wrapper and its 'value' field.
My questions are:
- (a) What are the implications of these two ways of reading a text file?
- (b) Under what circumstances should we use which method?
Answer
To answer (a):

sc.textFile(...) returns an RDD[String]
textFile(String path, int minPartitions)
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
spark.read.text(...) returns a Dataset[Row], i.e. a DataFrame
text(String path)
Loads text files and returns a DataFrame whose schema starts with a string column named "value", followed by partition columns if there are any.
For (b), it really depends on your use case. Since you are trying to create an RDD here, you should go with sc.textFile. You can always convert a DataFrame to an RDD and vice versa.