How to open a file which is stored in HDFS in pySpark using with open
Problem Description
How to open a file which is stored in HDFS? Here the input file is from HDFS. If I give the file as below, I won't be able to open it; it will show as file not found:
from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

def getMovieName():
    movieNames = {}
    with open("/user/sachinkerala6174/inData/movieStat") as f:
        for line in f:
            fields = line.split("|")
            mID = fields[0]
            mName = fields[1]
            movieNames[int(fields[0])] = fields[1]
    return movieNames

nameDict = sc.broadcast(getMovieName())
My assumption was to use something like

with open(sc.textFile("/user/sachinkerala6174/inData/movieStat")) as f:

But that also didn't work.
To read the text file into an RDD:
rdd_name = sc.textFile("/user/sachinkerala6174/inData/movieStat")
You can then use collect() in order to work with it in pure Python (not recommended; use it only on very small data), or manipulate it with Spark RDD methods via the pyspark API (the recommended way).
More info in the pyspark API docs:
textFile(name, minPartitions=None, use_unicode=True)
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)
>>> path = os.path.join(tempdir, "sample-text.txt")
>>> with open(path, "w") as testFile:
...     _ = testFile.write("Hello world!")
>>> textFile = sc.textFile(path)
>>> textFile.collect()
[u'Hello world!']