How to open a file which is stored in HDFS in pySpark using with open

Problem description

How do I open a file stored in HDFS? Here the input file is from HDFS. If I give the file as below, I won't be able to open it, and it will show as file not found.

from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

def getMovieName():
    movieNames = {}
    # NOTE: open() reads from the local filesystem, so this HDFS path
    # raises "file not found"
    with open("/user/sachinkerala6174/inData/movieStat") as f:
        for line in f:
            fields = line.split("|")
            mID = fields[0]
            mName = fields[1]
            movieNames[int(mID)] = mName
    return movieNames

nameDict = sc.broadcast(getMovieName())

My assumption was to use something like:

with open(sc.textFile("/user/sachinkerala6174/inData/movieStat")) as f:

But that didn't work either.

Solution

To read the text file into an RDD:

rdd_name = sc.textFile("/user/sachinkerala6174/inData/movieStat")

You can use collect() to work with it in pure Python (not recommended; use only on very small data), or manipulate it with Spark RDD methods via the pyspark API (the recommended way).
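
For example, the question's getMovieName() logic can be rewritten with RDD methods. A minimal sketch, assuming the same pipe-delimited movieID|movieName layout as the question's file:

# Read the HDFS file as an RDD of lines
lines = sc.textFile("/user/sachinkerala6174/inData/movieStat")

# Parse each line into a (movieID, movieName) pair
pairs = lines.map(lambda line: line.split("|")) \
             .map(lambda fields: (int(fields[0]), fields[1]))

# collectAsMap() returns the pairs to the driver as a plain dict;
# like collect(), it is only suitable for small data
nameDict = sc.broadcast(pairs.collectAsMap())

collectAsMap() pulls the whole RDD into driver memory, so, as with collect(), it only fits files small enough to hold on one machine; for large data, keep the pairs as an RDD and join against it instead.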

More info from the pyspark API documentation:

textFile(name, minPartitions=None, use_unicode=True)

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)

>>> path = os.path.join(tempdir, "sample-text.txt")
>>> with open(path, "w") as testFile:
...    _ = testFile.write("Hello world!")
>>> textFile = sc.textFile(path)
>>> textFile.collect()
[u'Hello world!']
