Python read file as stream from HDFS


Problem Description

Here is my problem: I have a file in HDFS which can potentially be huge (i.e. too large to fit entirely in memory).

What I would like to do is avoid having to cache this file in memory, and only process it line by line as I would with a regular file:

for line in open("myfile", "r"):
    # do some processing

I am looking to see if there is an easy way to get this done without using external libraries. I could probably make it work with libpyhdfs or python-hdfs, but if possible I'd like to avoid introducing new dependencies and untested libraries into the system, especially since neither of these seems heavily maintained and both state that they shouldn't be used in production.

I was thinking of doing this with the standard "hadoop" command-line tools via the Python subprocess module, but I can't seem to do what I need, since there is no command-line tool that would carry out my processing, and I would like to execute a Python function for every line, in a streaming fashion.

Is there a way to apply a Python function as the right operand of a pipe using the subprocess module? Or, even better, open it like a file, as a generator, so I could process each line easily?

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)

If there is another way to achieve what I described above without using an external library, I'm also pretty open to it.

Thanks for your help!

Recommended Answer

You want xreadlines; it reads lines from a file without loading the whole file into memory. (Note that xreadlines is Python 2 only; in Python 3, iterating over the file object directly does the same thing.)
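
For illustration, a minimal sketch of that lazy iteration on a local file (assuming a local "myfile" exists; the print call stands in for your own per-line processing):

# Iterating the file object yields one line at a time,
# so the whole file never has to fit in memory.
with open("myfile", "r") as f:
    for line in f:
        print(line.rstrip("\n"))  # stand-in for your per-line processing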

Edit:

Now I see your question; you just need to get the stdout pipe from your Popen object:

import subprocess

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
for line in cat.stdout:  # the pipe is iterated lazily, one line at a time
    print(line)
cat.wait()  # reap the child process once the stream is exhausted
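
And if you want the generator-style interface you asked about, a small wrapper over the same pipe works. This is just a sketch under the same assumption (the hadoop CLI is on your PATH); hdfs_lines is a hypothetical helper name, not part of any library:

import subprocess

def hdfs_lines(path):
    # Yield the lines of an HDFS file one at a time via `hadoop fs -cat`,
    # without ever holding the whole file in memory.
    cat = subprocess.Popen(["hadoop", "fs", "-cat", path],
                           stdout=subprocess.PIPE)
    try:
        for line in cat.stdout:
            yield line
    finally:
        cat.stdout.close()
        cat.wait()  # always reap the child, even if iteration stops early

for line in hdfs_lines("/path/to/myfile"):
    print(line)  # do some processing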
