Python read file as stream from HDFS


Problem description



Here is my problem: I have a file in HDFS which can potentially be huge (i.e., too big to fit entirely in memory).

What I would like to do is avoid having to cache this file in memory, and only process it line by line like I would do with a regular file:

for line in open("myfile", "r"):
    # do some processing

I am looking to see if there is an easy way to get this done right without using external libraries. I can probably make it work with libpyhdfs or python-hdfs, but I'd like, if possible, to avoid introducing new dependencies and untested libraries into the system, especially since neither seems heavily maintained and both state that they shouldn't be used in production.

I was thinking of doing this with the standard "hadoop" command-line tools and the Python subprocess module, but I can't seem to do what I need, since there is no command-line tool that would do my processing, and I would like to execute a Python function for every line in a streaming fashion.

Is there a way to use a Python function as the right-hand operand of a pipe with the subprocess module? Or, even better, open it like a file, as a generator, so I could process each line easily?

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)

If there is another way to achieve what I described above without using an external library, I'm also pretty open.

Thanks for any help!

Solution

You want xreadlines; it reads lines from a file without loading the whole file into memory. (xreadlines is Python 2 only; in Python 3, iterating over the file object directly does the same thing.)
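
For illustration, a minimal sketch of the equivalent lazy iteration on a local file in modern Python (the print call is just a stand-in for your per-line processing):

# Lazy iteration: the file object yields one line at a time,
# so the whole file never has to be held in memory.
with open("myfile", "r") as f:
    for line in f:
        print(line, end="")  # stand-in for per-line processing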

Edit:

Now I see your question: you just need to get the stdout pipe from your Popen object:

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
for line in cat.stdout:  # iterates the pipe lazily, one line at a time
    print(line)
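
To get the generator interface asked about above, the pipe can be wrapped in a small helper. A minimal sketch, assuming Python 3 and that the hadoop binary is on the PATH (the helper name hdfs_lines is made up for illustration):

import subprocess

def hdfs_lines(path):
    """Yield the lines of an HDFS file one at a time via `hadoop fs -cat`."""
    cat = subprocess.Popen(["hadoop", "fs", "-cat", path],
                           stdout=subprocess.PIPE)
    try:
        for line in cat.stdout:
            # In Python 3 the pipe yields bytes; decode each line.
            yield line.decode("utf-8")
    finally:
        cat.stdout.close()
        cat.wait()  # reap the child process

for line in hdfs_lines("/path/to/myfile"):
    print(line, end="")

The finally block makes sure the subprocess is cleaned up even if the consumer stops iterating early.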
