Python read file as stream from HDFS


Problem Description

Here is my problem: I have a file in HDFS which can potentially be huge (i.e. too large to fit entirely in memory).

What I would like to do is avoid having to cache this file in memory, and only process it line by line as I would with a regular file:

for line in open("myfile", "r"):
    # do some processing

I am looking to see if there is an easy way to get this done without using external libraries. I could probably make it work with libpyhdfs or python-hdfs, but if possible I'd like to avoid introducing new dependencies and untested libraries into the system, especially since neither of these seems heavily maintained and both state that they shouldn't be used in production.

I was thinking of doing this with the standard "hadoop" command-line tools via the Python subprocess module, but I can't seem to do what I need, since there is no command-line tool that would carry out my processing, and I would like to execute a Python function for every line, in a streaming fashion.

Is there a way to apply a Python function as the right operand of a pipe using the subprocess module? Or, even better, open it like a file, as a generator, so I could process each line easily?

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)

If there is another way to achieve what I described above without using an external library, I'm also pretty open to it.

Thanks for your help!

Recommended Answer

You want xreadlines; it reads lines from a file without loading the whole file into memory. (Note that xreadlines is Python 2 only; in Python 3, iterating over the file object directly does the same thing.)
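
For illustration, a minimal sketch of that lazy iteration on a local file (assuming a local "myfile" exists; the print call stands in for your own per-line processing):

# Iterating the file object yields one line at a time,
# so the whole file never has to fit in memory.
with open("myfile", "r") as f:
    for line in f:
        print(line.rstrip("\n"))  # stand-in for your per-line processing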

Edit:

Now I see your question; you just need to get the stdout pipe from your Popen object:

import subprocess

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
for line in cat.stdout:  # the pipe is iterated lazily, one line at a time
    print(line)
cat.wait()  # reap the child process once the stream is exhausted
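
And if you want the generator-style interface you asked about, a small wrapper over the same pipe works. This is just a sketch under the same assumption (the hadoop CLI is on your PATH); hdfs_lines is a hypothetical helper name, not part of any library:

import subprocess

def hdfs_lines(path):
    # Yield the lines of an HDFS file one at a time via `hadoop fs -cat`,
    # without ever holding the whole file in memory.
    cat = subprocess.Popen(["hadoop", "fs", "-cat", path],
                           stdout=subprocess.PIPE)
    try:
        for line in cat.stdout:
            yield line
    finally:
        cat.stdout.close()
        cat.wait()  # always reap the child, even if iteration stops early

for line in hdfs_lines("/path/to/myfile"):
    print(line)  # do some processing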
