List all files in HDFS Python without pydoop
Question
I have a Hadoop cluster running on CentOS 6.5. I am currently using Python 2.6 and, for unrelated reasons, cannot upgrade to Python 2.7. Due to this unfortunate fact I cannot install pydoop. Inside the Hadoop cluster I have a large number of raw data files named raw"yearmonthdaytimehour".txt, where everything in quotes is a number. Is there a way to build a list of all the files in a Hadoop directory from Python? The program would then create a list that looks something like:
listoffiles=['raw160317220001.txt', 'raw160317230001.txt', ....]
This would make everything I need to do a lot easier, since to get the file for day 2, hour 15 I would just need to call dothing(listoffiles[39]). There are unrelated complications as to why I have to do it this way.
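The day/hour-to-index arithmetic implied above can be sketched as a small helper. The function name and the assumptions (24 gap-free hourly files per day, with day 1 hour 0 as the first list entry) are mine, not stated in the question:

```python
def raw_index(day, hour, hours_per_day=24):
    """Index into a chronologically sorted file list for a given
    day and hour, assuming day 1 hour 0 is the first file and the
    hourly files have no gaps."""
    return (day - 1) * hours_per_day + hour
```

Under those assumptions, raw_index(2, 15) gives 39, matching the listoffiles[39] call in the question.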
I know there is a way to do this easily with local directories, but Hadoop makes everything a little more complicated.
Recommended answer
I would recommend this Python project: https://github.com/mtth/hdfs It uses HttpFS and is actually quite simple and fast. I have been using it on a Kerberized cluster and it works like a charm. You just need to set the NameNode or HttpFS service URL.
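A minimal sketch of how that client could be used for the asker's case, assuming the `hdfs` package is installed and WebHDFS/HttpFS is enabled on the cluster; the URL, user name, directory path, and helper names below are placeholders of mine, not values from the answer:

```python
import fnmatch

def raw_files(names):
    """Keep only the raw*.txt data files and sort them; the
    yymmddhhmm-style names sort chronologically as plain strings."""
    return sorted(n for n in names if fnmatch.fnmatch(n, 'raw*.txt'))

def list_raw_files(url, hdfs_dir, user='hadoop'):
    # InsecureClient.list() returns the file names in an HDFS
    # directory over WebHDFS/HttpFS (hdfs package from mtth/hdfs).
    from hdfs import InsecureClient
    client = InsecureClient(url, user=user)
    return raw_files(client.list(hdfs_dir))

# e.g. listoffiles = list_raw_files('http://namenode:50070', '/data/raw')
```

Because the timestamps embedded in the names are fixed-width, a plain lexicographic sort puts the files in chronological order, so the listoffiles[39] indexing from the question works directly on the returned list.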