List all files in HDFS Python without pydoop

Problem description

I have a Hadoop cluster running on CentOS 6.5. I am currently using Python 2.6, and for unrelated reasons I can't upgrade to Python 2.7. Due to this unfortunate fact I cannot install pydoop. Inside the Hadoop cluster I have a large number of raw data files named raw"yearmonthdaytimehour".txt, where everything in the quoted part is a number. Is there a way to build a list of all the files in a Hadoop directory from Python? The program would then create a list that looks something like this:

listoffiles=['raw160317220001.txt', 'raw160317230001.txt', ....] 

This would make everything I need to do a lot easier, since to get the file from day 2, hour 15 I would just need to call dothing(listoffiles[39]). There are unrelated complications as to why I have to do it this way.

I know there is a way to do this easily with local directories, but Hadoop makes everything a little more complicated.
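For comparison, the local-directory version is a short standard-library snippet; the directory path below is a placeholder, not something from the question:

import os

# Placeholder local path -- point this at wherever the raw*.txt files live.
listoffiles = sorted(f for f in os.listdir('/data/raw')
                     if f.startswith('raw') and f.endswith('.txt'))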

Recommended answer

I would recommend this Python project: https://github.com/mtth/hdfs. It uses HttpFS and it's actually quite simple and fast. I've been using it on my cluster with Kerberos and it works like a charm. You just need to set the namenode or HttpFS service URL.
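A minimal sketch of how listing a directory could look with that library; the namenode URL, user, and HDFS path below are placeholders, not values from the question:

from hdfs import InsecureClient

# Connect to the WebHDFS/HttpFS endpoint of the namenode
# (placeholder host, port, and user -- substitute your own).
client = InsecureClient('http://namenode.example.com:50070', user='hadoop')

# list() returns the names of the files and directories directly
# under the given HDFS path, which can then be sorted and indexed.
listoffiles = sorted(client.list('/path/to/raw/data'))
print(listoffiles)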
