Get a list of file names from HDFS using Python


Problem description



Hadoop noob here.

I've searched for some tutorials on getting started with Hadoop and Python without much success. I do not need to do any work with mappers and reducers yet; it's more of an access issue.

As part of the Hadoop cluster, there are a bunch of .dat files on HDFS.

In order to access those files on my client (local computer) using Python, what do I need to have on my computer?

How do I query for filenames on HDFS?

Any links would be helpful too.

Solution

You should have login access to a node in the cluster. Let the cluster administrator pick the node, set up the account, and tell you how to access the node securely. If you are the administrator, let me know whether the cluster is local or remote; if remote, whether it is hosted on your computer, inside a corporation, or on a third-party cloud (and if so, whose), and then I can provide more relevant information.

To query file names in HDFS, login to a cluster node and run hadoop fs -ls [path]. Path is optional and if not provided, the files in your home directory are listed. If -R is provided as an option, then it lists all the files in path recursively. There are additional options for this command. For more information about this and other Hadoop file system shell commands see http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html.
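As a concrete illustration (not part of the original answer), here is a minimal Python sketch that shells out to the same hadoop fs -ls command and pulls out the path column. It assumes the hadoop client is installed and on the PATH of the machine where it runs (for example, a cluster node you can log in to); the example path is hypothetical.

    import subprocess

    def hdfs_ls(path=None):
        """Return the entry paths printed by 'hadoop fs -ls [path]'."""
        cmd = ['hadoop', 'fs', '-ls']
        if path:
            cmd.append(path)
        out = subprocess.check_output(cmd).decode('utf-8')
        # The output starts with a "Found N items" header; each entry line
        # starts with a permission string (e.g. 'drwxr-xr-x' or '-rw-r--r--'),
        # and its last whitespace-separated field is the full HDFS path.
        return [line.rsplit(None, 1)[-1]
                for line in out.splitlines()
                if line.startswith(('d', '-'))]

    # Hypothetical usage:
    # print(hdfs_ls('/user/myname'))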

An easy way to query HDFS file names in Python is to use esutil.hdfs.ls(hdfs_url='', recurse=False, full=False), which executes hadoop fs -ls hdfs_url in a subprocess, plus it has functions for a number of other Hadoop file system shell commands (see the source at http://code.google.com/p/esutil/source/browse/trunk/esutil/hdfs.py). esutil can be installed with pip install esutil. It is on PyPI at https://pypi.python.org/pypi/esutil, documentation for it is at http://code.google.com/p/esutil/ and its GitHub site is https://github.com/esheldon/esutil.
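A short usage sketch of the esutil call described above; the path shown is hypothetical, and the return value is whatever the underlying hadoop fs -ls invocation yields, as wrapped by esutil.

    import esutil

    # Files in your HDFS home directory (equivalent to 'hadoop fs -ls')
    print(esutil.hdfs.ls())

    # Recursive listing under a hypothetical path
    print(esutil.hdfs.ls('/user/myname/data', recurse=True))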
