Pyspark: get list of files/directories on HDFS path

Problem description

As in the title. I'm aware of textFile, but as the name suggests, it works only on text files. I need to access the files/directories inside a path on HDFS (or a local path). I'm using pyspark.

Thanks for the help.

Solution

I believe it's helpful to think of Spark only as a data processing tool, with a domain that begins at loading the data. It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a built-in facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
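
For instance, here is a minimal sketch of reading several HDFS directories at once with a glob pattern; the SparkSession setup and the directory layout below are assumptions for illustration only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hadoop glob syntax in the load path; the directories here are hypothetical.
rdd = spark.sparkContext.textFile('hdfs:///data/logs/2016-*/part-*')
print(rdd.count())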

There are a few available tools to do what you want, including esutil and hdfs. The hdfs lib supports both a CLI and an API; you can jump straight to the 'how do I list HDFS files in Python' section of its documentation. It looks like this:

from hdfs import Config
# Load the client configured under the 'dev' alias (e.g. in ~/.hdfscli.cfg)
client = Config().get_client('dev')
# Returns the names of the files and directories directly under the given path
files = client.list('the_dir_path')
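
If the goal is then to hand those files back to Spark, one possible follow-up (the 'dev' alias, the directory name, and the hdfs:// prefix are placeholders, as above) is to build full paths from the listing and pass them to the reader, since textFile accepts a comma-separated list of paths:

from hdfs import Config
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
client = Config().get_client('dev')

# 'the_dir_path' is a placeholder for the directory you want to scan.
base = '/the_dir_path'
paths = ['hdfs://{0}/{1}'.format(base, name) for name in client.list(base)]

# textFile accepts a comma-separated list of paths.
rdd = spark.sparkContext.textFile(','.join(paths))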
