Pyspark: get list of files/directories on HDFS path
Question
As in the title. I'm aware of textFile, but as the name suggests, it works only on text files. I need to access the files/directories inside a path on HDFS (or a local path). I'm using pyspark.
Thanks for your help.
I believe it's helpful to think of Spark as a data processing tool, with a domain that begins at loading the data. It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a built-in facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
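For the local-path side of the question, no Spark or Hadoop tooling is needed at all; Python's standard library can traverse directories and match glob-style patterns. A minimal sketch (the directory layout here is invented for illustration):

```python
import tempfile
from pathlib import Path

# Build a throwaway directory tree so the example is self-contained.
root = Path(tempfile.mkdtemp())
(root / "logs").mkdir()
(root / "logs" / "a.txt").write_text("alpha")
(root / "logs" / "b.csv").write_text("beta")

# List everything directly under a path (files and directories alike).
entries = sorted(p.name for p in root.iterdir())
print(entries)  # ['logs']

# Recursively match files against a glob pattern, similar in spirit
# to the glob expressions Spark accepts in its input paths.
txt_files = sorted(p.name for p in root.rglob("*.txt"))
print(txt_files)  # ['a.txt']
```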
There are a few available tools to do what you want, including esutil and hdfs. The hdfs lib supports both a CLI and an API; you can jump straight to 'how do I list HDFS files in Python' right here. It looks like this:
from hdfs import Config
# Reads connection settings for the 'dev' alias from the HdfsCLI
# config file (~/.hdfscli.cfg by default).
client = Config().get_client('dev')
files = client.list('the_dir_path')  # list of entry names under that path
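If you'd rather not add a library dependency, another option is to shell out to the Hadoop CLI (`hdfs dfs -ls`) and parse its output. A sketch, with the actual subprocess call commented out since it requires a Hadoop client on PATH, and the sample output below invented for illustration:

```python
import subprocess  # needed only for the real call, shown commented out below

def parse_hdfs_ls(output):
    """Extract paths from `hdfs dfs -ls` output.

    Each entry line ends with the path; the leading
    'Found N items' header line is skipped.
    """
    paths = []
    for line in output.splitlines():
        if not line or line.startswith("Found"):
            continue
        paths.append(line.rsplit(None, 1)[-1])
    return paths

# Real usage would look like this (requires a configured Hadoop client):
# out = subprocess.run(["hdfs", "dfs", "-ls", "/the_dir_path"],
#                      capture_output=True, text=True, check=True).stdout
# files = parse_hdfs_ls(out)

# Demo on invented sample output:
sample = """Found 2 items
-rw-r--r--   3 user group       1024 2020-01-01 00:00 /the_dir_path/part-00000
drwxr-xr-x   - user group          0 2020-01-01 00:00 /the_dir_path/sub"""
print(parse_hdfs_ls(sample))  # ['/the_dir_path/part-00000', '/the_dir_path/sub']
```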