Pyspark: get list of files/directories on HDFS path


Question

As per the title. I'm aware of textFile but, as the name suggests, it only works on text files. I need to access the files/directories inside a path on either HDFS or a local filesystem. I'm using pyspark.

Answer

I believe it's helpful to think of Spark only as a data-processing tool, with a domain that begins at loading the data. It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a built-in facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
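
For instance, a minimal sketch of what glob reading looks like from pyspark; the SparkSession setup and the example paths below are illustrative, not part of the original answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glob-example").getOrCreate()

# A Hadoop glob pattern pulls data from several HDFS paths in a single call;
# the directory layout here is hypothetical.
logs = spark.read.text("hdfs:///data/logs/2023-*/part-*")

# The same glob syntax, including {a,b} alternation, works with the RDD API.
lines = spark.sparkContext.textFile("hdfs:///data/logs/{2023-01,2023-02}/*")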

There are a few available tools to do what you want, including esutil and hdfs. The hdfs library supports both a CLI and an API; you can jump straight to 'how do I list HDFS files in Python' right here. It looks like this:

from hdfs import Config
# Load the HdfsCLI configuration (by default ~/.hdfscli.cfg) and get the client registered under the 'dev' alias.
client = Config().get_client('dev')
# List the names of the files and directories under the given HDFS directory.
files = client.list('the_dir_path')
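
Once you have the listing, you can hand the paths back to Spark. A minimal sketch, assuming an existing SparkSession named spark, plain-text files, and the files list from the snippet above ('the_dir_path' is the same placeholder as before):

# Build fully qualified paths from the listing and read them all with Spark.
full_paths = ['hdfs:///the_dir_path/' + name for name in files]
df = spark.read.text(full_paths)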
