Get a list of subdirectories
Problem description
I know I can do this:
data = sc.textFile('/hadoop_foo/a')
data.count()
240
data = sc.textFile('/hadoop_foo/*')
data.count()
168129
However, I would like to count the size of the data of every subdirectory of "/hadoop_foo/". Can I do that?
In other words, what I want is something like this:
subdirectories = magicFunction()
for subdir in subdirectories:
    data = sc.textFile(subdir)
    data.count()
I tried with:
In [9]: [x[0] for x in os.walk("/hadoop_foo/")]
Out[9]: []
but I think that fails, because it searches the local directory of the driver (the gateway in this case), while "/hadoop_foo/" lies in HDFS. The same happens with "hdfs:///hadoop_foo/".
After reading How can I list subdirectories recursively for HDFS?, I am wondering if there is a way to execute:
hadoop dfs -lsr /hadoop_foo/
in code.
From Correct way of writing two floats into a regular txt:
In [28]: os.getcwd()
Out[28]: '/homes/gsamaras' <-- which is my local directory
Recommended answer

With Python, use the hdfs module; its walk() method can get you the list of files.
The code should look something like this:
from hdfs import InsecureClient

# 'http://host:port', 'user' and hdfs_dir are placeholders for your cluster
client = InsecureClient('http://host:port', user='user')
# walk() yields (path, dirnames, filenames) tuples, much like os.walk();
# the trailing True also returns each entry's FileStatus
for stuff in client.walk(hdfs_dir, 0, True):
    ...
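As a minimal sketch of how that answers the original question, assuming a live SparkContext sc and a hypothetical WebHDFS endpoint namenode:50070 (adjust both to your cluster):

import posixpath
from hdfs import InsecureClient

client = InsecureClient('http://namenode:50070', user='gsamaras')  # hypothetical endpoint/user
# With depth=1 the first tuple of the walk is the root itself,
# together with the names of its immediate subdirectories
root, subdirs, _ = next(client.walk('/hadoop_foo/', depth=1))

for subdir in subdirs:
    data = sc.textFile(posixpath.join(root, subdir))
    print(subdir, data.count())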
With Scala you can get the filesystem (val fs = FileSystem.get(new Configuration())) and run listFiles(org.apache.hadoop.fs.Path, boolean): https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/fs/FileSystem.html#listFiles(org.apache.hadoop.fs.Path, boolean)
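Since the rest of the thread is in Python, here is a rough counterpart of that Scala route through PySpark's JVM gateway; note that sc._jvm and sc._jsc are internal attributes, so treat this as a sketch rather than a stable API:

# Reuse Spark's own Hadoop classes and configuration via Py4J
Path = sc._jvm.org.apache.hadoop.fs.Path
FileSystem = sc._jvm.org.apache.hadoop.fs.FileSystem

fs = FileSystem.get(sc._jsc.hadoopConfiguration())
# listStatus() returns an array of FileStatus; keep only the directories
subdirectories = [s.getPath().toString()
                  for s in fs.listStatus(Path('/hadoop_foo/'))
                  if s.isDirectory()]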
You can also execute a shell command from your script with the subprocess module, but this is never a recommended approach, since you depend on the text output of a shell utility.
Eventually, what worked for the OP was using subprocess.check_output():
import subprocess

subdirectories = subprocess.check_output(["hadoop", "fs", "-ls", "/hadoop_foo/"])
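Note that check_output() returns the raw bytes of the listing rather than a list of paths. A hedged sketch of parsing it, assuming the usual hadoop fs -ls column layout:

import subprocess

out = subprocess.check_output(["hadoop", "fs", "-ls", "/hadoop_foo/"])
# Listing lines look like: drwxr-xr-x - user group 0 2016-01-01 00:00 /hadoop_foo/a
# Directories start with 'd'; the path is the last whitespace-separated field
subdirectories = [line.split()[-1]
                  for line in out.decode().splitlines()
                  if line.startswith('d')]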