Get a list of subdirectories


Problem description




I know I can do this:

data = sc.textFile('/hadoop_foo/a')
data.count()
240
data = sc.textFile('/hadoop_foo/*')
data.count()
168129

However, I would like to count the size of the data of every subdirectory of "/hadoop_foo/". Can I do that?

In other words, what I want is something like this:

subdirectories = magicFunction()
for subdir in subdirectories:
  data = sc.textFile(subdir)
  data.count()


I tried with:

In [9]: [x[0] for x in os.walk("/hadoop_foo/")]
Out[9]: []

but I think that fails, because it searches the local directory of the driver (the gateway in this case), while "/hadoop_foo/" lies in HDFS. The same happens for "hdfs:///hadoop_foo/".


After reading "How can I list subdirectories recursively for HDFS?", I am wondering if there is a way to execute:

hadoop dfs -lsr /hadoop_foo/

in code.


From "Correct way of writing two floats into a regular txt":

In [28]: os.getcwd()
Out[28]: '/homes/gsamaras'  <-- which is my local directory

Solution

With Python, use the hdfs module; its walk() method can get you a list of files.

The code should look something like this:

from hdfs import InsecureClient

client = InsecureClient('http://host:port', user='user')
for stuff in client.walk(dir, 0, True):
    ...
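
To connect this back to the question, here is a rough sketch (not part of the original answer) that feeds the subdirectory names returned by walk() into sc.textFile().count(). The WebHDFS endpoint, the user name and the pre-existing SparkContext sc are placeholders/assumptions:

from hdfs import InsecureClient

# Hypothetical WebHDFS endpoint and user -- adjust to your cluster.
client = InsecureClient('http://namenode_host:50070', user='user')

# walk() yields (path, dirnames, filenames) tuples, like os.walk();
# the first tuple describes '/hadoop_foo/' itself, so its dirnames
# are the immediate subdirectories we are after.
root, dirnames, _ = next(client.walk('/hadoop_foo/'))
counts = {}
for name in dirnames:
    subdir = root.rstrip('/') + '/' + name
    counts[subdir] = sc.textFile(subdir).count()  # sc: an existing SparkContext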

With Scala you can get the filesystem (val fs = FileSystem.get(new Configuration())) and run its listFiles() method: https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/fs/FileSystem.html#listFiles(org.apache.hadoop.fs.Path, boolean)
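
Since the question uses PySpark, the same FileSystem API can also be reached from Python through Spark's JVM gateway. A minimal sketch, assuming an existing SparkContext sc; note that sc._jvm and sc._jsc are internal PySpark handles, not public API:

# Sketch only: call the Hadoop FileSystem API through PySpark's py4j gateway.
hadoop = sc._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(sc._jsc.hadoopConfiguration())

# listStatus() returns one FileStatus per entry directly under the path.
statuses = fs.listStatus(hadoop.Path('/hadoop_foo/'))
subdirectories = [s.getPath().toString() for s in statuses if s.isDirectory()]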

You can also execute a shell command from your script with the subprocess module, but this is not a recommended approach, since you would depend on the text output of a shell utility.


Eventually, what worked for the OP was using subprocess.check_output():

subdirectories = subprocess.check_output(["hadoop","fs","-ls", "/hadoop_foo/"])
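
For completeness, a sketch (an assumption built on the OP's one-liner, not something the OP posted) of how that output could be parsed into per-subdirectory counts. It relies on the usual hadoop fs -ls layout, where directory lines start with 'd' and the path is the last whitespace-separated field:

import subprocess

# 'hadoop fs -ls' prints a "Found N items" header plus one line per entry;
# directories have a permission string starting with 'd' and the path is
# the last whitespace-separated field on the line.
out = subprocess.check_output(["hadoop", "fs", "-ls", "/hadoop_foo/"])
subdirs = [line.split()[-1]
           for line in out.decode().splitlines()
           if line.startswith("d")]

counts = {subdir: sc.textFile(subdir).count() for subdir in subdirs}  # sc: an existing SparkContext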
