How to enumerate files in HDFS directory


Problem description

How do I enumerate the files in an HDFS directory? This is for enumerating files in an Apache Spark cluster using Scala. I see there is the sc.textFile() option, but that reads the contents as well. I want to read only the file names.

I actually tried listStatus, but it didn't work; I get the error below. I am using Azure HDInsight Spark, and the blob store folder "testContainer@testhdi.blob.core.windows.net/example/" contains .json files.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path("wasb://testContainer@testhdi.blob.core.windows.net/example/"))
status.foreach(x => println(x.getPath))

=========
Error:
========
java.io.FileNotFoundException: File wasb://testContainer@testhdi.blob.core.windows.net/example does not exist.
    at org.apache.hadoop.fs.azure.NativeAzureFileSystem.listStatus(NativeAzureFileSystem.java:2076)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:23)
    at $iwC$$iwC$$iwC.<init>(<console>:28)
    at $iwC$$iwC.<init>(<console>:30)
    at $iwC.<init>(<console>:32)
    at <init>(<console>:34)
    at .<init>(<console>:38)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at com.cloudera.livy.repl.scalaRepl.SparkInterpreter$$anonfun$executeLine$1.apply(SparkInterpreter.scala:272)
    at com.cloudera.livy.repl.scalaRepl.SparkInterpreter$$anonfun$executeLine$1.apply(SparkInterpreter.scala:272)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at scala.Console$.withOut(Console.scala:126)
    at com.cloudera.livy.repl.scalaRepl.SparkInterpreter.executeLine(SparkInterpreter.scala:271)
    at com.cloudera.livy.repl.scalaRepl.SparkInterpreter.executeLines(SparkInterpreter.scala:246)
    at com.cloudera.livy.repl.scalaRepl.SparkInterpreter.execute(SparkInterpreter.scala:104)
    at com.cloudera.livy.repl.Session.com$cloudera$livy$repl$Session$$executeCode(Session.scala:98)
    at com.cloudera.livy.repl.Session$$anonfun$3.apply(Session.scala:73)
    at com.cloudera.livy.repl.Session$$anonfun$3.apply(Session.scala:73)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Thanks!

Recommended answer

The reason this fails is that it is actually looking in your default storage container rather than in testContainer, and thus not finding the example folder. You can see this by changing the path to wasb://testContainer@testhdi.blob.core.windows.net/ and noticing that it lists files from a different container.
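One quick way to confirm which container the defaults resolve to is to print the URI of the default FileSystem. This is a minimal sketch of my own (not from the original answer), assuming it runs in the Spark shell where sc is the active SparkContext:

import org.apache.hadoop.fs.FileSystem

// Print the URI of the FileSystem obtained from the cluster defaults.
// On HDInsight this is typically the cluster's default wasb:// container,
// not testContainer, which explains the FileNotFoundException above.
val defaultFs = FileSystem.get(sc.hadoopConfiguration)
println(defaultFs.getUri)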

I don't know why this is, but I discovered that you can fix it by passing the path to the FileSystem.get call, like this:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new java.net.URI("wasb://testContainer@testhdi.blob.core.windows.net/example/"), new Configuration())
val status = fs.listStatus(new Path("wasb://testContainer@testhdi.blob.core.windows.net/example/"))
status.foreach(x => println(x.getPath))
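Since the question only needs the file names, a possible follow-up on top of the status array returned above is to keep just the names of the .json files. This is my own sketch, not part of the original answer:

// From the FileStatus entries returned above, keep only regular files
// whose names end in .json, and print the name component alone.
val jsonNames = status.filter(s => s.isFile && s.getPath.getName.endsWith(".json")).map(_.getPath.getName)
jsonNames.foreach(println)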
