Use Spark to list all files in a Hadoop HDFS directory?
Question
I want to loop through all the text files in a Hadoop directory and count all occurrences of the word "error". Is there a way to do the equivalent of hadoop fs -ls /users/ubuntu/ to list all the files in a directory with the Apache Spark Scala API?
From the first example given, the Spark context seems to access files only individually, through something like:
val file = spark.textFile("hdfs://target_load_file.txt")
In my problem, I do not know beforehand how many files are in the HDFS folder, nor their names. I looked at the Spark context docs but couldn't find this kind of functionality.
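For what it's worth, if you want to enumerate the files explicitly rather than glob over them, Hadoop's own FileSystem API can list a directory from Scala. A minimal sketch, assuming a reachable NameNode at hdfs://namenode:8020 (hypothetical host and path) and hadoop-common on the classpath:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Connect to HDFS; the URI and directory below are placeholders.
val fs = FileSystem.get(new java.net.URI("hdfs://namenode:8020"), new Configuration())

// listStatus returns one FileStatus per entry in the directory.
val files = fs.listStatus(new Path("/users/ubuntu")).map(_.getPath.toString)
files.foreach(println)
```

Each path could then be passed to sc.textFile individually, though for the word count itself the wildcard approach below is simpler.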
You can use a wildcard:
val errorCount = sc.textFile("hdfs://some-directory/*")
.flatMap(_.split(" ")).filter(_ == "error").count
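To see what the wildcard pipeline computes, the same flatMap/split/filter steps can be run on a plain local Seq, which mirrors the RDD operations above. A sketch with made-up sample lines standing in for the contents of the HDFS files:

```scala
// Hypothetical sample lines, in place of sc.textFile("hdfs://some-directory/*").
val lines = Seq(
  "error at startup",
  "no problems here",
  "another error and one more error"
)

// Same pipeline as the RDD version; on a Seq the terminal step is .size
// rather than .count.
val errorCount = lines.flatMap(_.split(" ")).filter(_ == "error").size
println(errorCount)  // 3 for the sample above
```

Note this counts whole whitespace-separated tokens equal to "error"; a token like "error:" would not match, so depending on the log format a contains check or regex split may be more appropriate.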