Access files that start with underscore in Apache Spark
Question
I am trying to access gz files on S3 that start with _ in Apache Spark. Unfortunately, Spark deems these files invisible and returns Input path does not exist: s3n:.../_1013.gz. If I remove the underscore, it finds the file just fine.
I tried adding a custom PathFilter to the hadoopConfiguration:
package CustomReader

import org.apache.hadoop.fs.{Path, PathFilter}

// A filter that accepts every path, including ones starting with "_"
class GFilterZip extends PathFilter {
  override def accept(path: Path): Boolean = true
}

// in spark settings
sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
  classOf[CustomReader.GFilterZip],
  classOf[org.apache.hadoop.fs.PathFilter])
But I still have the same problem. Any ideas?
System: Apache Spark 1.6.0 with Hadoop 2.3
Answer
Files whose names start with _ or . are treated as hidden files, and Hadoop's built-in hiddenFileFilter is always applied: it is added inside the method org.apache.hadoop.mapred.FileInputFormat.listStatus, so a user-supplied PathFilter is combined with it rather than replacing it. That is why the custom filter above never gets a chance to accept _1013.gz.
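To see why the file is dropped before any custom filter runs, here is a minimal sketch that mirrors the accept logic of Hadoop's hiddenFileFilter (the isVisible helper is a name chosen for illustration, not part of the Hadoop API):

```scala
// Mirrors the predicate used by Hadoop's built-in hiddenFileFilter in
// org.apache.hadoop.mapred.FileInputFormat: a file name is accepted
// only if it starts with neither "_" nor ".".
def isVisible(name: String): Boolean =
  !name.startsWith("_") && !name.startsWith(".")

println(isVisible("_1013.gz")) // prints false: filtered out as hidden
println(isVisible("1013.gz"))  // prints true: visible to the input format
```

Because this predicate is applied unconditionally during input listing, the practical fix is to rename the files so they no longer begin with an underscore.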
See this answer: Which files are ignored as input by mapper?