How does Spark read files whose names begin with an underscore?
Problem description
When I use Spark to parse log files, I notice that if the first character of the file name is _, the result is empty. Here is my test code:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .appName("TestLog")
    .master("local")
    .getOrCreate();

// The file name starts with an underscore.
JavaRDD<String> input = spark.read().text("D:\\_event_2.log").javaRDD();
System.out.println("size : " + input.count());
If I rename the file to event_2.log, the code runs correctly.

I found that the text function is defined as:
@scala.annotation.varargs
def text(paths: String*): Dataset[String] = {
format("text").load(paths : _*).as[String](sparkSession.implicits.newStringEncoder)
}
I think this could be because _ is Scala's placeholder. How can I avoid this problem?
Recommended answer
This has nothing to do with Scala. Spark uses the Hadoop input API to read files, and that API ignores every file whose name starts with an underscore (_) or a dot (.).
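The filter in question is built into Hadoop's FileInputFormat. The sketch below paraphrases that default hidden-file filter from memory of the Hadoop source (the exact field name and surrounding class are assumptions); it is shown only to illustrate why paths beginning with _ or . are silently dropped when the input directory is listed.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Roughly what FileInputFormat's built-in hidden-file filter does:
// any file whose name starts with "_" or "." is excluded from the input.
PathFilter hiddenFileFilter = new PathFilter() {
  @Override
  public boolean accept(Path p) {
    String name = p.getName();
    return !name.startsWith("_") && !name.startsWith(".");
  }
};

This convention exists because Hadoop jobs write bookkeeping files such as _SUCCESS and _temporary into output directories, and those should not be re-read as data; a log named _event_2.log gets skipped for the same reason.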
I don't know how to disable this in Spark though.
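As a practical workaround (a minimal sketch building on the question's own observation that renaming the file makes it readable), you could copy the log to a name without the leading underscore before handing it to Spark. The paths below are the ones from the question and are only illustrative.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Copy "_event_2.log" to a name Hadoop will not treat as hidden,
// then read the copy with the same Spark code as before.
Path hidden  = Paths.get("D:\\_event_2.log");
Path visible = Paths.get("D:\\event_2.log");
Files.copy(hidden, visible, StandardCopyOption.REPLACE_EXISTING);

JavaRDD<String> input = spark.read().text(visible.toString()).javaRDD();
System.out.println("size : " + input.count());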