How can I append to same file in HDFS (Spark 2.11)
Problem description
I am trying to store streaming data into HDFS using Spark Streaming, but it keeps creating new files instead of appending to one single file or a few files.
If it keeps creating n files, I feel it won't be very efficient.
HDFS file system
Code
lines.foreachRDD { f =>
  if (!f.isEmpty()) {
    // coalesce(1) yields a single part file per micro-batch
    val df = f.toDF().coalesce(1)
    df.write.mode(SaveMode.Append).json("hdfs://localhost:9000/MT9")
  }
}
In my pom I am using the respective dependencies:
- spark-core_2.11
- spark-sql_2.11
- spark-streaming_2.11
- spark-streaming-kafka-0-10_2.11
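For reference, the Scala 2.11 artifacts above would appear in the pom roughly as follows (shown for two of the four; the 2.4.8 version is an assumption — align it with whichever Spark 2.x release you actually target):

```xml
<!-- Versions are illustrative; use your Spark release for all four artifacts. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.11</artifactId>
  <version>2.4.8</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
  <version>2.4.8</version>
</dependency>
```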
Recommended answer
As you already realized, Append in Spark means write-to-existing-directory, not append-to-file.
This is intentional and desired behavior (think about what would happen if the process failed in the middle of "appending", even if the format and file system allowed it).
Operations like merging files should be applied by a separate process, if necessary at all, which ensures correctness and fault tolerance. Unfortunately this requires a full copy, which for obvious reasons is not desirable on a batch-to-batch basis.
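To make that separate merge process concrete, here is a minimal sketch of what such a compaction step does: collect the part-* files Spark left in an output directory and concatenate them into one file. It uses plain JVM file I/O against the local filesystem for illustration; a real HDFS compaction job would go through the Hadoop FileSystem API, and the object and path names here are hypothetical.

```scala
import java.nio.file.{Files, Path}
import java.util.stream.Collectors

object Compact {
  // Merge every "part-*" file under `dir` into a single output file,
  // in lexicographic order, skipping markers like _SUCCESS.
  def mergePartFiles(dir: Path, out: Path): Unit = {
    val all = Files.list(dir).collect(Collectors.toList[Path]())
    val parts = (0 until all.size)
      .map(i => all.get(i))
      .filter(_.getFileName.toString.startsWith("part-"))
      .sortBy(_.getFileName.toString)
    val sb = new StringBuilder
    for (p <- parts) sb.append(new String(Files.readAllBytes(p), "UTF-8"))
    Files.write(out, sb.toString.getBytes("UTF-8"))
  }
}
```

Note that this is exactly the full copy mentioned above: every byte is read and rewritten, which is why you would run it periodically rather than after every micro-batch.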