How can I append to the same file in HDFS (Spark 2.11)


Problem Description

I am trying to store stream data into HDFS using Spark Streaming, but it keeps creating new files instead of appending to one single file or a few files.

If it keeps creating n files, I feel it won't be very efficient.

HDFS FILE SYSTEM

Code

// requires import spark.implicits._ (from your SparkSession) in scope for toDF()
import org.apache.spark.sql.SaveMode

lines.foreachRDD(f => {
  if (!f.isEmpty()) {
    // coalesce to a single partition so each batch writes one part file
    val df = f.toDF().coalesce(1)
    // SaveMode.Append adds new part files to the existing directory
    df.write.mode(SaveMode.Append).json("hdfs://localhost:9000/MT9")
  }
})

In my pom I am using the respective dependencies:

  • spark-core_2.11
  • spark-sql_2.11
  • spark-streaming_2.11
  • spark-streaming-kafka-0-10_2.11

Answer

As you already realized, Append in Spark means write-to-existing-directory, not append-to-file.

This is intentional and desired behavior (think what would happen if the process failed in the middle of "appending", even if the format and file system allowed it).
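
You can see this directly: two consecutive writes with SaveMode.Append land as separate part files under the same directory. A minimal sketch to paste into spark-shell (where a SparkSession and its implicits are already in scope); the /tmp/append-demo path is just an example:

import org.apache.spark.sql.SaveMode

// each Append write adds its own part-*.json files under the same directory
Seq(1, 2, 3).toDF("n").write.mode(SaveMode.Append).json("/tmp/append-demo")
Seq(4, 5, 6).toDF("n").write.mode(SaveMode.Append).json("/tmp/append-demo")
// listing /tmp/append-demo now shows several part files, not one appended file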

Operations like merging files should be applied by a separate process, if necessary at all, which ensures correctness and fault tolerance. Unfortunately this requires a full copy, which for obvious reasons is not desirable on a batch-to-batch basis.
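
If the number of small files does become a problem, such a compaction step can run periodically outside the streaming job. A minimal sketch of a standalone job, assuming the input directory hdfs://localhost:9000/MT9 from the question and a hypothetical output directory hdfs://localhost:9000/MT9_compacted:

import org.apache.spark.sql.{SaveMode, SparkSession}

object CompactJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-json").getOrCreate()

    // read all the small part files written by the streaming job
    val df = spark.read.json("hdfs://localhost:9000/MT9")

    // rewrite them as a small, fixed number of files in a separate directory;
    // Overwrite keeps repeated runs idempotent
    df.coalesce(4)
      .write
      .mode(SaveMode.Overwrite)
      .json("hdfs://localhost:9000/MT9_compacted")

    spark.stop()
  }
}

Downstream readers would then point at the compacted directory (or the swap back into the original path would have to be handled carefully), which is exactly why keeping this as a separate, fault-tolerant process matters.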

