HDFS: Using HDFS API to append to a SequenceFile

Views: 238  Published: 2018/5/31 19:33:47  Tags: hadoop, hdfs

Problem Description

I've been trying to create and maintain a Sequence File on HDFS using the Java API without running a MapReduce job as a setup for a future MapReduce job. I want to store all of my input data for the MapReduce job in a single Sequence File, but the data gets appended over time throughout the day. The problem is, if a SequenceFile exists, the following call will just overwrite the SequenceFile instead of appending to it.

// fs and conf are set up for HDFS, not as a LocalFileSystem
seqWriter = SequenceFile.createWriter(fs, conf, new Path(hdfsPath),
               keyClass, valueClass, SequenceFile.CompressionType.NONE);
seqWriter.append(new Text(key), new BytesWritable(value));
seqWriter.close();

Another concern is that I cannot maintain a file of my own format and turn the data into a SequenceFile at the end of the day as a MapReduce job could be launched using that data at any point.

I cannot find any other API call to append to a SequenceFile and maintain its format. I also cannot simply concatenate two SequenceFiles because of their formatting needs.

I also want to avoid running a MapReduce job for this, since it has high overhead for the small amount of data I'm adding to the SequenceFile.

Any thoughts or work-arounds? Thanks.

Solution

Support for appending to existing SequenceFiles was added in the Apache Hadoop 2.6.1 and 2.7.2 releases onwards, via the enhancement JIRA HADOOP-7139: https://issues.apache.org/jira/browse/HADOOP-7139

For example usage, see the test case: https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/TestSequenceFileAppend.java#L63-L140

CDH5 users can find the same ability in version CDH 5.7.1 onwards.
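
As a rough illustration of the append support described above, here is a minimal sketch modeled on the option-based createWriter call used in the linked test case. It assumes a Hadoop release that includes HADOOP-7139, a Configuration whose default file system points at HDFS (or a fully qualified hdfs:// path), and Text/BytesWritable as the key and value types, as in the question. The class name SequenceFileAppendSketch and the helper appendRecord are hypothetical names chosen for this example.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical example class; not part of Hadoop.
public class SequenceFileAppendSketch {

    // Appends one record to the SequenceFile at hdfsPath.
    // With appendIfExists(true), an existing file is opened for append
    // instead of being overwritten; a missing file is created.
    static void appendRecord(Configuration conf, String hdfsPath,
                             String key, byte[] value) throws IOException {
        // Assumption: the key class, value class, and compression settings
        // match those the file was originally written with.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(hdfsPath)),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.NONE),
                SequenceFile.Writer.appendIfExists(true))) {
            writer.append(new Text(key), new BytesWritable(value));
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: core-site.xml on the classpath sets fs.defaultFS to HDFS,
        // or args[0] is a fully qualified hdfs:// path.
        Configuration conf = new Configuration();
        appendRecord(conf, args[0], "example-key", new byte[] {1, 2, 3});
    }
}
```

The only change relative to the call in the question is the switch to the Writer.Option form of createWriter plus the appendIfExists(true) option. On releases older than those named above, that option is not available and the file would still be overwritten.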
