HDFS: Using HDFS API to append to a SequenceFile

Question

I've been trying to create and maintain a SequenceFile on HDFS using the Java API, without running a MapReduce job just to set up the input for a future MapReduce job. I want to store all of the input data for that job in a single SequenceFile, but the data arrives incrementally throughout the day. The problem is that if the SequenceFile already exists, the following call overwrites it instead of appending to it.

// fs and conf are set up for HDFS, not as a LocalFileSystem
seqWriter = SequenceFile.createWriter(fs, conf, new Path(hdfsPath),
               keyClass, valueClass, SequenceFile.CompressionType.NONE);
seqWriter.append(new Text(key), new BytesWritable(value));
seqWriter.close();

Another concern is that I cannot keep the data in a file of my own format and convert it to a SequenceFile at the end of the day, because a MapReduce job could be launched on that data at any point.

I cannot find any other API call that appends to a SequenceFile while preserving its format. I also cannot simply concatenate two SequenceFiles, since each one begins with its own header and sync marker.

I also want to avoid running a MapReduce job for this, since it has high overhead for the small amount of data I'm adding to the SequenceFile each time.

Any thoughts or workarounds? Thanks.

Answer

Support for appending to existing SequenceFiles was added in Apache Hadoop 2.6.1 and 2.7.2 (and later releases) through the enhancement JIRA HADOOP-7139: https://issues.apache.org/jira/browse/HADOOP-7139

For example usage, see the test case: https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/TestSequenceFileAppend.java#L63-L140
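
As a rough illustration of the append support described here, the following is a minimal sketch rather than a definitive implementation. It assumes a Hadoop release that includes HADOOP-7139 and uses the option-based SequenceFile.createWriter overload with the SequenceFile.Writer.appendIfExists(true) option. The class name SequenceFileAppendSketch and the helper appendRecord are hypothetical names chosen for this example, and the Text/BytesWritable key and value types are taken from the question. The underlying file system also has to support append, and the compression setting passed here should match the one the existing file was written with.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.IOException;

// Hypothetical helper class for illustration; not part of Hadoop.
public class SequenceFileAppendSketch {

    // Appends one record to the SequenceFile at hdfsPath.
    // With appendIfExists(true) the writer reopens an existing file for append;
    // if the file does not exist yet, it is created.
    static void appendRecord(Configuration conf, String hdfsPath, String key, byte[] value)
            throws IOException {
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(hdfsPath)),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.NONE),
                SequenceFile.Writer.appendIfExists(true))) {
            writer.append(new Text(key), new BytesWritable(value));
        } // try-with-resources closes the writer
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: SequenceFileAppendSketch <path>");
            return;
        }
        // Assumption: the Configuration on the classpath points at the target HDFS cluster.
        Configuration conf = new Configuration();
        appendRecord(conf, args[0], "example-key", new byte[] {1, 2, 3});
    }
}
```

Calling appendRecord repeatedly with the same path should add records to the one file instead of overwriting it, which avoids launching a MapReduce job for each small batch.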

CDH5 users get the same capability from CDH 5.7.1 onwards.
