Hadoop HDFS: Read sequence files that are being written


Problem Description





I am using Hadoop 1.0.3.

I write logs to a Hadoop sequence file in HDFS. I call syncFS() after each batch of logs, but I never close the file (except when performing the daily roll).
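For context, the writer side looks roughly like this (a minimal sketch; the path and the LongWritable/Text key/value types are placeholders, not my actual schema):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RollingLogWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/logs/current.seq"); // hypothetical path

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, LongWritable.class, Text.class);

        // Append a batch of records, then push them to the datanodes.
        writer.append(new LongWritable(System.currentTimeMillis()),
                      new Text("a log line"));
        writer.syncFs(); // flushes to the datanodes, no guarantee of hitting disk

        // The file is intentionally left open; it is only closed on the daily roll.
        // writer.close();
    }
}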

What I want to guarantee is that the file is available to readers while the file is still being written.

I can read the bytes of the sequence file via FSDataInputStream, but if I try to use SequenceFile.Reader.next(key,val), it returns false at the first call.

I know the data is in the file, since I can read it with FSDataInputStream or with the cat command, and I am 100% sure that syncFS() is called.
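This is roughly how the two read paths behave (again a sketch with a placeholder path and key/value types): raw byte access returns data, while the record-level reader reports no records:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LogReaderCheck {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/logs/current.seq"); // hypothetical path

        // Raw byte access: this does return data from the still-open file.
        FSDataInputStream in = fs.open(path);
        byte[] buf = new byte[4096];
        System.out.println("raw bytes read: " + in.read(buf));
        in.close();

        // Record-level access: next() comes back false while the file is open.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        LongWritable key = new LongWritable();
        Text value = new Text();
        System.out.println("next() returned: " + reader.next(key, value));
        reader.close();
    }
}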

I checked the namenode and datanode logs, no error or warning.

Why is SequenceFile.Reader unable to read the file while it is still being written?

Solution

You can't ensure that the data you read has been completely written to disk on the datanode side. You can see this in the documentation of DFSClient#DFSOutputStream.sync(), which states:

  All data is written out to datanodes. It is not guaranteed that data has
  been flushed to persistent store on the datanode. Block allocations are
  persisted on namenode.

So it basically updates the namenode's block map with the current information and sends the data to the datanode. Since the data is not flushed to disk on the datanode, but you read directly from the datanode, you hit a time frame where the data is buffered somewhere and not accessible. Thus your SequenceFile reader thinks the data stream is finished (or empty), cannot read additional bytes, and returns false to the deserialization process.

A datanode writes the data to disk only once the block is fully received (it is written ahead of that, but not readable from outside). So you are able to read from the file once the block size has been reached, or once the file has been closed and the block thereby finalized. This makes sense in a distributed environment, because your writer can die without finishing a block properly; it is a matter of consistency.

So the fix would be to make the block size very small so that blocks are finished more often. But that is not very efficient, and I hope it is clear that your requirement is not well suited to HDFS.
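As an illustration of that workaround, a minimal sketch (dfs.block.size is the Hadoop 1.x property name, the path and key/value types are placeholders, and whether the writer actually picks up the per-conf value depends on how the FileSystem instance was created):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallBlockWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // 1 MB blocks instead of the 64 MB default, so blocks finalize sooner.
        // Hadoop 1.x property name; later releases call it dfs.blocksize.
        conf.setLong("dfs.block.size", 1L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/logs/current.seq"); // hypothetical path

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, LongWritable.class, Text.class);
        // ... append() and syncFs() as before ...
        writer.close();
    }
}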
