How to get the start and end position of each Avro record in a compressed Avro file?


Problem description

My problem is this. I have a 2 GB Snappy-compressed Avro file, stored on HDFS, that holds about 1000 Avro records. I know I can write code to open this Avro file and print out each record. My question is: is there a way in Java to open the file, iterate through each record, and write the start position and end position of each record within the file to a text file? I could then have a Java function such as readRecord(startposition, endposition) that uses those two positions to read one specific Avro record quickly, without iterating through the whole file.

Recommended answer

I don't have time to provide an off-the-shelf implementation, but I think I can give you some hints.

Let's start with the Avro Specification: Object Container Files

Basically, an Avro file is a sequence of self-contained blocks, each containing one or more records (you can configure the block size, and a record is never split across two blocks). Each block consists of:


  • A long indicating the count of objects in this block.
  • A long indicating the size in bytes of the serialized objects in the current block, after any codec is applied.
  • The serialized objects. If a codec is specified, this is compressed by that codec.
  • The file's 16-byte sync marker.
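
The block layout above can be walked without decompressing anything. The following is a minimal sketch, not the Avro library's own code: it assumes the input bytes start at the first data block (the file header with its magic bytes, metadata and sync marker is already consumed), and the class name AvroBlockWalker and its methods are hypothetical names chosen for illustration. It relies only on the fact that Avro encodes a long as a variable-length zig-zag integer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: walks Avro-style data blocks without decoding their payload.
class AvroBlockWalker {
    static final int SYNC_SIZE = 16; // every block ends with the file's 16-byte sync marker

    /** Decodes one Avro long (variable-length, zig-zag encoded). */
    static long readLong(ByteBuffer buf) {
        long raw = 0;
        int shift = 0;
        byte b;
        do {
            b = buf.get();
            raw |= (long) (b & 0x7F) << shift; // low 7 bits carry data
            shift += 7;
        } while ((b & 0x80) != 0);             // high bit set means "more bytes follow"
        return (raw >>> 1) ^ -(raw & 1);       // undo zig-zag
    }

    /** Encodes one Avro long; used here only to build demo input. */
    static void writeLong(ByteArrayOutputStream out, long value) {
        long raw = (value << 1) ^ (value >> 63); // zig-zag
        while ((raw & ~0x7FL) != 0) {
            out.write((int) ((raw & 0x7F) | 0x80));
            raw >>>= 7;
        }
        out.write((int) raw);
    }

    /** Appends one demo block: count, payload size, payload, zero-filled sync marker. */
    static void writeBlock(ByteArrayOutputStream out, long objectCount, byte[] payload) {
        writeLong(out, objectCount);
        writeLong(out, payload.length);
        out.write(payload, 0, payload.length);
        out.write(new byte[SYNC_SIZE], 0, SYNC_SIZE);
    }

    /** Returns {blockOffset, objectCount, payloadSize} per block; the payload is skipped. */
    static List<long[]> walk(byte[] blocks) {
        ByteBuffer buf = ByteBuffer.wrap(blocks);
        List<long[]> index = new ArrayList<>();
        while (buf.hasRemaining()) {
            long offset = buf.position();
            long count = readLong(buf);
            long size = readLong(buf);
            // Jump over the (possibly compressed) payload and the sync marker.
            buf.position(buf.position() + Math.toIntExact(size) + SYNC_SIZE);
            index.add(new long[] {offset, count, size});
        }
        return index;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeBlock(out, 3, new byte[10]);
        writeBlock(out, 2, new byte[200]);
        for (long[] b : walk(out.toByteArray())) {
            System.out.printf("offset=%d count=%d size=%d%n", b[0], b[1], b[2]);
        }
    }
}
```

Note that the walk yields offsets per block, not per record: the record boundaries live inside the compressed payload and only become visible after decompression.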

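The documentation explicitly states: "Thus, each block's binary data can be efficiently extracted or skipped without deserializing the contents. The combination of block size, object count, and sync marker enable detection of corrupt blocks and help ensure data integrity."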

You cannot seek directly to a specific record, but you can seek to a given block and then iterate over its objects. It is not exactly what you need, but it seems close enough. I believe that you won't be able to do much better than that with Avro containers. You can still tune the block size to bound the number of iterations within a block. When compression is used, it is applied at the block level, so it won't be an issue.
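
One way to picture "seek to a block, then iterate" is an index that maps a record number to a block offset plus a number of records to skip inside that block. The sketch below is an illustration under stated assumptions, not part of Avro: the class BlockIndex and its methods are hypothetical, and it assumes the block offsets and object counts were collected in a previous pass over the file.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical index: record number -> (block offset, records to skip inside the block).
class BlockIndex {
    // Key: number of the first record in a block. Value: byte offset of that block.
    private final TreeMap<Long, Long> firstRecordToOffset = new TreeMap<>();
    private long totalRecords = 0;

    /** Registers the next block in file order. */
    void addBlock(long blockOffset, long objectCount) {
        if (objectCount > 0) {
            firstRecordToOffset.put(totalRecords, blockOffset);
        }
        totalRecords += objectCount;
    }

    /** Returns {blockOffset, recordsToSkip} for a zero-based record number. */
    long[] locate(long recordNumber) {
        if (recordNumber < 0 || recordNumber >= totalRecords) {
            throw new IndexOutOfBoundsException("no record " + recordNumber);
        }
        // The block holding the record is the one with the greatest first-record number <= recordNumber.
        Map.Entry<Long, Long> block = firstRecordToOffset.floorEntry(recordNumber);
        return new long[] {block.getValue(), recordNumber - block.getKey()};
    }

    public static void main(String[] args) {
        BlockIndex index = new BlockIndex();
        index.addBlock(0, 3);   // records 0..2
        index.addBlock(28, 2);  // records 3..4
        long[] hit = index.locate(4);
        System.out.println("seek to " + hit[0] + ", skip " + hit[1] + " record(s)");
    }
}
```

A reader would then seek to the returned block offset, read and discard the indicated number of records, and return the next one. The cost of a lookup is bounded by the number of records per block, which is why a smaller block size makes single-record reads cheaper.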

I believe that such a reader can be implemented using only the public Avro API (DataFileReader provides seek and sync methods, among others).
