Reading binary File into Spark


Question

I have a set of files, each containing a single record in Marc21 binary format. I would like to ingest the set of files as an RDD, where each element would be one record as binary data. Later on I will use a Marc library to convert each record into a Java object for further processing.

As of now, I am puzzled as to how I can read a binary file.

I have seen the following function:

binaryRecords(path: String, recordLength: Int, conf: Configuration)

However, it assumes a single file containing multiple records of the same length. My records will definitely be of different sizes, and besides, each one is in a separate file.
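For reference, a minimal sketch of how binaryRecords is meant to be used, assuming a hypothetical flat file of fixed 2048-byte records (both the path and the length are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("FixedLengthDemo"))

// binaryRecords splits one flat file into equal-sized chunks, so every
// record must be exactly recordLength bytes long -- which is why it does
// not fit variable-length Marc21 records spread over separate files.
val fixed = sc.binaryRecords("hdfs:///data/fixed.bin", 2048)
// fixed: RDD[Array[Byte]], one element per 2048-byte slice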

Is there a way to get around that? How can I give a length for each file? Would the only way be to calculate the length of each file and then read the records?

The other solution I see would obviously be to read the records in Java format and serialize them into whatever format is convenient to ingest.

Please advise.

Answer

Have you tried sc.binaryFiles() from Spark?

Here is a link to the documentation.
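A minimal sketch of what that could look like, assuming the files sit under a hypothetical hdfs:///data/marc21/ directory and that marc4j is the Marc library used for the conversion step:

import org.apache.spark.{SparkConf, SparkContext}
import org.marc4j.MarcStreamReader
import java.io.ByteArrayInputStream

val sc = new SparkContext(new SparkConf().setAppName("ReadMarc21"))

// binaryFiles returns an RDD[(String, PortableDataStream)]: one element
// per file, keyed by its path, with no fixed-length constraint.
val raw = sc.binaryFiles("hdfs:///data/marc21/*")

// Each file holds exactly one record, so toArray() yields that record's bytes.
val recordBytes = raw.map { case (path, stream) => (path, stream.toArray()) }

// Later, hand the bytes to marc4j on the executors to get Record objects.
val marcRecords = recordBytes.mapPartitions { iter =>
  iter.flatMap { case (_, bytes) =>
    val reader = new MarcStreamReader(new ByteArrayInputStream(bytes))
    if (reader.hasNext) Some(reader.next()) else None
  }
}

Unlike binaryRecords, binaryFiles puts no constraint on record length; the trade-off is one RDD element per file, so with very many small files you may want to repartition afterwards.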
