Reading binary File into Spark


Question

I have a set of files, each containing a single record in Marc21 binary format. I would like to ingest the set of files as an RDD, where each element would be one record as binary data. Later on I will use a Marc library to convert each record into a Java object for further processing.

As of now, I am puzzled as to how I can read a binary file.

I have seen the following function:

binaryRecords(path: String, recordLength: Int, conf: Configuration)

However, it assumes a single file containing multiple records of the same length. My records will definitely be of different sizes, and besides, each one is in a separate file.
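For reference, a minimal sketch of how binaryRecords is meant to be used, assuming a hypothetical flat file of fixed 2048-byte records (both the path and the length are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("FixedLengthDemo"))

// binaryRecords splits one flat file into equal-sized chunks, so every
// record must be exactly recordLength bytes long -- which is why it does
// not fit variable-length Marc21 records spread over separate files.
val fixed = sc.binaryRecords("hdfs:///data/fixed.bin", 2048)
// fixed: RDD[Array[Byte]], one element per 2048-byte slice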

Is there a way to get around that? How can I give a length for each file? Would the only way be to calculate the length of each file and then read the records?

The other solution I see would obviously be to read the records in Java format and serialize them into whatever format is convenient to ingest.

Please advise.

Answer

Have you tried sc.binaryFiles() from Spark?

Here is a link to the documentation.
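A minimal sketch of what that could look like, assuming the files sit under a hypothetical hdfs:///data/marc21/ directory and that marc4j is the Marc library used for the conversion step:

import org.apache.spark.{SparkConf, SparkContext}
import org.marc4j.MarcStreamReader
import java.io.ByteArrayInputStream

val sc = new SparkContext(new SparkConf().setAppName("ReadMarc21"))

// binaryFiles returns an RDD[(String, PortableDataStream)]: one element
// per file, keyed by its path, with no fixed-length constraint.
val raw = sc.binaryFiles("hdfs:///data/marc21/*")

// Each file holds exactly one record, so toArray() yields that record's bytes.
val recordBytes = raw.map { case (path, stream) => (path, stream.toArray()) }

// Later, hand the bytes to marc4j on the executors to get Record objects.
val marcRecords = recordBytes.mapPartitions { iter =>
  iter.flatMap { case (_, bytes) =>
    val reader = new MarcStreamReader(new ByteArrayInputStream(bytes))
    if (reader.hasNext) Some(reader.next()) else None
  }
}

Unlike binaryRecords, binaryFiles puts no constraint on record length; the trade-off is one RDD element per file, so with very many small files you may want to repartition afterwards.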
