How to transfer a binary file into an RDD in Spark?


Problem description

I am trying to load SEG-Y files into Spark and transfer them into an RDD for a MapReduce operation, but I have failed to get them into an RDD. Can anyone offer help?

Recommended answer

You could use the binaryRecords() pySpark call to convert a binary file's content into an RDD:

http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords

binaryRecords(path, recordLength)

Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.

Parameters:
path – Directory to the input data files
recordLength – The length at which to split the records
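
For example, a minimal sketch of the call (the HDFS path and the 240-byte record length are assumptions for illustration, not values from the question):

from pyspark import SparkContext

sc = SparkContext(appName="binary-records-demo")

# Each RDD element is one fixed-width record of exactly 240 bytes.
records = sc.binaryRecords("hdfs:///data/traces.bin", recordLength=240)
print(records.count())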

Then you could map() that RDD into a structure by using, for example, struct.unpack():

https://docs.python.org/2/library/struct.html
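
As a hedged illustration, assuming each record holds one big-endian int32 followed by two float32 values (the layout and path are made up for the example):

import struct
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Assumed record layout: one big-endian int32 followed by two float32s.
RECORD_FMT = ">iff"
RECORD_LEN = struct.calcsize(RECORD_FMT)   # 12 bytes per record

records = sc.binaryRecords("hdfs:///data/records.bin", RECORD_LEN)

# Each element is a fixed-width bytes object; unpack it into a tuple of numbers.
parsed = records.map(lambda raw: struct.unpack(RECORD_FMT, raw))
print(parsed.take(3))   # e.g. [(7, 0.5, 1.25), ...]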

We use this approach to ingest proprietary fixed-width-record binary files. There is a bit of Python code that generates the format string (the first argument to struct.unpack), but if your file layout is static, it's fairly simple to do manually one time.
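
A sketch of what that generation step might look like (the field names and type codes here are purely hypothetical, not an actual file layout):

import struct
from collections import namedtuple

# Hypothetical field spec: (field name, struct type code), big-endian records.
FIELDS = [("trace_id", "i"), ("offset", "i"), ("amplitude", "f")]

RECORD_FMT = ">" + "".join(code for _, code in FIELDS)     # ">iif"
Record = namedtuple("Record", [name for name, _ in FIELDS])

def parse(raw):
    # Unpack one fixed-width byte record into a named tuple.
    return Record(*struct.unpack(RECORD_FMT, raw))

# parsed = records.map(parse)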

It is similarly possible to do this using pure Scala: the Scala SparkContext exposes the same binaryRecords(path, recordLength) method, which returns an RDD[Array[Byte]].
