How to transfer a binary file into an RDD in Spark?
Problem description
I am trying to load SEG-Y files into Spark and transfer them into an RDD for a MapReduce operation, but I have failed to get them into an RDD. Can anyone offer help?
Recommended answer
You could use the binaryRecords() pySpark call to convert a binary file's contents into an RDD:
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords
binaryRecords(path, recordLength)
Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.
Parameters:
path – Directory to the input data files
recordLength – The length at which to split the records
Then you could map() that RDD into a structure by using, for example, struct.unpack():
https://docs.python.org/2/library/struct.html
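As a minimal sketch of the two steps above: the record layout here (a 4-byte big-endian int followed by four 4-byte big-endian floats) is a made-up example, not the SEG-Y layout, and the HDFS path is hypothetical. The Spark calls are shown in comments; the parsing itself is demonstrated locally on one packed record.

```python
import struct

# Hypothetical fixed-width record layout: one 4-byte big-endian int
# followed by four 4-byte big-endian floats -> 20 bytes per record.
RECORD_FORMAT = ">i4f"
RECORD_LENGTH = struct.calcsize(RECORD_FORMAT)  # 20

def parse_record(raw_bytes):
    """Unpack one fixed-width binary record into a tuple of numbers."""
    return struct.unpack(RECORD_FORMAT, raw_bytes)

# In Spark (sc is a SparkContext) the two steps would look like:
# records = sc.binaryRecords("hdfs:///data/traces.bin", RECORD_LENGTH)
# parsed  = records.map(parse_record)

# Local demonstration without Spark: pack a sample record, then parse it.
sample = struct.pack(RECORD_FORMAT, 7, 1.0, 2.0, 3.0, 4.0)
print(parse_record(sample))  # (7, 1.0, 2.0, 3.0, 4.0)
```

Each element of the RDD returned by binaryRecords() is a raw bytes object of exactly recordLength bytes, which is why a single struct.unpack() with a constant format string is enough.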
We use this approach to ingest proprietary fixed-width-record binary files. There is a bit of Python code that generates the format string (the first argument to struct.unpack), but if your file layout is static, it's fairly simple to write it by hand one time.
A similar approach is possible using pure Scala, since binaryRecords() is also available on the Scala SparkContext.