Spark Streaming - processing binary data file
Problem Description
I'm using pyspark 1.6.0.

I have existing pyspark code that reads binary data files from an AWS S3 bucket. Other Spark/Python code parses the bits in the data to convert them into ints, strings, booleans, and so on. Each binary file holds one record of data.

In pyspark I read the binary files using: sc.binaryFiles("s3n://.......")

This works great, as it gives a tuple of (filename, data), but I'm trying to find an equivalent pyspark streaming API to read binary files as a stream (and hopefully get the filename too, if possible).

I tried: binaryRecordsStream(directory, recordLength)

but I couldn't get it working...

Can anyone shed some light on how pyspark streaming can read binary data files?
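For context, the batch-mode read described above can be sketched as follows. The record layout here is hypothetical (a big-endian int32 plus a boolean byte), since the asker's actual binary format isn't given; only the per-pair decoding logic matters:

```python
import struct

# Hypothetical record layout for illustration only: a big-endian int32
# followed by one boolean byte. The real layout depends on the file format.
FMT = ">i?"

def decode(pair):
    """Turn one (filename, data) pair from binaryFiles into parsed fields."""
    filename, data = pair
    value, flag = struct.unpack(FMT, data)
    return filename, value, flag

# The batch read itself needs a SparkContext (not shown here):
# rdd = sc.binaryFiles("s3n://.......")   # -> RDD of (filename, bytes) pairs
# parsed = rdd.map(decode)
```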
In Spark Streaming, the relevant concept is the fileStream API, which is available in Scala and Java but not in Python, as noted in the documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources. If the files you are reading can be read as text files, you can use the textFileStream API instead.