Spark fastest way for creating RDD of numpy arrays
Problem description
My Spark application uses RDDs of numpy arrays.
At the moment, I'm reading my data from AWS S3, where it's represented as
a simple text file in which each line is a vector and each element is separated by a space, for example:
1 2 3
5.1 3.6 2.1
3 0.24 1.333
I'm using numpy's loadtxt() function to create a numpy array from it.
However, this method seems to be very slow, and my app spends too much time (I think) converting the dataset to numpy arrays.
Can you suggest a better way of doing this? For example, should I keep my dataset as a binary file, or should I create the RDD in another way?
Some code for how I create my RDD:
data = sc.textFile("s3_url", initial_num_of_partitions).mapPartitions(readPointBatch)
The readPointBatch function:
def readPointBatch(iterator):
    return [np.loadtxt(iterator, dtype=np.float64)]
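For reference, this approach can be run outside Spark, since np.loadtxt() accepts any iterable of text lines; the sample lines below are hypothetical stand-ins for one partition of the S3 file:

```python
import numpy as np

def readPointBatch(iterator):
    # np.loadtxt accepts any iterable of text lines, so the partition
    # iterator can be passed in directly; this line-by-line Python
    # parsing is what makes the approach slow on large inputs.
    return [np.loadtxt(iterator, dtype=np.float64)]

# Stand-in for one partition of the space-separated text file.
batch = readPointBatch(iter(["1 2 3", "5.1 3.6 2.1", "3 0.24 1.333"]))
```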
Answer

The best thing to do under these circumstances is to use the pandas library for IO.
Please refer to this question: pandas read_csv() and python iterator as input (http://stackoverflow.com/questions/33927320/pandas-read-csv-and-python-iterator-as-input/33986924#33986924).
There you will see how to replace the np.loadtxt() function so that creating an RDD of numpy arrays is much faster.
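A minimal sketch of that replacement, assuming the space-separated format from the question; wrapping the partition's lines in a StringIO buffer is one way to hand an iterator to read_csv (the linked question discusses alternatives):

```python
import io

import numpy as np
import pandas as pd

def readPointBatch(iterator):
    # Join the partition's lines into one in-memory text buffer and let
    # pandas' fast C parser do the conversion in a single call, instead
    # of numpy's slower line-by-line parsing.
    # sep=" " matches the single-space separator from the sample data.
    buf = io.StringIO("\n".join(iterator))
    frame = pd.read_csv(buf, sep=" ", header=None, dtype=np.float64)
    return [frame.values]

# Stand-in for one partition of the space-separated text file.
batch = readPointBatch(iter(["1 2 3", "5.1 3.6 2.1", "3 0.24 1.333"]))
```

In the Spark job this function would be passed to mapPartitions() exactly as in the question's snippet.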