Spark fastest way for creating RDD of numpy arrays
Problem description
My Spark application uses RDDs of numpy arrays.
At the moment, I'm reading my data from AWS S3, where it's represented as
a simple text file in which each line is a vector and each element is separated by a space, for example:
1 2 3
5.1 3.6 2.1
3 0.24 1.333
I'm using numpy's loadtxt() function to create a numpy array from it.
However, this method seems to be very slow, and my app spends too much time (I think) converting the dataset to numpy arrays.
Can you suggest a better way of doing this? For example, should I keep my dataset as a binary file, or should I create the RDD in another way?
Some code for how I create my RDD:
data = sc.textFile("s3_url", initial_num_of_partitions).mapPartitions(readPointBatch)
The readPointBatch function:
def readPointBatch(iterator):
    return [np.loadtxt(iterator, dtype=np.float64)]
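For reference, this approach can be run outside Spark, since np.loadtxt() accepts any iterable of text lines; the sample lines below are hypothetical stand-ins for one partition of the S3 file:

```python
import numpy as np

def readPointBatch(iterator):
    # np.loadtxt accepts any iterable of text lines, so the partition
    # iterator can be passed in directly; this line-by-line Python
    # parsing is what makes the approach slow on large inputs.
    return [np.loadtxt(iterator, dtype=np.float64)]

# Stand-in for one partition of the space-separated text file.
batch = readPointBatch(iter(["1 2 3", "5.1 3.6 2.1", "3 0.24 1.333"]))
```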
Answer

The best thing to do under these circumstances is to use the pandas library for IO.
Please refer to this question: pandas read_csv() and python iterator as input (http://stackoverflow.com/questions/33927320/pandas-read-csv-and-python-iterator-as-input/33986924#33986924).
There you will see how to replace the np.loadtxt() function so that creating an RDD of numpy arrays is much faster.
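A minimal sketch of that replacement, assuming the space-separated format from the question; wrapping the partition's lines in a StringIO buffer is one way to hand an iterator to read_csv (the linked question discusses alternatives):

```python
import io

import numpy as np
import pandas as pd

def readPointBatch(iterator):
    # Join the partition's lines into one in-memory text buffer and let
    # pandas' fast C parser do the conversion in a single call, instead
    # of numpy's slower line-by-line parsing.
    # sep=" " matches the single-space separator from the sample data.
    buf = io.StringIO("\n".join(iterator))
    frame = pd.read_csv(buf, sep=" ", header=None, dtype=np.float64)
    return [frame.values]

# Stand-in for one partition of the space-separated text file.
batch = readPointBatch(iter(["1 2 3", "5.1 3.6 2.1", "3 0.24 1.333"]))
```

In the Spark job this function would be passed to mapPartitions() exactly as in the question's snippet.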