numpy arrays: filling and extracting data quickly

Question

Please see the important clarification at the bottom of this question.

I am using numpy to speed up some processing of longitude/latitude coordinates. Unfortunately, my numpy "optimizations" made my code run about 5x more slowly than it ran without using numpy.

The bottleneck seems to be in filling the numpy array with my data, and then extracting out that data after I have done the mathematical transformations. To fill the array I basically have a loop like:

import numpy

point_list = GetMyPoints() # returns a long list of ( lon, lat ) coordinate pairs
n = len( point_list )
point_buffer = numpy.empty( ( n, 2 ), numpy.float32 )

for point_index in xrange( 0, n ):
    point_buffer[ point_index ] = point_list[ point_index ]

That loop, just filling in the numpy array before even operating on it, is extremely slow, much slower than the entire computation was without numpy. (That is, it's not just the slowness of the python loop itself, but apparently some huge overhead in actually transferring each small block of data from python to numpy.) There is similar slowness on the other end; after I have processed the numpy arrays, I access each modified coordinate pair in a loop, again as

some_python_tuple = point_buffer[ index ]

Again that loop to pull the data out is much slower than the entire original computation without numpy. So, how do I actually fill the numpy array and extract data from the numpy array in a way that doesn't defeat the purpose of using numpy in the first place?
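One pattern that sidesteps the per-index loop on the way out is to convert the whole array back to Python objects in a single call. A minimal sketch with stand-in data (not the asker's actual code):

import numpy

point_buffer = numpy.array([(1.0, 2.0), (3.0, 4.0)], numpy.float32)

# one C-level call converts the entire array back to nested Python lists,
# instead of indexing point_buffer[index] once per point
point_pairs = point_buffer.tolist()   # [[1.0, 2.0], [3.0, 4.0]]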

I am reading the data from a shape file using a C library that hands me the data as a regular python list. I understand that if the library handed me the coordinates already in a numpy array there would be no "filling" of the numpy array necessary. But unfortunately the starting point for me with the data is as a regular python list. And more to the point, in general I want to understand how you quickly fill a numpy array with data from within python.
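For reference, a flat list of pairs like the one above can be handed to numpy in a single call, which moves the fill loop into C. A minimal sketch, with stand-in data in place of GetMyPoints():

import numpy

point_list = [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)]  # stand-in for GetMyPoints()

# one call allocates and fills the whole nx2 array at once
point_buffer = numpy.array(point_list, dtype=numpy.float32)
assert point_buffer.shape == (3, 2)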

Clarification

The loop shown above is actually oversimplified. I wrote it that way in this question because I wanted to focus on the problem I was seeing of trying to fill a numpy array slowly in a loop. I now understand that doing that is just slow.

In my actual application what I have is a shape file of coordinate points, and I have an API to retrieve the points for a given object. There are something like 200,000 objects. So I repeatedly call a function GetShapeCoords( i ) to get the coords for object i. This returns a list of lists, where each sublist is a list of lon/lat pairs, and the reason it's a list of lists is that some of the objects are multi-part (i.e., multi-polygon). Then, in my original code, as I read in each object's points, I was doing a transformation on each point by calling a regular python function, and then plotting the transformed points using PIL. The whole thing took about 20 seconds to draw all 200,000 polygons. Not terrible, but much room for improvement. I noticed that at least half of those 20 seconds were spent doing the transformation logic, so I thought I'd do that in numpy. And my original implementation was just to read in the objects one at a time, and keep appending all the points from the sublists into one big numpy array, which I then could do the math stuff on in numpy.

So, I now understand that simply passing a whole python list to numpy is the right way to set up a big array. But in my case I only read one object at a time. So one thing I could do is keep appending points together in a big python list of lists of lists. And then when I've compiled some large number of objects' points in this way (say, 10000 objects), I could simply assign that monster list to numpy.

So my question now is three parts:

(a) Is it true that numpy can take that big, irregularly shaped, list of lists of lists, and slurp it okay and quickly?

(b) I then want to be able to transform all the points in the leaves of that monster tree. What is the expression to get numpy to, for instance, "go into each sublist, and then into each subsublist, and then for each coordinate pair you find in those subsublists multiply the first (lon coordinate) by 0.5"? Can I do that?

(c) Finally, I need to get those transformed coordinates back out in order to plot them.

Winston's answer below seems to give some hint at how I might do this all using itertools. What I want to do is pretty much like what Winston does, flattening the list out. But I can't quite just flatten it out. When I go to draw the data, I need to be able to know when one polygon stops and the next starts. So, I think I could make it work if there were a way to quickly mark the end of each polygon (i.e., each subsublist) with a special coordinate pair like (-1000, -1000) or something like that. Then I could flatten with itertools as in Winston's answer, and then do the transforms in numpy. Then I need to actually draw from point to point using PIL, and here I think I'd need to reassign the modified numpy array back to a python list, and then iterate through that list in a regular python loop to do the drawing. Does that seem like my best option short of just writing a C module to handle all the reading and drawing for me in one step?
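To make the sentinel idea concrete, here is a minimal sketch; the (-1000, -1000) marker and the sample polygons are illustrative, not part of the original code:

import numpy

SENTINEL = (-1000.0, -1000.0)  # marker pair; must lie outside the real lon/lat range

# illustrative data: two polygons' worth of points
polygons = [[(1.0, 2.0), (3.0, 4.0)], [(5.0, 6.0), (7.0, 8.0)]]

flat = []
for poly in polygons:
    flat.extend(poly)
    flat.append(SENTINEL)         # mark where this polygon ends

data = numpy.array(flat, dtype=float)
mask = data[:, 0] != SENTINEL[0]  # rows holding real points
data[mask, 0] *= 0.5              # transform only the real lon values
# when drawing, split the array back into polygons at the sentinel rows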

Answer

You describe your data as being "lists of lists of lists of coordinates". From this I'm guessing your extraction looks like this:

for x in points:
    for y in x:
        for z in y:
            # z is a tuple with GPS coordinates

Do this instead:

import itertools
import numpy

# initially, points is a list of lists of lists
points = itertools.chain.from_iterable(points)
# now points is an iterable producing lists
points = itertools.chain.from_iterable(points)
# now points is an iterable producing coordinates
points = itertools.chain.from_iterable(points)
# now points is an iterable producing individual floating point values
data = numpy.fromiter(points, float)
# data is a numpy array containing all the coordinates
data = data.reshape(data.size // 2, 2)
# data has now been reshaped to be an nx2 array
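For instance, applied to a small nested sample (hypothetical data, just to show the shapes involved):

import itertools
import numpy

# two objects, the second one multi-part
points = [
    [[(1.0, 2.0), (3.0, 4.0)]],                # object 1: one polygon
    [[(5.0, 6.0)], [(7.0, 8.0), (9.0, 0.0)]],  # object 2: two polygons
]
flat = itertools.chain.from_iterable(points)   # -> polygons
flat = itertools.chain.from_iterable(flat)     # -> coordinate pairs
flat = itertools.chain.from_iterable(flat)     # -> individual floats
data = numpy.fromiter(flat, float).reshape(-1, 2)
assert data.shape == (5, 2)                    # one row per coordinate pair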

itertools and numpy.fromiter are both implemented in C and are really efficient. As a result, this should do the transformation very quickly.
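As a possible refinement (not in the original answer): when the total number of values is known up front, fromiter's count argument lets numpy allocate the result once instead of growing it while consuming the iterator:

import itertools
import numpy

pairs = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
n = len(pairs)
values = itertools.chain.from_iterable(pairs)
# count preallocates the output array of exactly n * 2 values
data = numpy.fromiter(values, dtype=float, count=n * 2).reshape(n, 2)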

The second part of your question doesn't really indicate what you want to do with the data. Indexing a numpy array is slower than indexing Python lists. You get speed by performing operations in mass on the data. Without knowing more about what you are doing with that data, it's hard to suggest how to fix it.
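As a generic illustration of operating on the data in mass (this is not the asker's actual transform, just the example scaling from the question):

import numpy

data = numpy.random.rand(100000, 2)  # stand-in lon/lat pairs

# slow: one Python-level operation per point
# for i in xrange(len(data)):
#     data[i, 0] *= 0.5

# fast: one vectorized operation over the whole column
data[:, 0] *= 0.5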

Update:

I've gone ahead and done everything using itertools and numpy. I am not responsible for any brain damage resulting from attempting to understand this code.

import itertools
import numpy

# firstly, we use imap to call GetMyPoints a bunch of times
objects = itertools.imap(GetMyPoints, xrange(100))
# next, we use itertools.chain to flatten it into all of the polygons
polygons = itertools.chain.from_iterable(objects)
# tee gives us two iterators over the polygons
polygons_a, polygons_b = itertools.tee(polygons)
# the lengths will be the length of each polygon
polygon_lengths = itertools.imap(len, polygons_a)
# for the actual points, we'll flatten the polygons into points
points = itertools.chain.from_iterable(polygons_b)
# then we'll flatten the points into values
values = itertools.chain.from_iterable(points)

# package all of that into a numpy array
all_points = numpy.fromiter(values, float)
# reshape the numpy array so we have two values for each coordinate
all_points = all_points.reshape(all_points.size // 2, 2)

# produce an iterator of lengths, but put a zero in front
polygon_positions = itertools.chain([0], polygon_lengths)
# produce another numpy array from this
# however, we take the cumulative sum
# so that each index will be the starting index of a polygon
polygon_positions = numpy.cumsum( numpy.fromiter(polygon_positions, int) )

# now for the transformation
# multiply the first coordinate of every point by 0.5
all_points[:,0] *= .5

# now to get it out

# polygon_positions is all of the starting positions
# polygon_positions[1:] is the same, but shifted forward by one,
# thus it gives us the end of each slice
# slice makes these all slice objects
slices = itertools.starmap(slice, itertools.izip(polygon_positions, polygon_positions[1:]))
# polygons produces an iterator which uses the slices to fetch
# each polygon
polygons = itertools.imap(all_points.__getitem__, slices)

# just iterate over the polygon normally
# each one will be a slice of the numpy array
for polygon in polygons:
    draw_polygon(polygon)

You might find it best to deal with a single polygon at a time. Convert each polygon into a numpy array and do the vector operations on that. You'll probably get a significant speed advantage just doing that. Putting all of your data into numpy might be a little difficult.
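A minimal sketch of that one-polygon-at-a-time approach, reusing the question's GetShapeCoords and draw_polygon placeholders and the example 0.5 transform:

import numpy

for i in xrange(200000):
    for part in GetShapeCoords(i):     # each part: a list of (lon, lat) pairs
        poly = numpy.asarray(part, dtype=float)
        poly[:, 0] *= 0.5              # vectorized transform per part
        draw_polygon(poly.tolist())    # hand plain lists back to PIL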

This is more difficult than most numpy stuff because of your oddly shaped data. Numpy pretty much assumes a world of uniformly shaped data.
