Storing Python objects in a Python list vs. a fixed-length Numpy array
Problem description
In doing some bioinformatics work, I've been pondering the ramifications of storing object instances in a Numpy array rather than a Python list, but in all the testing I've done the performance was worse in every instance. I am using CPython. Does anyone know the reason why?
Specifically:
- What are the performance impacts of using a fixed-length array numpy.ndarray(dtype=object) vs. a regular Python list? Initial tests I performed showed that accessing the Numpy array elements was slower than iteration through the Python list, especially when using object methods.
- Why is it faster to instantiate objects using a list comprehension such as [ X() for i in range(n) ] instead of a numpy.empty(n, dtype=object)?
- What is the memory overhead of each? I was not able to test this. My classes extensively use __slots__, if that has any impact.
Don't use object arrays in numpy for things like this.
They defeat the basic purpose of a numpy array, and while they're useful in a tiny handful of situations, they're almost always a poor choice.
Yes, accessing an individual element of a numpy array in python or iterating through a numpy array in python is slower than the equivalent operation with a list. (Which is why you should never do something like y = [item * 2 for item in x] when x is a numpy array.)
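The iteration claim is easy to check on your own machine. The following is only a rough sketch: the Item class is a hypothetical stand-in for the question's slotted classes, and the timings will vary with the Python and numpy versions, so no expected numbers are given.

```python
import timeit

import numpy as np


class Item:
    """Hypothetical stand-in for the question's slotted classes."""
    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value

    def doubled(self):
        return self.value * 2


n = 100_000
as_list = [Item(i) for i in range(n)]

# Fill an object array of the same length with the same instances.
as_array = np.empty(n, dtype=object)
as_array[:] = as_list


def total_from(container):
    # Plain Python-level iteration, calling a method on each element.
    return sum(item.doubled() for item in container)


list_time = timeit.timeit(lambda: total_from(as_list), number=5)
array_time = timeit.timeit(lambda: total_from(as_array), number=5)
print(f"list: {list_time:.3f}s  object array: {array_time:.3f}s")

# For numerical data, skip the loop and let numpy do the work.
x = np.arange(n, dtype=float)
y_slow = [item * 2 for item in x]  # the pattern warned against above
y_fast = x * 2                     # vectorized equivalent
```

Both y_slow and y_fast hold the same values; the difference is that the second never loops over elements in Python.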
Numpy object arrays will have a slightly lower memory overhead than a list, but if you're storing that many individual python objects, you're going to run into other memory problems first.
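If you want to measure the container overhead yourself, something like the sketch below should work. It only counts the containers (both store references), not the objects they point to, and sys.getsizeof results are CPython implementation details, so treat the numbers as indicative.

```python
import sys

import numpy as np

n = 100_000
objs = [object() for _ in range(n)]

as_list = list(objs)
as_array = np.empty(n, dtype=object)
as_array[:] = objs

# Each slot in an object array is one pointer-sized reference.
print("array itemsize (bytes per reference):", as_array.itemsize)
print("array data buffer (bytes):", as_array.nbytes)

# The list also stores one reference per element, plus its own header
# and possibly some spare capacity.
print("list container (bytes):", sys.getsizeof(as_list))
```

Either way the per-element cost of the container is a single reference; the memory taken by the instances themselves is the same in both cases.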
Numpy is first and foremost a memory-efficient, multidimensional array container for uniform numerical data. If you want to hold arbitrary objects in a numpy array, you probably want a list, instead.
My point is that if you want to use numpy effectively, you may need to re-think how you're structuring things.
Instead of storing each object instance in a numpy array, store your numerical data in a numpy array, and if you need separate objects for each row/column/whatever, store an index into that array in each instance.
This way you can operate on the numerical arrays quickly (i.e. using numpy instead of list comprehensions).
As a quick example of what I'm talking about, here's a trivial example without using numpy:
    from random import random

    class PointSet(object):
        def __init__(self, numpoints):
            self.points = [Point(random(), random()) for _ in range(numpoints)]

        def update(self):
            # Random-walk every point, one Python-level step at a time.
            for point in self.points:
                point.x += random() - 0.5
                point.y += random() - 0.5

    class Point(object):
        def __init__(self, x, y):
            self.x = x
            self.y = y

    points = PointSet(100000)
    point = points.points[10]

    for _ in range(1000):
        points.update()
    print('Position of one point out of 100000:', point.x, point.y)
And a similar example using numpy arrays:
    import numpy as np

    class PointSet(object):
        def __init__(self, numpoints):
            self.coords = np.random.random((numpoints, 2))
            self.points = [Point(i, self.coords) for i in range(numpoints)]

        def update(self):
            """Update along a random walk."""
            # The "+=" is crucial here... We have to update "coords" in-place, in
            # this case.
            self.coords += np.random.random(self.coords.shape) - 0.5

    class Point(object):
        def __init__(self, i, coords):
            self.i = i
            self.coords = coords

        @property
        def x(self):
            return self.coords[self.i, 0]

        @property
        def y(self):
            return self.coords[self.i, 1]

    points = PointSet(100000)
    point = points.points[10]

    for _ in range(1000):
        points.update()
    print('Position of one point out of 100000:', point.x, point.y)
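The comment about "+=" in update deserves a short illustration. Each Point keeps its own reference to the coords array, so the array has to be modified in place; rebinding the name to a new array would leave the points looking at stale data. A minimal sketch (the names alias, in_place_visible and rebinding_visible are just for illustration):

```python
import numpy as np

coords = np.zeros((3, 2))
alias = coords  # a second reference to the same array, like Point.coords

# In-place update: the shared buffer is modified, so alias sees it.
coords += 1.0
in_place_visible = bool(np.all(alias == 1.0))

# Rebinding: "coords + 1.0" builds a new array, and alias is left behind.
coords = coords + 1.0
rebinding_visible = bool(np.all(alias == 2.0))

print(in_place_visible)   # True
print(rebinding_visible)  # False
print(alias is coords)    # False
```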
There are other ways to do this (you may want to avoid storing a reference to a specific numpy array in each point, for example), but I hope it's a useful example.
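One possible variation, sketched here under my own assumptions rather than as the definitive way to do it: have each Point hold a reference to its owning PointSet and look up the array at access time. The owner attribute, the point() method and the seed parameter are names invented for this sketch.

```python
import numpy as np


class PointSet:
    def __init__(self, numpoints, seed=None):
        self._rng = np.random.default_rng(seed)
        self.coords = self._rng.random((numpoints, 2))

    def point(self, i):
        # Build a lightweight accessor on demand instead of storing
        # one Point per row up front.
        return Point(self, i)

    def update(self):
        """Update along a random walk."""
        # Rebinding is safe here: points read self.coords at access time.
        self.coords = self.coords + self._rng.random(self.coords.shape) - 0.5


class Point:
    __slots__ = ("owner", "i")

    def __init__(self, owner, i):
        self.owner = owner
        self.i = i

    @property
    def x(self):
        return self.owner.coords[self.i, 0]

    @property
    def y(self):
        return self.owner.coords[self.i, 1]


points = PointSet(1000, seed=0)
point = points.point(10)
for _ in range(100):
    points.update()
print("Position of one point out of 1000:", point.x, point.y)
```

The trade-off is one extra attribute lookup per access, in exchange for the set being free to replace or resize its array without invalidating existing points.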
Note the difference in speed at which they run. On my machine, the numpy version takes 5 seconds vs. 60 seconds for the pure-python version.