Storing Python objects in a Python list vs. a fixed-length Numpy array

Problem description

In doing some bioinformatics work, I've been pondering the ramifications of storing object instances in a Numpy array rather than a Python list, but in all the testing I've done the performance was worse in every instance. I am using CPython. Does anyone know the reason why?

Specifically:

  • What are the performance impacts of using a fixed-length array numpy.ndarray(dtype=object) vs. a regular Python list? Initial tests I performed showed that accessing the Numpy array elements was slower than iteration through the Python list, especially when using object methods.
  • Why is it faster to instantiate objects using a list comprehension such as [ X() for i in range(n) ] than to fill an array created with numpy.empty(n, dtype=object)?
  • What is the memory overhead of each? I was not able to test this. My classes extensively use __slots__, if that has any impact.

Solution

Don't use object arrays in numpy for things like this.

They defeat the basic purpose of a numpy array, and while they're useful in a tiny handful of situations, they're almost always a poor choice.

Yes, accessing an individual element of a numpy array in python or iterating through a numpy array in python is slower than the equivalent operation with a list. (Which is why you should never do something like y = [item * 2 for item in x] when x is a numpy array.)
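To make the comparison concrete, here is a minimal sketch of the two approaches. The array size, the function names and the repeat count are my own choices for illustration, and absolute timings will vary by machine; the point is only that the vectorized form does the same work without a Python-level loop.

```python
import timeit

import numpy as np

x = np.arange(100_000, dtype=float)


def doubled_with_comprehension():
    # Slow: every iteration step hands a freshly created NumPy scalar
    # object back to the Python interpreter.
    return [item * 2 for item in x]


def doubled_vectorized():
    # Fast: one call, and the loop over elements runs in compiled code.
    return x * 2


slow = timeit.timeit(doubled_with_comprehension, number=10)
fast = timeit.timeit(doubled_vectorized, number=10)
print(f"list comprehension: {slow:.4f} s, vectorized: {fast:.4f} s")
```

Both functions produce the same values; only the second one uses numpy the way it is meant to be used.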

Numpy object arrays will have a slightly lower memory overhead than a list, but if you're storing that many individual python objects, you're going to run into other memory problems first.
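A rough way to check the container overhead yourself is sketched below. The `Item` class and the element count are hypothetical stand-ins for the asker's classes. Note that this only measures the containers: both the list and the object array store one pointer per element, and the objects themselves take the same space either way.

```python
import sys

import numpy as np


class Item:
    """Hypothetical stand-in for a small class that uses __slots__."""
    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value


n = 100_000
items = [Item(i) for i in range(n)]

as_list = list(items)
as_array = np.empty(n, dtype=object)
as_array[:] = items  # fill the pre-allocated array with the same objects

# Size of the containers only (pointers plus header), not of the objects.
print("list container bytes: ", sys.getsizeof(as_list))
print("array data bytes:     ", as_array.nbytes)
# Size of one instance, which is identical for both containers.
print("bytes per Item object:", sys.getsizeof(items[0]))
```

On a typical CPython build the two container sizes should come out close to each other, while the total size of the individual objects is several times larger than either container.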

Numpy is first and foremost a memory-efficient, multidimensional array container for uniform numerical data. If you want to hold arbitrary objects in a numpy array, you probably want a list, instead.


My point is that if you want to use numpy effectively, you may need to re-think how you're structuring things.

Instead of storing each object instance in a numpy array, store your numerical data in a numpy array, and if you need separate objects for each row/column/whatever, store an index into that array in each instance.

This way you can operate on the numerical arrays quickly (i.e. using numpy instead of list comprehensions).

As a quick illustration of what I'm talking about, here's a trivial example that doesn't use numpy:

from random import random

class PointSet:
    def __init__(self, numpoints):
        # One Python object per point, each holding its own coordinates.
        self.points = [Point(random(), random()) for _ in range(numpoints)]

    def update(self):
        """Update along a random walk, one point at a time."""
        for point in self.points:
            point.x += random() - 0.5
            point.y += random() - 0.5

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

points = PointSet(100000)
point = points.points[10]

for _ in range(1000):
    points.update()
    print('Position of one point out of 100000:', point.x, point.y)

And a similar example using numpy arrays:

import numpy as np

class PointSet:
    def __init__(self, numpoints):
        # All coordinates live in a single (numpoints, 2) float array.
        self.coords = np.random.random((numpoints, 2))
        self.points = [Point(i, self.coords) for i in range(numpoints)]

    def update(self):
        """Update along a random walk."""
        # The in-place "+=" is crucial here: every Point holds a reference
        # to this exact array, so rebinding self.coords to a new array
        # would leave the points looking at stale data.
        self.coords += np.random.random(self.coords.shape) - 0.5

class Point:
    def __init__(self, i, coords):
        self.i = i            # row index of this point in the shared array
        self.coords = coords  # reference to the shared array, not a copy

    @property
    def x(self):
        return self.coords[self.i, 0]

    @property
    def y(self):
        return self.coords[self.i, 1]


points = PointSet(100000)
point = points.points[10]

for _ in range(1000):
    points.update()
    print('Position of one point out of 100000:', point.x, point.y)

There are other ways to do this (you may want to avoid storing a reference to a specific numpy array in each point, for example), but I hope it's a useful example.
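One possible variation, sketched here as an assumption rather than as the recommended design: each point stores a reference to its owning `PointSet` instead of to the array itself (the `owner` attribute name is my own). Because the array is looked up on every access, `update` is then free to rebind `self.coords` to a new array.

```python
import numpy as np


class PointSet:
    def __init__(self, numpoints):
        self.coords = np.random.random((numpoints, 2))
        self.points = [Point(i, self) for i in range(numpoints)]

    def update(self):
        """Update along a random walk."""
        # Rebinding is safe here: points look up owner.coords on each
        # access, so they always see the current array.
        self.coords = self.coords + np.random.random(self.coords.shape) - 0.5


class Point:
    __slots__ = ("i", "owner")

    def __init__(self, i, owner):
        self.i = i          # row index of this point
        self.owner = owner  # the PointSet that holds the coordinate array

    @property
    def x(self):
        return self.owner.coords[self.i, 0]

    @property
    def y(self):
        return self.owner.coords[self.i, 1]


points = PointSet(1000)
point = points.points[10]
points.update()
print("Position of one point:", point.x, point.y)
```

The trade-off is one extra attribute lookup per access, plus a reference cycle between the set and its points, which CPython's garbage collector handles.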

Note the difference in the speed at which they run. On my machine, the numpy version takes 5 seconds vs. 60 seconds for the pure-python version.
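If you want to measure the gap on your own machine without the printing overhead, something like the following sketch should do. The helper function names and the repeat count are assumptions of mine, and the numbers you get will differ from the ones quoted above.

```python
import timeit
from random import random

import numpy as np

NUMPOINTS = 100_000


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


def update_pure_python(points):
    # One interpreted loop iteration (and two attribute writes) per point.
    for point in points:
        point.x += random() - 0.5
        point.y += random() - 0.5


def update_numpy(coords):
    # A single in-place vectorized update of the whole array.
    coords += np.random.random(coords.shape) - 0.5


points = [Point(random(), random()) for _ in range(NUMPOINTS)]
coords = np.random.random((NUMPOINTS, 2))

t_python = timeit.timeit(lambda: update_pure_python(points), number=5)
t_numpy = timeit.timeit(lambda: update_numpy(coords), number=5)
print(f"pure Python: {t_python:.3f} s, numpy: {t_numpy:.3f} s for 5 updates")
```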
