Tracking Python 2.7.x object attributes at class level to quickly construct numpy array

Question

Say we have a list of instances of a class, which all have an attribute that we know is a float -- call the attribute x. At various points in a program, we want to extract a numpy array of all values of x for running some analysis on the distribution of x. This extraction process is done a lot, and it's been identified as a slow part of the program. Here is an extremely simple example to illustrate specifically what I have in mind:

import numpy as np

# Create example object with list of values
class stub_object(object):
    def __init__(self, x):
        self.x = x

# Define a list of these fake objects
stubs = [stub_object(i) for i in range(10)]

# ...much later, want to quickly extract a vector of this particular attribute:
numpy_x_array = np.array([a_stub.x for a_stub in stubs])

Here's the question: is there a clever, faster way to track the "x" attribute across instances of stub_object in the "stubs" list, such that constructing the "numpy_x_array" is faster than the process above?

Here's a rough idea I am trying to hammer out: can I create a "global to the class type" numpy vector, which will update as the set of objects updates, but which I can operate on efficiently any time I want?

All I am really looking for is a "nudge in the right direction." Providing keywords I can google / search SO / docs further is exactly what I am looking for.

For what it is worth, I've looked into these, which have gotten me a little further but not completely there:

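  • Getting attributes from arrays of objects in NumPy
    • I don't think the recarray solution will work, because my objects are more complex than the "struct-like" objects described in the accepted answer.
    • Vectorizing the __init__ function is interesting, and I will try it (but I suspect it may get complicated given the true, non-stub_object init structure)
    • This question reminded me that numpy arrays are mutable, which may be the answer. Is this a feature, or a bug that will need correcting in the future?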

    Others I looked at, which were not as helpful:

    • numpy arrays: filling and extracting data quickly
    • Numpy array of object attributes

    (One option, of course, is to "simply" overhaul the structure of the code, such that instead of a "stubs" list of "stub_objects," there is one large object, something like stub_population, which maintains the relevant attributes in lists and/or numpy arrays, and methods that simply act on elements of those arrays. The downside to that is lots of refactoring, and some reduction of the abstraction and flexibility of modeling the "stub_object" as its own thing. I'd like to avoid this if there is a clever way to do so.)

    Edit: I am using 2.7.x

    Edit 2: @hpaulj, your example has been a big help -- answer accepted.

    Here's the extremely simple first-pass version of the example code above that is doing what I want. There are very preliminary indications of a possible order-of-magnitude speedup, without significant rearrangement of the code body. Excellent. Thanks!

    import numpy as np

    size = 20

    # Example class whose x values live in one shared, class-level array
    class stub_object(object):
        _x = np.zeros(size, dtype=np.float64)

        def __init__(self, x, i):
            # A quick cop-out instead of expanding the array:
            if i >= len(self._x):
                raise IndexError("Index i = " + str(i) + " is larger than allowable object size of len(self._x) = " + str(len(self._x)))
            # self.x is a one-element view into the shared array
            self.x = self._x[i:i+1]
            self.set_x(x)

        def get_x(self):
            return self.x[0]

        def set_x(self, x_new):
            self.x[0] = x_new

    # Examine:

    # Define a list of these fake objects (positional arguments: x, then index i)
    stubs = [stub_object(i**2, i) for i in range(size)]

    # ...much later, want to quickly extract a vector of this particular attribute:
    #numpy_x_array = np.array([a_stub.x for a_stub in stubs])

    # Now can do:
    numpy_x_array = stub_object._x  # or
    numpy_x_array = stubs[0]._x     # if need to use the list to access
    

    Not using properties yet, but really like that idea a lot, and it should go a long way in making code very close to unchanged.

    Recommended answer

    The basic problem is that your objects are stored throughout memory, with the attribute in each object's dictionary. But for array work, the values have to be stored in a contiguous data buffer.

    I've explored this in other SO questions, but the ones you found are earlier. Still, I don't have a great deal to add.

    np.array([a_stub.x for a_stub in stubs])
    

    The alternatives using itertools or fromiter shouldn't change speed much because the time consumer is a_stub.x access, not so much the iteration mechanism. You could verify that by testing against something simpler like

    np.array([1 for _ in range(len(stubs))])
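
One way to check the claim above is a small timing script. This is only a sketch: the list size of 1000, the repeat count, and the helper function names are arbitrary choices for illustration, and the timings will vary by machine, so no expected numbers are given.

```python
import timeit

import numpy as np


class stub_object(object):
    def __init__(self, x):
        self.x = x


stubs = [stub_object(float(i)) for i in range(1000)]


def with_attribute_access():
    # The original extraction: one attribute lookup per object
    return np.array([a_stub.x for a_stub in stubs])


def constant_only():
    # Same iteration and array construction, but no attribute lookup
    return np.array([1.0 for _ in range(len(stubs))])


def from_iter():
    # Alternative construction; it still performs one lookup per object
    return np.fromiter((a_stub.x for a_stub in stubs),
                       dtype=float, count=len(stubs))


for func in (with_attribute_access, constant_only, from_iter):
    seconds = timeit.timeit(func, number=200)
    print("%-22s %.4f s" % (func.__name__, seconds))
```

If the first and third timings are close to each other and both clearly above the second, that would support the idea that the attribute lookup, not the iteration mechanism, dominates.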
    

    I suspect the best option is to use one or more arrays as the primary storage, and refactor your class so that the attribute is fetched from that storage.

    If you know you'll have 10 objects, then make an empty array of that size. When you create the object you assign it a unique index. The x attribute can be a property whose getter/setter accesses the data[i] element of that array. By making x a property instead of a primary attribute, you should be able to keep most of the object machinery. And you can experiment with different storage methods by simply changing a couple of methods.

    I was trying to sketch this out using a class attribute as the primary array storage, but I still have some bugs.

    Class with x property that accesses an array:

    class MyObj(object):
        xdata = np.zeros(10)
        def __init__(self,idx, x):
            self._idx = idx
            self.set_x(x)
        def set_x(self,x):
            self.xdata[self._idx] = x
        def get_x(self):
            return self.xdata[self._idx]
        def __repr__(self):
            return "<obj>x=%s"%self.get_x()    
        x = property(get_x, set_x)
    
    In [67]: objs = [MyObj(i, 3*i) for i in range(10)]
    In [68]: objs
    Out[68]: 
    [<obj>x=0.0,
     <obj>x=3.0,
     <obj>x=6.0,
     ...
     <obj>x=27.0]
    In [69]: objs[3].x
    Out[69]: 9.0
    In [70]: objs[3].xdata
    Out[70]: array([  0.,   3.,   6.,   9.,  12.,  15.,  18.,  21.,  24.,  27.])
    In [71]: objs[3].xdata += 3
    In [72]: [o.x for o in objs]
    Out[72]: [3.0, 6.0, 9.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]
    

    Changing the array in place is easiest. But it is also possible to replace the array itself (and thus 'grow' the class set):

    In [79]: MyObj.xdata=np.ones((20,))    
    In [80]: a = MyObj(11,25)
    In [81]: a
    Out[81]: <obj>x=25.0
    In [82]: MyObj.xdata
    Out[82]: 
    array([  1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,
            25.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.])
    In [83]: [o.x for o in objs]
    Out[83]: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
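
Replacing the array with np.ones, as in the session above, discards the old values. A minimal sketch of growing the storage while keeping them might look like the following. The class name GrowingObj, the class-level counter _count, the doubling policy, and the values() helper are all assumptions made for illustration; none of them come from the session above.

```python
import numpy as np


class GrowingObj(object):
    xdata = np.zeros(4)   # shared storage, deliberately small
    _count = 0            # number of slots handed out so far (hypothetical)

    def __init__(self, x):
        cls = type(self)
        if cls._count >= len(cls.xdata):
            # Storage is full: allocate twice the capacity and copy old values
            bigger = np.zeros(2 * len(cls.xdata))
            bigger[:cls._count] = cls.xdata[:cls._count]
            cls.xdata = bigger
        self._idx = cls._count   # index assigned automatically
        cls._count += 1
        self.x = x

    @property
    def x(self):
        # Look up the array through the class on every access, so that a
        # replaced (grown) array is picked up by existing objects
        return type(self).xdata[self._idx]

    @x.setter
    def x(self, value):
        type(self).xdata[self._idx] = value

    @classmethod
    def values(cls):
        # Only the slots that are actually in use
        return cls.xdata[:cls._count]


objs = [GrowingObj(3 * i) for i in range(10)]
print(GrowingObj.values())
print(objs[3].x)
```

Because the property indexes the class array on each access, existing objects keep working after the array is replaced. Objects that cache a slice of the old array would not, since their view would still point at the old buffer.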
    

    We have to be careful about modifying attributes. For example I tried

    objs[3].xdata += 3
    

    intending to change xdata for the whole class. But this ended up assigning a new xdata array just for that object. We should also be able to auto-increment the object index (these days I'm more familiar with numpy methods than Python class structures).
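
The rebinding can be seen directly with a trimmed-down class. This is a sketch under the usual Python semantics for augmented assignment on an attribute (read the attribute, apply the in-place operation, then assign the result back on the instance); the three-element array and the reduced MyObj are illustrative only.

```python
import numpy as np


class MyObj(object):
    xdata = np.zeros(3)   # class-level storage

    def __init__(self, idx, x):
        self._idx = idx
        self.xdata[idx] = x   # item assignment: writes into the class array


objs = [MyObj(i, 3 * i) for i in range(3)]

# Before: xdata is found only on the class
before = 'xdata' in objs[1].__dict__

# Augmented assignment through the instance: the array is updated in place,
# then the result is bound as an *instance* attribute of objs[1]
objs[1].xdata += 3
after = 'xdata' in objs[1].__dict__
same_array = objs[1].xdata is MyObj.xdata

# Once the class array is replaced, objs[1] still refers to the old one
MyObj.xdata = np.ones(3)
stale = objs[1].xdata is not MyObj.xdata

print(before, after, same_array, stale)
```

Going through the class, for example MyObj.xdata += 3 or MyObj.xdata[:] = new_values, avoids creating the instance attribute.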

    If I replace the getter with one that fetches a slice:

     def get_x(self):
         return self.xdata[self._idx:self._idx+1]
    
    In [107]: objs=[MyObj(i,i*3) for i in range(10)]
    In [109]: objs
    Out[109]: 
    [<obj>x=[ 0.],
     <obj>x=[ 3.],
     ...
     <obj>x=[ 27.]]
    

    np.info (or .__array_interface__) gives me information about the xdata array, including its data buffer pointer:

    In [110]: np.info(MyObj.xdata)
    class:  ndarray
    shape:  (10,)
    strides:  (8,)
    itemsize:  8
    aligned:  True
    contiguous:  True
    fortran:  True
    data pointer: 0xabf0a70
    byteorder:  little
    byteswap:  False
    type: float64
    

    The slice for the 1st object points to the same place:

    In [111]: np.info(objs[0].x)
    class:  ndarray
    shape:  (1,)
    strides:  (8,)
    itemsize:  8
    ....
    data pointer: 0xabf0a70
    ...
    

    The next object points to the next float (8 bytes further):

    In [112]: np.info(objs[1].x)
    class:  ndarray
    shape:  (1,)
    ...
    data pointer: 0xabf0a78
    ....
    

    I'm not sure that access by slice/view is worth it or not.
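
For anyone weighing that trade-off, the view behaviour shown by np.info can be reproduced in a few lines. This is a standalone sketch with a plain array instead of the MyObj class; the variable names are illustrative.

```python
import numpy as np

xdata = np.zeros(10)     # float64, so each item occupies 8 bytes

first = xdata[0:1]       # one-element slices are views, not copies
second = xdata[1:2]

# Writing through a view changes the shared buffer
first[0] = 42.0
print(xdata[0])

# The view records which array owns the memory
print(first.base is xdata)

# The data pointers differ by exactly one item
addr0 = first.__array_interface__['data'][0]
addr1 = second.__array_interface__['data'][0]
print(addr1 - addr0, xdata.itemsize)
```

A view stays tied to the buffer it was sliced from, so if the class later swaps in a new array, views held by existing objects keep pointing at the old one. Index-based access through a property does not have that problem.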
