deepcopy and python - tips to avoid using it?

Question

I have a very simple Python routine that loops over a list of roughly 20,000 latitude/longitude coordinates and calculates the distance of each point to a reference point.

def compute_nearest_points( lat, lon, nPoints=5 ):
    """Find the nearest N points, given the input coordinates."""

    points = session.query(PointIndex).all()
    oldNearest = []
    newNearest = []
    for n in xrange(nPoints):
        oldNearest.append(PointDistance(None,None,None,99999.0,99999.0))
        newNearest.append(PointDistance(None,None,None,99999.0,99999.0))

    #This is almost certainly an inappropriate use of deepcopy
    #  but how SHOULD I be doing this?!?!
    for point in points:
        distance = compute_spherical_law_of_cosines( lat, lon, point.avg_lat, point.avg_lon )
        k = 0
        for p in oldNearest:
            if distance < p.distance:
                newNearest[k] = PointDistance(
                    point.point, point.kana, point.english, point.avg_lat, point.avg_lon, distance=distance
                    )
                break
            else:
                newNearest[k] = deepcopy(oldNearest[k])
            k += 1
        for j in range(k,nPoints-1):
            newNearest[j+1] = deepcopy(oldNearest[j])
        oldNearest = deepcopy(newNearest)

    #We're done, now print the result
    for point in oldNearest:
        print point.point, point.english, point.distance

    return
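
The compute_spherical_law_of_cosines helper isn't shown here; a minimal sketch of such a distance function, assuming coordinates in decimal degrees and a result in kilometres, could look like this:

from math import acos, cos, radians, sin

def compute_spherical_law_of_cosines(lat1, lon1, lat2, lon2):
    """Great-circle distance (km) between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    # Clamp to [-1, 1] to guard against floating-point error for near-identical points
    cos_angle = min(1.0, max(-1.0,
        sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lon2 - lon1)))
    return 6371.0 * acos(cos_angle)  # 6371 km is a commonly used mean Earth radius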

I initially wrote this in C, using the exact same approach, and it works fine there, and is basically instantaneous for nPoints<=100. So I decided to port it to python because I wanted to use SqlAlchemy to do some other stuff.

I first ported it without the deepcopy statements that now pepper the method, and this caused the results to be 'odd', or partially incorrect, because some of the points were just getting copied as references (I guess? I think?) -- but it was still pretty nearly as fast as the C version.

Now with the deepcopy calls added, the routine does its job correctly, but it has incurred an extreme performance penalty and now takes several seconds to do the same job.

This seems like a pretty common job, but I'm clearly not doing it the pythonic way. How should I be doing this so that I still get the correct results but don't have to include deepcopy everywhere?

EDIT:

I've hit on a much simpler and faster solution:

def compute_nearest_points2( lat, lon, nPoints=5 ):
    """Find the nearest N points, given the input coordinates."""

    points = session.query(PointIndex).all()
    nearest = []

    for point in points:
        distance = compute_spherical_law_of_cosines( lat, lon, point.avg_lat, point.avg_lon )
        nearest.append( 
            PointDistance(
                point.point, point.kana, point.english, point.avg_lat, point.avg_lon, distance=distance
                )
            )

    nearest_points = sorted(nearest, key=lambda point: point.distance)[:nPoints]     
    for item in nearest_points:
        print item.point, item.english, item.distance
    return

So basically I'm just making a complete copy of the input and appending a new value - the distance from the reference point. Then I just apply 'sorted' to the resulting list, specifying that the sort key should be the distance property of the PointDistance object.

This is much faster than using deepcopy, although I confess I don't really understand why. I guess it comes down to the efficient C implementation of Python's sorted?
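
As a small aside, the lambda key in the snippet above could also be written with operator.attrgetter, which performs the same attribute lookup:

from operator import attrgetter

nearest_points = sorted(nearest, key=attrgetter('distance'))[:nPoints]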

Answer

Okay, simplest things first:


  1. deepcopy is slow in general, since it has to do a lot of internal bookkeeping to handle pathological cases (like objects containing themselves) in a sane way. See, for instance, this page, or take a look at the source code of deepcopy in copy.py, which is somewhere on your Python path.

  2. sorted is fast, since it is implemented in C -- much faster than an equivalent sort written in Python.

Now, some more about Python's reference counting behaviour, as you asked in your comment. In Python, variables are references. When you say a=1, think of it as having 1 as an object existing on its own, with a just a tag attached to it. In some other languages like C, variables are containers (not tags), and when you do a=1, you actually put 1 into a. This does not hold for Python, where variables are references. This has some interesting consequences that you may have also stumbled upon:

>>> a = []      # construct a new list, attach a tag named "a" to it
>>> b = a       # attach a tag named "b" to the object which is tagged by "a"
>>> a.append(1) # append 1 to the list tagged by "a"
>>> print b     # print the list tagged by "b"
[1]

This behaviour is seen because lists are mutable objects: you can modify a list after you have created it, and the modification is seen when accessing the list through any of the variables that refer to it. The immutable equivalents of lists are tuples:

>>> a = ()      # construct a new tuple, attach a tag named "a" to it
>>> b = a       # now "b" refers to the same empty tuple as "a"
>>> a += (1, 2) # "appending" to the tuple -- this actually builds a new tuple
>>> print b
()

Here, a += (1, 2) creates a new tuple from the existing tuple referred to by a, plus another tuple (1, 2) that is constructed on the fly, and a is adjusted to point to the new tuple, while of course b still refers to the old one. The same happens with simple numeric additions like a = a+2: in this case the number is not mutated in any way; Python constructs a new number and moves a to point to it. So, in a nutshell: numbers, strings and tuples are immutable; lists, dicts and sets are mutable. User-defined classes are in general mutable unless you explicitly ensure that their internal state cannot be mutated. And there's frozenset, which is an immutable set. Plus many others, of course :)
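
The same rebinding can be seen interactively with a plain number:

>>> a = 1
>>> b = a       # "b" now refers to the same int object as "a"
>>> a = a + 2   # builds a new int 3 and rebinds "a"; "b" is untouched
>>> print a, b
3 1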

I don't know why your original code didn't work, but you probably hit a behaviour related to the list snippet I've shown above, since your PointDistance class is also mutable by default. An alternative could be the namedtuple class from collections, which constructs a tuple-like object whose fields can also be accessed by name. For instance, you could have done this:

from collections import namedtuple
PointDistance = namedtuple("PointDistance", "point distance")

This creates a PointDistance class for you that has two named fields: point and distance. In your main for loop, you can assign these appropriately. Since the point objects referred to by the point fields won't be modified during the course of your for loop, and distance is a number (which is, by definition, immutable), you should be safe doing it this way. But yes, in general it seems like simply using sorted is faster, since sorted is implemented in C. You might also have luck with the heapq module, which implements a heap data structure backed by an ordinary Python list, so it lets you find the top k elements without having to sort the others. However, since heapq is also implemented in Python, chances are that sorted works better unless you have a whole lot of points.
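
To make the heapq suggestion concrete, here is a rough sketch of how the namedtuple above and heapq.nsmallest could fit together; it reuses the session, PointIndex and compute_spherical_law_of_cosines names from the question:

import heapq
from operator import attrgetter

def compute_nearest_points_heap(lat, lon, nPoints=5):
    """Find the nearest N points using a heap instead of a full sort."""
    # Build PointDistance records lazily; nsmallest keeps only the best nPoints around
    candidates = (
        PointDistance(p, compute_spherical_law_of_cosines(lat, lon, p.avg_lat, p.avg_lon))
        for p in session.query(PointIndex).all()
    )
    return heapq.nsmallest(nPoints, candidates, key=attrgetter('distance'))

Whether this actually beats the plain sorted call depends on how large nPoints is relative to the full list, as noted above.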

Finally, I'd like to add that I have never had to use deepcopy so far, so I guess there are ways to avoid it in most cases.
