pandas and numpy thread safety

Question

I'm using pandas on a web server (Apache + mod_wsgi + Django) and have a hard-to-reproduce bug which, as I have now discovered, is caused by pandas not being thread-safe.

After a lot of code reduction, I finally arrived at a short standalone program that reproduces the problem. You can see it below.

The point is: contrary to the answer to this question, this example shows that pandas can crash even on very simple operations that do not modify a DataFrame. I cannot see how this simple code snippet could possibly be unsafe with threads...

The question is about using pandas and numpy in a web server. Is it possible? How am I supposed to fix my code that uses pandas? (An example of lock usage would be helpful.)

Here is the code that causes a segmentation fault:

import threading
import pandas as pd
import numpy as np

def let_crash(crash=True):
    t = 0.02 * np.arange(100000)  # works with 10000
    data = pd.DataFrame({'t': t})
    if crash:
        data['t'] * 1.5  # CRASH (segmentation fault)
    else:
        data['t'].values * 1.5  # THIS IS OK!

if __name__ == '__main__':
    threads = []
    for i in range(100):
        if True:  # asynchronous
            t = threading.Thread(target=let_crash, args=())
            t.daemon = True
            t.start()
            threads.append(t)
        else:  # synchronous
            let_crash()
    for t in threads:
        t.join()

My environment: Python 2.7.3, numpy 1.8.0, pandas 0.13.1

Recommended answer

See the caveat in the docs here: http://pandas.pydata.org/pandas-docs/dev/gotchas.html#thread-safety

pandas is not thread-safe because the underlying copy mechanism is not. I believe numpy has an atomic copy operation, but pandas adds a layer on top of it.

Copying is the basis of pandas operations, since most operations generate a new object to return to the user.
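
To make the point about copying concrete, here is a minimal sketch showing that the multiplication from the question returns a new object and leaves the original column untouched. The variable names (`before`, `scaled`) are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'t': 0.02 * np.arange(5)})

# Keep an independent snapshot of the column so we can compare afterwards.
before = data['t'].to_numpy().copy()

# The arithmetic operation builds and returns a new Series.
scaled = data['t'] * 1.5

# The result lives in its own buffer, separate from the original column.
shares_buffer = np.shares_memory(scaled.to_numpy(), data['t'].to_numpy())

# The original column still holds the values it had before the operation.
unchanged = np.array_equal(before, data['t'].to_numpy())
```

So even a "read-only" expression such as `data['t'] * 1.5` allocates and fills a new object internally, which is where the answer locates the thread-safety problem.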

Fixing this is not trivial and would come with a fairly heavy performance cost, so it would take some work to handle properly.

The easiest approach is simply not to share objects across threads, or to lock them while they are in use.
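
Since the question asks for an example of lock usage, here is a minimal sketch of the locking approach, assuming a single module-level `threading.Lock` that every thread acquires before touching pandas. The names `pandas_lock`, `scale_column`, and `results` are hypothetical and do not come from the original post, and this has not been verified against the asker's environment (Python 2.7.3, pandas 0.13.1).

```python
import threading

import numpy as np
import pandas as pd

# Hypothetical lock shared by all threads; every pandas call goes through it.
pandas_lock = threading.Lock()
results = []


def scale_column(factor=1.5):
    t = 0.02 * np.arange(100000)  # plain numpy work stays outside the lock
    with pandas_lock:
        # Only one thread at a time creates and operates on pandas objects.
        data = pd.DataFrame({'t': t})
        scaled = data['t'] * factor
        results.append(len(scaled))


threads = [threading.Thread(target=scale_column) for _ in range(20)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

The trade-off is that all pandas work is serialized, so threads gain no parallelism for that part of the request. The other option mentioned in the answer, not sharing objects across threads, avoids the lock but requires each thread to build and keep its own DataFrames.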
