pandas and numpy thread safety
Question
I'm using pandas on a web server (Apache + mod_wsgi + Django) and have a hard-to-reproduce bug which, I have now discovered, is caused by pandas not being thread-safe.
After a lot of code reduction I finally arrived at a short standalone program that reproduces the problem. You can see it below.
The point is: contrary to the answer to this question, this example shows that pandas can crash even on very simple operations that do not modify a dataframe. I cannot see how such a simple code snippet could be unsafe with threads...
The question is about using pandas and numpy in a web server. Is it possible? How am I supposed to fix my code that uses pandas? (An example of lock usage would be helpful.)
Here is the code which causes a segmentation fault:
import threading

import numpy as np
import pandas as pd


def let_crash(crash=True):
    t = 0.02 * np.arange(100000)  # OK with 10000
    data = pd.DataFrame({'t': t})
    if crash:
        data['t'] * 1.5  # CRASH: arithmetic on the pandas Series
    else:
        data['t'].values * 1.5  # THIS IS OK: arithmetic on the raw numpy array


if __name__ == '__main__':
    threads = []
    for i in range(100):
        if True:  # asynchronous: run let_crash in its own thread
            t = threading.Thread(target=let_crash, args=())
            t.daemon = True
            t.start()
            threads.append(t)
        else:  # synchronous: run let_crash in the main thread
            let_crash()
    # Wait for all worker threads to finish.
    for t in threads:
        t.join()
My environment: python 2.7.3, numpy 1.8.0, pandas 0.13.1
Recommended answer
See the caveat in the docs here: http://pandas.pydata.org/pandas-docs/dev/gotchas.html#thread-safety
pandas is not thread-safe because its underlying copy mechanism is not. I believe numpy has an atomic copy operation, but pandas adds a layer on top of it.
Copying is the basis of pandas operations, as most operations generate a new object to return to the user.
Fixing this is not trivial and would carry a pretty heavy performance cost, so handling it properly would take a fair amount of work.
The easiest approach is simply not to share objects across threads, or to lock them while they are in use.
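Since the question asked for an example of lock usage, here is a minimal sketch of the locking approach. The names `PANDAS_LOCK`, `scaled_sum` and `run_threads` are made up for illustration and are not part of pandas. It assumes that serializing every pandas call behind one process-wide lock is acceptable for the workload, and it has not been checked against the pandas 0.13.1 setup from the question.

```python
import threading

import numpy as np
import pandas as pd

# Hypothetical process-wide lock: every thread takes it before touching pandas.
PANDAS_LOCK = threading.Lock()


def scaled_sum(n=100000, factor=1.5):
    """Build a DataFrame and scale one column while holding the lock."""
    t = 0.02 * np.arange(n)  # plain numpy work stays outside the lock
    with PANDAS_LOCK:
        data = pd.DataFrame({'t': t})
        scaled = data['t'] * factor  # the operation that crashed in the question
        return float(scaled.sum())


def run_threads(count=20):
    """Run scaled_sum in `count` threads and collect the results."""
    results = []
    results_lock = threading.Lock()

    def worker():
        value = scaled_sum()
        with results_lock:
            results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(count)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


if __name__ == '__main__':
    print(len(run_threads()))
```

Only one thread at a time runs the pandas section, so the threads gain no parallelism there. That is the price of this approach; keeping the locked region as small as possible limits the cost.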