pandas performance issue - need help to optimize
Question
I wrote some Python code that makes heavy use of the pandas library. The code seems to be a bit slow, so I ran it through cProfile to see where the bottlenecks are. According to the cProfile results, one of the bottlenecks is the call to pandas.lib.scalar_compare:
1604 262.301 0.164 262.301 0.164 {pandas.lib.scalar_compare}
My question is this - under what circumstances does this get called? I assume it's when I select part of a DataFrame. Here is what my code looks like:
if var == '9999':
    dataTable = resultTable.ix[(resultTable['col1'] == var1)
                               & (resultTable['col2'] == var2)].copy()
else:
    dataTable = resultTable.ix[(resultTable['col1'] == var1)
                               & (resultTable['col2'] == var2)
                               & (resultTable['col3'] == int(val3))].copy()
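(Editorial note: `.ix` was later deprecated and removed from pandas; on a modern pandas the same selection is written with a boolean mask and `.loc`. A minimal runnable sketch - the column names follow the snippet above, but the data and variable values are made up for illustration:)

```python
import pandas as pd

# Hypothetical data standing in for resultTable.
resultTable = pd.DataFrame({
    'col1': ['a', 'b', 'a', 'c'],
    'col2': ['x', 'x', 'y', 'x'],
    'col3': [1, 2, 1, 3],
})

var, var1, var2, val3 = '9999', 'a', 'x', '1'

# Build the shared mask once; add the third condition only when needed.
mask = (resultTable['col1'] == var1) & (resultTable['col2'] == var2)
if var != '9999':
    mask &= resultTable['col3'] == int(val3)

dataTable = resultTable.loc[mask].copy()
print(dataTable)
```

Building the mask once also avoids duplicating the two shared comparisons across the if/else branches.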
I have the following questions:
- Is that the code snippet that eventually calls the code that causes the bottleneck?
- If so, is there any way to optimize this? The version of pandas I am currently using is pandas-0.8.
Any help on this would be greatly appreciated.
Answer
My code was spending a ton of time in pandas.lib.scalar_compare, and I was able to increase the speed by 10x by converting the dtype of string-based columns to 'category'.
For example:
df['ResourceName'] = df['ResourceName'].astype('category')
For more information, see https://www.continuum.io/content/pandas-categoricals
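(Editorial note: a small self-contained sketch of the suggestion above, with made-up data and the `ResourceName` column name from the answer. A categorical column stores integer codes plus a small lookup table of unique values, so equality comparisons run on integers instead of Python strings - that is where the speedup comes from. Note the `category` dtype was introduced in pandas 0.15, so it is not available on the pandas-0.8 mentioned in the question:)

```python
import pandas as pd

df = pd.DataFrame({'ResourceName': ['cpu', 'disk', 'cpu', 'net'] * 1000})

# Before: an object-dtype column compares Python strings one by one.
print(df['ResourceName'].dtype)

# After: the column holds integer codes plus a categories lookup table.
df['ResourceName'] = df['ResourceName'].astype('category')
print(df['ResourceName'].dtype)

# Filtering works exactly as before, but compares integer codes internally.
subset = df[df['ResourceName'] == 'cpu']
print(len(subset))
```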