Pandas DataFrame performance


Question


Pandas is really great, but I am really surprised by how inefficient it is to retrieve values from a Pandas.DataFrame. In the following toy example, even the DataFrame.iloc method is more than 100 times slower than a dictionary.

The question: Is the lesson here just that dictionaries are the better way to look up values? Yes, I get that that is precisely what they were made for. But I just wonder if there is something I am missing about DataFrame lookup performance.

I realize this question is more "musing" than "asking" but I will accept an answer that provides insight or perspective on this. Thanks.

import timeit

setup = '''
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
dictionary = df.to_dict()
'''

f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]']

for func in f:
    print(func)
    print(min(timeit.Timer(func, setup).repeat(3, 100000)))

value = dictionary[5][5]
0.130625009537
value = df.loc[5, 5]
19.4681699276
value = df.iloc[5, 5]
17.2575249672

Solution

A dict is to a DataFrame as a bicycle is to a car. You can pedal 10 feet on a bicycle faster than you can start a car, get it in gear, etc, etc. But if you need to go a mile, the car wins.

For certain small, targeted purposes, a dict may be faster. And if that is all you need, then use a dict, for sure! But if you need/want the power and luxury of a DataFrame, then a dict is no substitute. It is meaningless to compare speed if the data structure does not first satisfy your needs.
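As a side note that is not part of the measurements in the question: pandas also has dedicated scalar accessors, DataFrame.at (label-based) and DataFrame.iat (position-based), which are intended for fetching a single cell. The sketch below times them next to the lookups from the question. The names statements, namespace and timings are my own, and the numbers will vary with the pandas version and the machine, so treat it as a way to measure rather than as a result.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros(shape=[10, 10]))
dictionary = df.to_dict()

# Each statement fetches the same single cell.
statements = {
    "dict": "dictionary[5][5]",
    "loc": "df.loc[5, 5]",
    "iloc": "df.iloc[5, 5]",
    "at": "df.at[5, 5]",    # label-based scalar accessor
    "iat": "df.iat[5, 5]",  # position-based scalar accessor
}

namespace = {"df": df, "dictionary": dictionary}
timings = {
    name: min(timeit.repeat(stmt, globals=namespace, repeat=3, number=10000))
    for name, stmt in statements.items()
}

for name, seconds in timings.items():
    print(f"{name:>4}: {seconds:.4f} s")
```

Even with the scalar accessors, a plain dict lookup does far less work per call, so the point of the answer stands: pick the structure that fits the job first, then compare speed.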

Now for example -- to be more concrete -- a dict is good for accessing columns, but it is not so convenient for accessing rows.

import timeit

setup = '''
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 1000]))
dictionary = df.to_dict()
'''

# f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]']
f = ['value = [val[5] for col,val in dictionary.items()]', 'value = df.loc[5]', 'value = df.iloc[5]']

for func in f:
    print(func)
    print(min(timeit.Timer(func, setup).repeat(3, 100000)))

yields

value = [val[5] for col,val in dictionary.items()]
25.5416321754
value = df.loc[5]
5.68071913719
value = df.iloc[5]
4.56006002426

So the dict (strictly a dict of dicts, since that is what df.to_dict() returns by default) is more than 5 times slower at retrieving rows than df.iloc. The speed deficit becomes greater as the number of columns grows. (The number of columns is like the number of feet in the bicycle analogy. The longer the distance, the more convenient the car becomes...)

This is just one example of when a dict of lists would be less convenient/slower than a DataFrame.
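To see how the row-retrieval gap depends on the number of columns, the comparison above can be repeated for several widths. This is only a sketch: the helper time_row_access and the column counts (10, 100, 1000) are my own choices for illustration, and the repeat count is kept small so it runs quickly.

```python
import timeit

import numpy as np
import pandas as pd


def time_row_access(n_cols, number=200):
    """Return (dict_seconds, iloc_seconds) for fetching row 5 of a 10 x n_cols frame."""
    df = pd.DataFrame(np.zeros(shape=[10, n_cols]))
    dictionary = df.to_dict()  # {column: {row_label: value}}
    namespace = {"df": df, "dictionary": dictionary}

    # Both approaches must return the same row before timing them is meaningful.
    assert [col[5] for col in dictionary.values()] == df.iloc[5].tolist()

    dict_time = min(timeit.repeat(
        "[col[5] for col in dictionary.values()]",
        globals=namespace, repeat=3, number=number))
    iloc_time = min(timeit.repeat(
        "df.iloc[5]", globals=namespace, repeat=3, number=number))
    return dict_time, iloc_time


results = {n: time_row_access(n) for n in (10, 100, 1000)}
for n_cols, (dict_time, iloc_time) in results.items():
    print(f"{n_cols:>5} columns: dict {dict_time:.5f} s, iloc {iloc_time:.5f} s")
```

The dict version has to touch every column in a Python loop, so its cost grows with the width of the table, while df.iloc[5] has a roughly fixed overhead per call.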

Another example would be when you have a DatetimeIndex for the rows and wish to select all rows between certain dates. With a DataFrame you can use

df.loc['2000-1-1':'2000-3-31']

There is no easy analogue for that if you were to use a dict of lists. And the Python loops you would need to use to select the right rows would again be terribly slow compared to the DataFrame.
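A minimal sketch of that difference, assuming one row per day for the year 2000 and a single made-up column called value. The dict-of-lists layout (a "date" list next to a "value" list) is also just an assumption about how one might store the same data without pandas.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", "2000-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(dates))}, index=dates)

# DataFrame: label-based slicing on a DatetimeIndex includes both endpoints.
selected_df = df.loc["2000-1-1":"2000-3-31"]

# Dict of lists: there is no slicing by date, so every row is tested in a Python loop.
data = {"date": list(dates), "value": df["value"].tolist()}
start, end = pd.Timestamp("2000-01-01"), pd.Timestamp("2000-03-31")
selected_values = [
    value
    for date, value in zip(data["date"], data["value"])
    if start <= date <= end
]

print(len(selected_df), len(selected_values))
```

Both selections contain the same rows, but the DataFrame version is a single expression and does not need a per-row check in Python.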
