Why is DataFrame.loc[[1]] 1,800x slower than df.ix[[1]] and 3,500x slower than df.loc[1]?

Problem description

Try it yourself:

import pandas as pd
s=pd.Series(xrange(5000000))
%timeit s.loc[[0]] # You need pandas 0.15.1 or newer for it to be that slow
1 loops, best of 3: 445 ms per loop

Update: this is a legitimate bug in pandas that was probably introduced in 0.15.1, in August 2014 or so. Workarounds: stay on an older version of pandas until a new release is out; get a cutting-edge dev version from GitHub; manually make a one-line modification in your installed release of pandas; or temporarily use .ix instead of .loc.

I have a DataFrame with 4.8 million rows, and selecting a single row using .loc[[id]] (with a single-element list) takes 489 ms, almost half a second. That is 1,800x slower than the identical .ix[[id]], and 3,500x slower than .loc[id] (passing the id as a value, not as a list). To be fair, .loc[list] takes about the same time regardless of the length of the list, but I don't want to spend 489 ms on it, especially when .ix is more than a thousand times faster and produces an identical result. It was my understanding that .ix was supposed to be slower, wasn't it?

I am using pandas 0.15.1. The excellent tutorial on Indexing and Selecting Data suggests that .ix is somehow more general, and presumably slower, than .loc and .iloc. Specifically, it says:

However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it’s usually better to be explicit and use .iloc or .loc.
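The distinction in that quote can be seen on a tiny frame. This is a minimal sketch, assuming a recent pandas release (where .ix no longer exists, so only .loc and .iloc are shown); the toy frame and its column name are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame: integer labels start at 1, so label != position.
df = pd.DataFrame({"value": [10, 20, 30]}, index=[1, 2, 3])

by_label = df.loc[[1]]      # label 1 is the first row
by_position = df.iloc[[1]]  # position 1 is the second row

print(by_label["value"].tolist())
print(by_position["value"].tolist())

# Label 3 exists, but position 3 does not (there are only 3 rows).
try:
    df.iloc[[3]]
    iloc_failed = False
except IndexError:
    iloc_failed = True
print("iloc[[3]] out of bounds:", iloc_failed)
```

The benchmark session in the question does the same sanity check with its try/except around df.iloc[[id]].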

Here is an IPython session with the benchmarks:

    print 'The dataframe has %d entries, indexed by integers that are less than %d' % (len(df), max(df.index)+1)
    print 'df.index begins with ', df.index[:20]
    print 'The index is sorted:', df.index.tolist()==sorted(df.index.tolist())

    # First extract one element directly. Expected result, no issues here.
    id=5965356
    print 'Extract one element with id %d' % id
    %timeit df.loc[id]
    %timeit df.ix[id]
    print hash(str(df.loc[id])) == hash(str(df.ix[id])) # check we get the same result

    # Now extract this one element as a list.
    %timeit df.loc[[id]] # SO SLOW. 489 ms vs 270 microseconds for .ix, or 139 microseconds for .loc[id]
    %timeit df.ix[[id]] 
    print hash(str(df.loc[[id]])) == hash(str(df.ix[[id]]))  # this one should be True
    # Let's double-check that in this case .ix is the same as .loc, not .iloc, 
    # as this would explain the difference.
    try:
        print hash(str(df.iloc[[id]])) == hash(str(df.ix[[id]]))
    except:
        print 'Indeed, %d is not even a valid iloc[] value, as there are only %d rows' % (id, len(df))

    # Finally, for the sake of completeness, let's take a look at iloc
    %timeit df.iloc[3456789]    # this is still 100+ times faster than the next version
    %timeit df.iloc[[3456789]]

Output:

The dataframe has 4826616 entries, indexed by integers that are less than 6177817
df.index begins with  Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype='int64')
The index is sorted: True
Extract one element with id 5965356
10000 loops, best of 3: 139 µs per loop
10000 loops, best of 3: 141 µs per loop
True
1 loops, best of 3: 489 ms per loop
1000 loops, best of 3: 270 µs per loop
True
Indeed, 5965356 is not even a valid iloc[] value, as there are only 4826616 rows
10000 loops, best of 3: 98.9 µs per loop
100 loops, best of 3: 12 ms per loop
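
The ratios quoted in the question can be checked against the timings in this output. A small sketch, with the timings copied from the output above and variable names chosen here for illustration:

```python
# Timings from the benchmark output, in microseconds.
loc_list_us = 489000.0    # df.loc[[id]]
ix_list_us = 270.0        # df.ix[[id]]
loc_scalar_us = 139.0     # df.loc[id]
iloc_scalar_us = 98.9     # df.iloc[3456789]
iloc_list_us = 12000.0    # df.iloc[[3456789]]

loc_vs_ix = loc_list_us / ix_list_us
loc_vs_scalar = loc_list_us / loc_scalar_us
iloc_list_vs_scalar = iloc_list_us / iloc_scalar_us

print("loc[[id]] vs ix[[id]]: %.0fx" % loc_vs_ix)
print("loc[[id]] vs loc[id]: %.0fx" % loc_vs_scalar)
print("iloc[[i]] vs iloc[i]: %.0fx" % iloc_list_vs_scalar)
```

These come out at roughly 1,800x, 3,500x and a little over 100x, matching the figures in the title and in the comments of the session.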

Recommended answer

It looks like the issue was not present in pandas 0.14. I profiled it with line_profiler, and I think I know what happened. Since pandas 0.15.1, a KeyError is raised if a given index is not present. It looks like when you use the .loc[list] syntax, pandas does an exhaustive search for the index along the entire axis, even after it has been found. That is, first, there is no early termination when an element is found and, second, the search in this case is brute-force. In the profile below, the check on line 1280 accounts for 99.9% of the time.

File: .../anaconda/lib/python2.7/site-packages/pandas/core/indexing.py

  1278                                                       # require at least 1 element in the index
  1279         1          241    241.0      0.1              idx = _ensure_index(key)
  1280         1       391040 391040.0     99.9              if len(idx) and not idx.isin(ax).any():
  1281                                           
  1282                                                           raise KeyError("None of [%s] are in the [%s]" %
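
To illustrate why a membership check that touches the whole axis is expensive, here is a pure-Python sketch. It is not pandas' implementation; the function names and the toy axis are hypothetical. It contrasts building a lookup structure from every axis label on each call with probing a mapping that already exists.

```python
import timeit


def any_present_full_scan(keys, axis_labels):
    """Build a fresh set from the whole axis on every call: O(len(axis_labels))."""
    axis_set = set(axis_labels)
    return any(k in axis_set for k in keys)


def any_present_prebuilt(keys, label_to_pos):
    """Probe an existing label -> position mapping: O(len(keys))."""
    return any(k in label_to_pos for k in keys)


# Toy axis: sorted integer labels, much smaller than the 4.8M rows in the question.
axis_labels = list(range(1, 200001))
label_to_pos = {label: pos for pos, label in enumerate(axis_labels)}

keys = [150000]
same_answer = (any_present_full_scan(keys, axis_labels)
               == any_present_prebuilt(keys, label_to_pos))
print("same answer:", same_answer)

slow = timeit.timeit(lambda: any_present_full_scan(keys, axis_labels), number=5)
fast = timeit.timeit(lambda: any_present_prebuilt(keys, label_to_pos), number=5)
print("full scan: %.6f s, prebuilt lookup: %.6f s" % (slow, fast))
```

The first function's cost grows with the length of the axis no matter how short the key list is, which is consistent with the question's observation that .loc[list] takes about the same time regardless of the length of the list.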
