Pandas scalar value getting and setting: ix or iat?


Question

I'm trying to figure out when to use the different selection methods on a pandas DataFrame. In particular, I want to access scalar values. I often hear ix being recommended in general, but the pandas documentation recommends at and iat for fast scalar value access:

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures.
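As a minimal sketch of the two accessors the documentation refers to: at takes row and column labels, while iat takes integer positions. The small DataFrame and the variable names below are made up for illustration and are not part of the original question.

```python
import numpy as np
import pandas as pd

# Illustrative 3x2 frame with the default integer row index:
#      A    B
# 0  0.0  1.0
# 1  2.0  3.0
# 2  4.0  5.0
df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["A", "B"])

# Label-based scalar access: row label 1, column label "A".
value_by_label = df.at[1, "A"]

# Position-based scalar access: row position 1, column position 0.
value_by_position = df.iat[1, 0]

# Both accessors can also assign a single cell in place.
df.at[2, "B"] = 10.0
df.iat[0, 1] = 20.0
```

With the default integer index, row labels and row positions happen to coincide, which is why both reads above address the same cell.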

So I would assume iat should be faster for both getting and setting individual cells. However, after some tests, we found that ix is comparable or faster for reading cells, while iat is much faster for assigning values to cells.

Is this behavior documented anywhere? Is it always the case, and why does it happen? Does it have something to do with returning a view versus a copy? I would appreciate it if someone could shed some light on this question and explain what is recommended for getting and setting cell values, and why.

Here are some tests using pandas (version 0.15.2).

Just to make sure that this behavior is not a bug in this version, I also tested it on 0.11.0. I do not provide the results, but the trend is exactly the same: ix is much faster for getting, and iat for setting, individual cells.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(1000,2),columns = ['A','B'])
idx = 0

timeit for i in range(1000): df.ix[i,'A'] = 1
timeit for i in range(1000): df.iat[i,idx] = 2

>> 10 loops, best of 3: 92.6 ms per loop
>> 10 loops, best of 3: 21.7 ms per loop

timeit for i in range(1000): tmp = df.ix[i,'A'] 
timeit for i in range(1000): tmp = df.iat[i,idx] 

>> 100 loops, best of 3: 5.31 ms per loop
>> 10 loops, best of 3: 19.4 ms per loop


Recommended answer

Pandas does some pretty interesting things with the indexing classes. I don't think I can describe a simple rule for which one to use, but I can give some insight into the implementation.

DataFrame#ix is an _IXIndexer, which does not declare its own __getitem__ or __setitem__. These two methods matter because they control how values are accessed in Pandas. Since _IXIndexer does not declare them, the ones from its superclass _NDFrameIndexer are used instead.
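The fallback described here is ordinary Python method resolution. The sketch below uses two hypothetical stand-in classes (BaseIndexer and IxLikeIndexer are invented names, not pandas classes) to show that a subclass declaring neither method ends up running its parent's __getitem__ and __setitem__.

```python
class BaseIndexer:
    """Hypothetical stand-in for a shared indexer base class."""

    def __init__(self, data):
        # data maps a column name to a list of row values.
        self.data = data

    def __getitem__(self, key):
        row, col = key
        return self.data[col][row]

    def __setitem__(self, key, value):
        row, col = key
        self.data[col][row] = value


class IxLikeIndexer(BaseIndexer):
    """Declares neither method, so item access resolves to BaseIndexer."""


indexer = IxLikeIndexer({"A": [1.0, 2.0], "B": [3.0, 4.0]})
first = indexer[0, "A"]   # runs BaseIndexer.__getitem__
indexer[1, "B"] = 9.0     # runs BaseIndexer.__setitem__

# The subclass has no __getitem__ of its own; the attribute is inherited.
inherits_getitem = "__getitem__" not in vars(IxLikeIndexer)
```

So the cost of an access through such a subclass is whatever the parent's implementation costs, which is why the answer goes on to look at the parent classes.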

Digging further into _NDFrameIndexer's __getitem__ shows that it is relatively simple and in some cases wraps the logic found in get_value. So __getitem__ is close to as fast as get_value in some scenarios.

_NDFrameIndexer's __setitem__ is a different story. At first it looks simple, but the second method it calls is _setitem_with_indexer, which does a considerable amount of work in most scenarios.

This information suggests that calls to get values using ix are limited by get_value in the best case, while calls to set values using ix would take a core committer to explain.

Now for DataFrame#iat, which is an _iAtIndexer. It also doesn't declare its own __getitem__ or __setitem__, and therefore falls back to the implementation in its superclass, _ScalarAccessIndexer.

_ScalarAccessIndexer has a simple __getitem__ implementation, but it requires a loop in order to convert the key into the proper format. The additional loop adds some extra processing time before calling get_value.

_ScalarAccessIndexer also has a fairly simple __setitem__ implementation, which converts the key into the parameters set_value requires before setting the value.

This information suggests that calls to get values using iat are limited by get_value as well as by a for loop. Setting values with iat is primarily limited by the call to set_value. So getting values with iat has a bit of overhead, while setting them has a smaller overhead.
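To make the shape of that code path concrete, here is a toy model, not a copy of the pandas source. ToyFrame and ToyScalarAccessor are hypothetical names; the accessor checks each key component in a loop and then delegates to get_value or set_value methods on the frame.

```python
class ToyFrame:
    """Hypothetical minimal frame: columns stored as lists, keyed by name."""

    def __init__(self, columns):
        self.columns = columns
        self.names = list(columns)

    def get_value(self, row, col_pos):
        return self.columns[self.names[col_pos]][row]

    def set_value(self, row, col_pos, value):
        self.columns[self.names[col_pos]][row] = value


class ToyScalarAccessor:
    """Toy positional accessor that validates the key, then delegates."""

    def __init__(self, frame):
        self.frame = frame

    def _convert_key(self, key):
        converted = []
        for part in key:  # the extra per-access loop discussed above
            if not isinstance(part, int):
                raise ValueError("positional access needs integer keys")
            converted.append(part)
        return tuple(converted)

    def __getitem__(self, key):
        return self.frame.get_value(*self._convert_key(key))

    def __setitem__(self, key, value):
        self.frame.set_value(*self._convert_key(key), value)


frame = ToyFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
accessor = ToyScalarAccessor(frame)
read_back = accessor[1, 0]   # row 1 of column "A"
accessor[0, 1] = 7.0         # row 0 of column "B"
```

In this toy both directions pay for the same small loop, so any difference between getting and setting would come from the underlying get_value and set_value calls.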

TL;DR

Based on the documentation, I believe you are using the correct accessor for an Int64Index, but I don't think that means it is the fastest. The best performance can be had by calling get_value and set_value directly, but they require deeper knowledge of how Pandas DataFrames are implemented.

Note

It is worth noting that the Pandas documentation mentions that get_value and set_value are deprecated, which I believe was meant to refer to iget_value instead.

Example

To show the difference in performance between a few indexers (including calling get_value and set_value directly), I made this script:

example.py

import timeit


def print_index_speed(stmnt_name, stmnt):
    """
    Repeatedly run the statement provided then repeat the process and take the
    minimum execution time.
    """
    setup = """
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(1000,2),columns = ['A','B'])
idx = 0
    """

    minimum_execution_time = min(
        timeit.Timer(stmnt, setup=setup).repeat(5, 10))

    print("{stmnt_name}: {time}".format(
        stmnt_name=stmnt_name,
        time=round(minimum_execution_time, 5)))

print_index_speed("set ix", "for i in range(1000): df.ix[i, 'A'] = 1")
print_index_speed("set at", "for i in range(1000): df.at[i, 'A'] = 2")
print_index_speed("set iat", "for i in range(1000): df.iat[i, idx] = 3")
print_index_speed("set loc", "for i in range(1000): df.loc[i, 'A'] = 4")
print_index_speed("set iloc", "for i in range(1000): df.iloc[i, idx] = 5")
print_index_speed(
    "set_value scalar",
    "for i in range(1000): df.set_value(i, idx, 6, True)")
print_index_speed(
    "set_value label",
    "for i in range(1000): df.set_value(i, 'A', 7, False)")

print_index_speed("get ix", "for i in range(1000): tmp = df.ix[i, 'A']")
print_index_speed("get at", "for i in range(1000): tmp = df.at[i, 'A']")
print_index_speed("get iat", "for i in range(1000): tmp = df.iat[i, idx]")
print_index_speed("get loc", "for i in range(1000): tmp = df.loc[i, 'A']")
print_index_speed("get iloc", "for i in range(1000): tmp = df.iloc[i, idx]")
print_index_speed(
    "get_value scalar",
    "for i in range(1000): tmp = df.get_value(i, idx, True)")
print_index_speed(
    "get_value label",
    "for i in range(1000): tmp = df.get_value(i, 'A', False)")

Output:

set ix: 0.9918
set at: 0.06801
set iat: 0.08606
set loc: 1.04173
set iloc: 1.0021
set_value scalar: 0.0452
set_value label: 0.03516
get ix: 0.04827
get at: 0.06889
get iat: 0.07813
get loc: 0.8966
get iloc: 0.87484
get_value scalar: 0.04994
get_value label: 0.03111
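The script above targets pandas 0.15.2. In pandas 1.0 and later, ix, get_value and set_value no longer exist, so it will not run there as written. A hedged sketch of the same benchmark restricted to at, iat, loc and iloc might look like the following; the names STATEMENTS and time_accessors are invented here, and the timings will differ from the figures above.

```python
import timeit

import numpy as np
import pandas as pd

# One statement per accessor, mirroring the loops in the original script.
STATEMENTS = {
    "set at": "for i in range(n): df.at[i, 'A'] = 2.0",
    "set iat": "for i in range(n): df.iat[i, 0] = 3.0",
    "set loc": "for i in range(n): df.loc[i, 'A'] = 4.0",
    "set iloc": "for i in range(n): df.iloc[i, 0] = 5.0",
    "get at": "for i in range(n): tmp = df.at[i, 'A']",
    "get iat": "for i in range(n): tmp = df.iat[i, 0]",
    "get loc": "for i in range(n): tmp = df.loc[i, 'A']",
    "get iloc": "for i in range(n): tmp = df.iloc[i, 0]",
}


def time_accessors(n=1000, repeat=3, number=5):
    """Return the minimum wall time in seconds for each statement."""
    df = pd.DataFrame(np.random.rand(n, 2), columns=["A", "B"])
    namespace = {"df": df, "n": n}
    return {
        name: min(timeit.repeat(stmt, repeat=repeat, number=number,
                                globals=namespace))
        for name, stmt in STATEMENTS.items()
    }


if __name__ == "__main__":
    for name, seconds in time_accessors().items():
        print("{name}: {time}".format(name=name, time=round(seconds, 5)))
```

Taking the minimum over repeats follows the original script's approach, since the fastest run is the one least disturbed by other activity on the machine.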
