Pandas scalar value getting and setting: ix or iat?
Question
I'm trying to figure out when to use the different selection methods on a pandas DataFrame. In particular, I'm interested in accessing scalar values. I often hear ix being generally recommended, but the pandas documentation recommends at and iat for fast scalar value access:
Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures.
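As a minimal sketch of what the quoted passage describes, the snippet below reads and writes single cells with at (label-based) and iat (position-based). The frame and its values are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Small illustrative frame; the values are arbitrary.
df = pd.DataFrame({"A": [0.1, 0.2, 0.3], "B": [1.0, 2.0, 3.0]})

# Label-based scalar access: row label 1, column label "A".
by_label = df.at[1, "A"]

# Position-based scalar access: row position 1, column position 0.
by_position = df.iat[1, 0]

# Both accessors also support assignment to a single cell.
df.at[2, "B"] = 30.0
df.iat[0, 0] = 10.0
```

With the default integer index, the row label and the row position coincide, which is why both reads return the same cell here.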
So I would assume iat should be faster for both getting and setting individual cells. However, after some tests, we found that ix is comparable or faster for reading cells, while iat is much faster for assigning values to cells.
Is this behavior documented anywhere? Is it always the case, and why does it happen? Does it have something to do with returning a view or a copy? I would appreciate it if someone could shed some light on this question and explain what is recommended for getting and setting cell values, and why.
Here are some tests using pandas (version 0.15.2).
Just to make sure that this behavior is not a bug in this version, I also tested it on 0.11.0. I do not provide the results, but the trend is exactly the same: ix is much faster for getting, and iat for setting, individual cells.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(1000,2),columns = ['A','B'])
idx = 0
timeit for i in range(1000): df.ix[i,'A'] = 1
timeit for i in range(1000): df.iat[i,idx] = 2
>> 10 loops, best of 3: 92.6 ms per loop
>> 10 loops, best of 3: 21.7 ms per loop
timeit for i in range(1000): tmp = df.ix[i,'A']
timeit for i in range(1000): tmp = df.iat[i,idx]
>> 100 loops, best of 3: 5.31 ms per loop
>> 10 loops, best of 3: 19.4 ms per loop
Recommended answer
Pandas does some pretty interesting things with the indexing classes. I don't think I can describe a simple way to know which one to use, but I can give some insight into the implementation.
DataFrame#ix is an _IXIndexer, which does not declare its own __getitem__ or __setitem__. These two methods are important because they control how values are accessed with Pandas. Since _IXIndexer does not declare these methods, those of its superclass _NDFrameIndexer are used instead.
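The fallback described here is ordinary Python method resolution. The toy classes below are hypothetical stand-ins (they are not pandas code) and only show that a subclass which declares neither method ends up running its superclass's __getitem__ and __setitem__.

```python
class BaseIndexer:
    """Hypothetical stand-in for a base indexer class (not pandas code)."""

    def __init__(self, data):
        self.data = data  # dict: column name -> list of values

    def __getitem__(self, key):
        row, col = key
        return self.data[col][row]

    def __setitem__(self, key, value):
        row, col = key
        self.data[col][row] = value


class ChildIndexer(BaseIndexer):
    """Declares neither __getitem__ nor __setitem__."""


indexer = ChildIndexer({"A": [1, 2, 3]})
indexer[0, "A"] = 10      # resolved to BaseIndexer.__setitem__
value = indexer[0, "A"]   # resolved to BaseIndexer.__getitem__

# The child class's own namespace contains neither method.
declares_getitem = "__getitem__" in vars(ChildIndexer)
```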
Further digging into _NDFrameIndexer's __getitem__ shows that it is relatively simple and in some cases wraps the logic found in get_value. In those scenarios __getitem__ is close to as fast as get_value.
_NDFrameIndexer's __setitem__ is a different story. At first it looks simple, but the second method it calls is _setitem_with_indexer, which does a considerable amount of work in most scenarios.
This suggests that getting values with ix is limited by get_value in the best case, while setting values with ix would take a core committer to explain.
Now for DataFrame#iat, which is an _iAtIndexer. It also doesn't declare its own __getitem__ or __setitem__, and therefore falls back to the implementation in its superclass _ScalarAccessIndexer.
_ScalarAccessIndexer has a simple __getitem__ implementation, but it requires a loop in order to convert the key into the proper format. That additional loop adds some extra processing time before get_value is called.
_ScalarAccessIndexer also has a fairly simple __setitem__ implementation, which converts the key into the parameters set_value requires before setting the value.
This suggests that getting values with iat is limited by get_value as well as a for loop. Setting values with iat is primarily limited by the call to set_value. So getting values with iat has a bit of overhead, while setting them has a smaller overhead.
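The shape of that scalar-access path can be sketched with a toy model. Everything below (ToyFrame, ToyScalarAccessor, _convert_key) is a hypothetical illustration of "convert the key in a loop, then delegate to get_value or set_value"; it is not the pandas implementation, and it deliberately uses the same key conversion for both reads and writes.

```python
class ToyFrame:
    """Hypothetical column store with direct scalar getters and setters."""

    def __init__(self, columns):
        self.columns = columns      # dict: column name -> list of values
        self.names = list(columns)  # column names by position

    def get_value(self, row, col):
        return self.columns[self.names[col]][row]

    def set_value(self, row, col, value):
        self.columns[self.names[col]][row] = value


class ToyScalarAccessor:
    """Thin wrapper: validate the key, then delegate to the frame."""

    def __init__(self, frame):
        self.frame = frame

    def _convert_key(self, key):
        # The loop over the key parts is the extra per-call work.
        converted = []
        for part in key:
            if not isinstance(part, int):
                raise ValueError("positional access requires integer keys")
            converted.append(part)
        return tuple(converted)

    def __getitem__(self, key):
        return self.frame.get_value(*self._convert_key(key))

    def __setitem__(self, key, value):
        self.frame.set_value(*self._convert_key(key), value)


frame = ToyFrame({"A": [1, 2], "B": [3, 4]})
iat = ToyScalarAccessor(frame)
read_back = iat[1, 1]   # row 1 of the second column
iat[0, 0] = 9           # write row 0 of the first column
```

In this model the cost of an access is the key conversion plus one direct call, which mirrors the reasoning in the answer without claiming anything about actual timings.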
TL;DR
I believe you are using the correct accessor for an Int64Index index according to the documentation, but I don't think that means it is the fastest. The best performance comes from calling get_value and set_value directly, but they require deeper knowledge of how Pandas DataFrames are implemented.
Note
It is worth noting that the Pandas documentation mentions that get_value and set_value are deprecated, which I believe was meant to refer to iget_value instead.
Example
In order to show the difference in performance between a few indexers (including calling get_value and set_value directly), I made this script:
example.py:
import timeit


def print_index_speed(stmnt_name, stmnt):
    """
    Repeatedly run the statement provided then repeat the process and take the
    minimum execution time.
    """
    setup = """
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(1000,2),columns = ['A','B'])
idx = 0
"""
    minimum_execution_time = min(
        timeit.Timer(stmnt, setup=setup).repeat(5, 10))
    print("{stmnt_name}: {time}".format(
        stmnt_name=stmnt_name,
        time=round(minimum_execution_time, 5)))


print_index_speed("set ix", "for i in range(1000): df.ix[i, 'A'] = 1")
print_index_speed("set at", "for i in range(1000): df.at[i, 'A'] = 2")
print_index_speed("set iat", "for i in range(1000): df.iat[i, idx] = 3")
print_index_speed("set loc", "for i in range(1000): df.loc[i, 'A'] = 4")
print_index_speed("set iloc", "for i in range(1000): df.iloc[i, idx] = 5")
print_index_speed(
    "set_value scalar",
    "for i in range(1000): df.set_value(i, idx, 6, True)")
print_index_speed(
    "set_value label",
    "for i in range(1000): df.set_value(i, 'A', 7, False)")

print_index_speed("get ix", "for i in range(1000): tmp = df.ix[i, 'A']")
print_index_speed("get at", "for i in range(1000): tmp = df.at[i, 'A']")
print_index_speed("get iat", "for i in range(1000): tmp = df.iat[i, idx]")
print_index_speed("get loc", "for i in range(1000): tmp = df.loc[i, 'A']")
print_index_speed("get iloc", "for i in range(1000): tmp = df.iloc[i, idx]")
print_index_speed(
    "get_value scalar",
    "for i in range(1000): tmp = df.get_value(i, idx, True)")
print_index_speed(
    "get_value label",
    "for i in range(1000): tmp = df.get_value(i, 'A', False)")
Output:
set ix: 0.9918
set at: 0.06801
set iat: 0.08606
set loc: 1.04173
set iloc: 1.0021
set_value scalar: 0.0452
set_value label: 0.03516
get ix: 0.04827
get at: 0.06889
get iat: 0.07813
get loc: 0.8966
get iloc: 0.87484
get_value scalar: 0.04994
get_value label: 0.03111
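The script above calls ix, get_value, and set_value, none of which exist in pandas 1.0 or later. As a rough sketch for recent versions, the same comparison can be limited to at, iat, loc, and iloc. The repeat and number values below are arbitrary choices to keep the run short, and absolute timings will differ by machine and pandas version, so no expected output is shown.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 2), columns=["A", "B"])
idx = 0

# Only accessors that are still available in current pandas releases.
statements = {
    "set at": "for i in range(1000): df.at[i, 'A'] = 2",
    "set iat": "for i in range(1000): df.iat[i, idx] = 3",
    "set loc": "for i in range(1000): df.loc[i, 'A'] = 4",
    "set iloc": "for i in range(1000): df.iloc[i, idx] = 5",
    "get at": "for i in range(1000): tmp = df.at[i, 'A']",
    "get iat": "for i in range(1000): tmp = df.iat[i, idx]",
    "get loc": "for i in range(1000): tmp = df.loc[i, 'A']",
    "get iloc": "for i in range(1000): tmp = df.iloc[i, idx]",
}

results = {}
for name, stmt in statements.items():
    # Run each statement several times and keep the minimum, as in the
    # original script.
    timer = timeit.Timer(stmt, globals={"df": df, "idx": idx})
    results[name] = min(timer.repeat(repeat=3, number=3))

for name, seconds in results.items():
    print("{}: {:.5f}".format(name, seconds))
```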