What is the point of indexing in pandas?
Question
Can someone point me to a link or explain the benefits of indexing in pandas? I routinely work with tables and join them on columns, and the join/merge process seems to re-index things anyway, so setting an index feels like extra work that I don't think I need.
Any thoughts on best practices around indexing?
Recommended answer
Like a dict, a DataFrame's index is backed by a hash table. Looking up rows based on index values is like looking up dict values based on a key.
In contrast, the values in a column are like values in a list.
Looking up rows based on index values is faster than looking up rows based on column values.
For example, consider
import numpy as np
import pandas as pd
df = pd.DataFrame({'foo': np.random.random(10000), 'index': range(10000)})
df_with_index = df.set_index(['index'])
Here is how you could look up any row where the df['index'] column equals 999. Pandas has to loop through every value in the column to find the ones equal to 999.
df[df['index'] == 999]
#           foo  index
# 999  0.375489    999
Here is how you could look up any row where the index equals 999. With an index, Pandas uses the hash value to find the rows:
df_with_index.loc[999]
# foo    0.375489
# Name: 999, dtype: float64
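As a self-contained sketch of the two lookups above, the snippet below builds a similar frame with a seeded random generator (the seed and the variable names `by_column`, `by_index`, and `position` are my own choices, not part of the original answer) and checks that both approaches reach the same row. It also calls `Index.get_loc`, which maps a label to its integer position.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (an assumption of this sketch).
rng = np.random.default_rng(0)
df = pd.DataFrame({"foo": rng.random(10_000), "index": range(10_000)})
df_with_index = df.set_index("index")

# Column-based lookup: compares every value in the column and
# returns a one-row DataFrame.
by_column = df[df["index"] == 999]

# Index-based lookup: resolves the label 999 through the index and
# returns the matching row as a Series.
by_index = df_with_index.loc[999]

# get_loc translates a label into its integer position in the index.
position = df_with_index.index.get_loc(999)

print(by_column)
print(by_index)
print(position)
```

Note that the column-based lookup returns a DataFrame, while `.loc` with a single unique label returns a Series, so the two results hold the same values in different shapes.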
Looking up rows by index is much faster than looking up rows by column value:
In [254]: %timeit df[df['index'] == 999]
1000 loops, best of 3: 368 µs per loop
In [255]: %timeit df_with_index.loc[999]
10000 loops, best of 3: 57.7 µs per loop
Note, however, that it takes time to build the index:
In [220]: %timeit df.set_index(['index'])
1000 loops, best of 3: 330 µs per loop
So having the index is only advantageous when you have many lookups of this type to perform.
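To get a rough feel for that trade-off outside IPython, here is a minimal sketch using the standard-library `timeit` module. The helper `seconds_per_call` and the break-even estimate are illustrative assumptions of mine, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n_rows = 10_000
df = pd.DataFrame({"foo": np.random.random(n_rows), "index": range(n_rows)})
df_with_index = df.set_index("index")


def seconds_per_call(func, number=200):
    """Average wall-clock seconds for one call of func (hypothetical helper)."""
    return timeit.timeit(func, number=number) / number


# One-off cost of creating the indexed frame.
build_time = seconds_per_call(lambda: df.set_index("index"))
# Per-lookup cost of scanning the column with a boolean mask.
column_time = seconds_per_call(lambda: df[df["index"] == 999])
# Per-lookup cost of a label lookup through the index.
index_time = seconds_per_call(lambda: df_with_index.loc[999])

print(f"set_index:     {build_time * 1e6:.1f} µs")
print(f"column lookup: {column_time * 1e6:.1f} µs")
print(f"index lookup:  {index_time * 1e6:.1f} µs")

# Rough break-even: how many lookups it takes to repay the build cost.
saved_per_lookup = column_time - index_time
if saved_per_lookup > 0:
    print(f"break-even after about {build_time / saved_per_lookup:.1f} lookups")
```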
Sometimes the index plays a role in reshaping the DataFrame. Many functions, such as set_index, stack, unstack, pivot, pivot_table, melt, lreshape, and crosstab, all use or manipulate the index.
Sometimes we want the DataFrame in a different shape for presentation purposes, or for join, merge, or groupby operations. (As you note, joining can also be done based on column values, but joining based on the index is faster.) Behind the scenes, join, merge, and groupby take advantage of fast index lookups when possible.
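The toy example below sketches both roles: a column-based merge next to an index-based join, and a reshape with unstack that pivots one index level into columns. The frames and names (`left`, `right`, `long_df`, `wide`) are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10, 20, 40]})

# Column-based merge: the join keys live in ordinary columns.
merged = left.merge(right, on="key", how="inner")

# Index-based join: the join keys live in the index of both frames.
joined = left.set_index("key").join(right.set_index("key"), how="inner")

# Reshaping through the index: build a two-level index, then move the
# "col" level into the columns with unstack.
long_df = pd.DataFrame(
    {
        "row": ["r1", "r1", "r2", "r2"],
        "col": ["c1", "c2", "c1", "c2"],
        "val": [1, 2, 3, 4],
    }
)
wide = long_df.set_index(["row", "col"])["val"].unstack("col")

print(merged)
print(joined)
print(wide)
```

Both the merge and the join keep only the keys "a" and "b". The difference is where the key ends up: in a column for `merged`, in the index for `joined`.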
Time series have resample, asfreq, and interpolate methods whose underlying implementations take advantage of fast index lookups too.
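As a small illustration of those three methods, the sketch below uses a made-up series with a DatetimeIndex that is missing one day. The dates and values are assumptions chosen to keep the arithmetic easy to follow.

```python
import pandas as pd

# Three observations with a gap on 2024-01-03.
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"])
s = pd.Series([1.0, 2.0, 4.0], index=idx)

# asfreq conforms the series to a daily frequency; the missing day
# appears as NaN.
daily = s.asfreq("D")

# interpolate fills the NaN linearly between its neighbours.
filled = daily.interpolate()

# resample groups the observations into two-day bins and sums each bin.
two_day = s.resample("2D").sum()

print(daily)
print(filled)
print(two_day)
```

All three methods work from the DatetimeIndex, so they only make sense once the timestamps have been moved out of an ordinary column and into the index.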
So in the end, I think the index is useful, and shows up in so many functions, because it enables fast hash lookups.