What is the point of indexing in pandas?

Question

Can someone point me to a link or provide an explanation of the benefits of indexing in pandas? I routinely deal with tables and join them based on columns, and this joining/merging process seems to re-index things anyway, so it's a bit cumbersome to apply index criteria considering I don't think I need to.

Any thoughts on best-practices around indexing?

Recommended answer

Like a dict, a DataFrame's index is backed by a hash table. Looking up rows based on index values is like looking up dict values based on a key.

In contrast, the values in a column are like values in a list.

Looking up rows based on index values is faster than looking up rows based on column values.
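
The dict-versus-list analogy can be made concrete in plain Python. This is only a sketch of the analogy, not of how pandas is implemented; the names `values`, `by_key`, and `target` are made up for the illustration.

```python
# Illustration of the analogy only; this is not pandas internals.
values = list(range(10000))                     # like a column: found by scanning
by_key = {v: i for i, v in enumerate(values)}   # like an index: found by hashing

target = 999

# List-style lookup: compare elements one by one until a match is found.
pos_scan = next(i for i, v in enumerate(values) if v == target)

# Dict-style lookup: hash the key and go straight to its entry.
pos_hash = by_key[target]

print(pos_scan, pos_hash)
```

Both lookups find the same position; the difference is that the scan does work proportional to the length of the list, while the hash lookup does roughly constant work once the dict has been built.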

For example, consider:

import numpy as np
import pandas as pd
df = pd.DataFrame({'foo': np.random.random(10000), 'index': range(10000)})
df_with_index = df.set_index(['index'])

Here is how you could look up any row where the df['index'] column equals 999. Pandas has to loop through every value in the column to find the ones equal to 999.

df[df['index'] == 999]

#           foo  index
# 999  0.375489    999

Here is how you could look up any row where the index equals 999. With an index, Pandas uses the hash value to find the rows:

df_with_index.loc[999]
# foo    0.375489
# Name: 999, dtype: float64

Looking up rows by index is much faster than looking up rows by column value:

In [254]: %timeit df[df['index'] == 999]
1000 loops, best of 3: 368 µs per loop

In [255]: %timeit df_with_index.loc[999]
10000 loops, best of 3: 57.7 µs per loop
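
Outside IPython, a comparable measurement can be sketched with the standard-library `timeit` module. The repetition count (`number=200`) and variable names here are arbitrary choices for the sketch, and the absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 10000
df = pd.DataFrame({'foo': np.random.random(n), 'index': range(n)})
df_with_index = df.set_index('index')

# Both approaches select the same row; only the access path differs.
by_column = df[df['index'] == 999]      # boolean mask over the whole column
by_index = df_with_index.loc[999]       # label lookup through the index

# Time each operation; results vary by machine and pandas version.
reps = 200
t_column = timeit.timeit(lambda: df[df['index'] == 999], number=reps)
t_index = timeit.timeit(lambda: df_with_index.loc[999], number=reps)
t_build = timeit.timeit(lambda: df.set_index('index'), number=reps)

print(f"column lookup: {t_column / reps * 1e6:.1f} µs per call")
print(f"index lookup:  {t_index / reps * 1e6:.1f} µs per call")
print(f"set_index:     {t_build / reps * 1e6:.1f} µs per call")
```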

Note, however, that it takes time to build the index:

In [220]: %timeit df.set_index(['index'])
1000 loops, best of 3: 330 µs per loop

So having the index is only advantageous when you have many lookups of this type to perform.

Sometimes the index plays a role in reshaping the DataFrame. Many functions, such as set_index, stack, unstack, pivot, pivot_table, melt, lreshape, and crosstab, all use or manipulate the index. Sometimes we want the DataFrame in a different shape for presentation purposes, or for join, merge or groupby operations. (As you note, joining can also be done based on column values, but joining based on the index is faster.) Behind the scenes, join, merge and groupby take advantage of fast index lookups when possible.
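
As a small sketch of these two roles, the example below joins two frames once on a column and once on the index, then reshapes a long table with `set_index` and `unstack`. The frames and column names (`left`, `right`, `key`, `city`, `year`, `sales`) are invented for the illustration, and it makes no claim about relative speed.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'y': [10, 20, 40]})

# Column-based: merge matches rows on the 'key' column.
merged = left.merge(right, on='key', how='inner')

# Index-based: move 'key' into the index of both frames, then join on it.
joined = left.set_index('key').join(right.set_index('key'), how='inner')

# Reshaping: unstack moves one index level into the columns.
long_form = pd.DataFrame({
    'city': ['NY', 'NY', 'LA', 'LA'],
    'year': [2020, 2021, 2020, 2021],
    'sales': [5, 6, 7, 8],
})
wide = long_form.set_index(['city', 'year'])['sales'].unstack('year')

print(merged)
print(joined)
print(wide)
```

Both the merge and the join keep only the keys present in both frames; the difference is whether the key lives in a column or in the index.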

Time series have resample, asfreq and interpolate methods whose underlying implementations take advantage of fast index lookups too.
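
A minimal sketch of these three methods on a Series with a DatetimeIndex follows. The dates and values are made up, and the example only shows what the methods do with the index, not how they are implemented.

```python
import pandas as pd

idx = pd.date_range('2024-01-01', periods=6, freq='D')
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# resample groups rows by their index timestamps, here into 2-day bins.
two_day = ts.resample('2D').sum()

# Remove one day to create a gap, then use asfreq to restore a regular
# daily index; the missing day reappears as NaN.
gappy = ts.drop(idx[2])
regular = gappy.asfreq('D')

# interpolate fills the NaN linearly from its neighbours.
filled = regular.interpolate()

print(two_day)
print(regular)
print(filled)
```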

So in the end, I think the origin of the index's usefulness, why it shows up in so many functions, is due to its ability to perform fast hash lookups.
