Is searchsorted faster than get_loc to find label location in a DataFrame Index?

Question

I need to find the integer location of a label in a pandas Index. I know I can use the get_loc method, but then I discovered searchsorted. Should I use the latter for a speed improvement, given that I need to look up thousands of labels?

Recommended answer

It depends on your use case. Using @ayhan's example:

With get_loc, there is a big upfront cost: the hash table is built on the first lookup.

In [22]: idx = pd.Index(['R{0:07d}'.format(i) for i in range(10**7)])
In [23]: to_search = np.random.choice(idx, 10**5, replace=False)
In [24]: %time idx.get_loc(to_search[0])
Wall time: 1.57 s

However, subsequent lookups may be faster (not guaranteed; it depends on the data).

In [9]: %%time
   ...: for i in to_search:
   ...:     idx.get_loc(i)
Wall time: 200 ms

In [10]: %%time
    ...: for i in to_search:
    ...:     np.searchsorted(idx, i)
Wall time: 486 ms
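Both loops above look up one label per call. As a rough sketch only (the index here is much smaller than in the timings so it runs quickly, and the names pos_hash and pos_bisect are just illustrative), the same positions can be requested in a single call with Index.get_indexer or Index.searchsorted. On a sorted, unique index holding every searched label, the two should agree:

```python
import numpy as np
import pandas as pd

# Smaller than the 10**7 labels used in the timings, so this runs quickly.
idx = pd.Index(['R{0:07d}'.format(i) for i in range(10**5)])
rng = np.random.default_rng(0)
to_search = rng.choice(idx.to_numpy(), 10**3, replace=False)

# Hash-based batch lookup; labels missing from the index would come back as -1.
pos_hash = idx.get_indexer(to_search)

# Binary-search batch lookup; only meaningful because idx is sorted.
pos_bisect = idx.searchsorted(to_search)

# Both approaches point at the same positions for this index.
assert (pos_hash == pos_bisect).all()
assert (idx[pos_hash] == to_search).all()
```

Whether either batch call beats the loops on your data is something to measure with %timeit rather than assume.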

Also, as Jeff noted, get_loc is guaranteed to always work, whereas searchsorted requires a monotonic (sorted) index and does not check for it.
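To illustrate the monotonicity caveat, here is a minimal sketch. The helper position_of is hypothetical (it is not part of pandas); it only uses binary search when the index reports itself as sorted, and falls back to get_loc otherwise. The example assumes unique labels:

```python
import pandas as pd

def position_of(index, label):
    """Hypothetical helper: use binary search only when the index is sorted."""
    if index.is_monotonic_increasing:
        pos = index.searchsorted(label)
        # searchsorted returns an insertion point, so confirm the label is there.
        if pos < len(index) and index[pos] == label:
            return int(pos)
        raise KeyError(label)
    # Unsorted index: searchsorted would silently return a meaningless position.
    return index.get_loc(label)

unsorted_idx = pd.Index(['c', 'a', 'd', 'b'])
sorted_idx = unsorted_idx.sort_values()

assert not unsorted_idx.is_monotonic_increasing
assert position_of(unsorted_idx, 'a') == 1  # falls back to get_loc
assert position_of(sorted_idx, 'b') == 1    # binary search on ['a', 'b', 'c', 'd']
```

The is_monotonic_increasing check is cached on the index after the first access, so it should not be a meaningful cost when repeated across thousands of lookups.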
