Is a pandas DataFrame search linear time or constant time?

Views: 24 Published: 2021/12/20 14:01:09 Tags: python pandas search dataframe time-complexity


Question

I have a DataFrame object df with over 15,000 rows, like:

anime_id          name              genre    rating
1234      Kimi no nawa    Romance, Comedy     9.31
5678       Stiens;Gate             Sci-fi     8.92

I am trying to find the row with a particular anime_id.

a_id = "5678"
temp = (df.query("anime_id == "+a_id).genre)

I just want to know whether this search is done in constant time (like a dictionary lookup) or in linear time (like searching a list).

Recommended answer

This is a very interesting question!

I think it depends on the following aspects:

accessing a single row by index (index is sorted and unique) should have runtime O(m) where m << n_rows
accessing a single row by index (index is NOT unique and is NOT sorted) should have runtime O(n_rows)
accessing a single row by index (index is NOT unique and is sorted) should have runtime O(m) where m < n_rows
accessing row(s) by boolean indexing (independently of the index) should have runtime O(n_rows)
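
Applied to the question, the cases above suggest moving anime_id into the index before looking it up, instead of filtering on it as a column. The following is a minimal sketch, assuming anime_id holds unique integers; the two sample rows are copied from the question, and the names indexed, genre_scan and genre_lookup are only illustrative:

```python
import pandas as pd

# Sample rows copied from the question; anime_id is assumed to be an integer column.
df = pd.DataFrame({
    "anime_id": [1234, 5678],
    "name": ["Kimi no nawa", "Stiens;Gate"],
    "genre": ["Romance, Comedy", "Sci-fi"],
    "rating": [9.31, 8.92],
})

# Boolean-mask lookup: the comparison touches every row, so it scales with n_rows.
genre_scan = df.loc[df["anime_id"] == 5678, "genre"].iloc[0]

# Index-based lookup: move anime_id into the index first.
indexed = df.set_index("anime_id")

# These two properties decide which of the cases listed above applies.
index_is_unique = indexed.index.is_unique
index_is_sorted = indexed.index.is_monotonic_increasing

# With a unique index, .loc with a row label and a column name returns one value.
genre_lookup = indexed.loc[5678, "genre"]
```

Building the index has its own up-front cost, so this is likely to pay off mainly when many lookups are made against the same frame.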
Demo:
Index is sorted and unique:
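# Assumes these imports were run earlier in the session:
# import random; import numpy as np; import pandas as pd
In [49]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'))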

In [50]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 27.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 331 µs per loop

In [51]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 275 µs per loop

In [52]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.84 ms per loop

In [53]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.96 ms per loop

Index is NOT sorted and NOT unique:
In [54]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5))

In [55]: %timeit df.loc[random.randint(0, 10**4)]
100 loops, best of 3: 12.3 ms per loop

In [56]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 µs per loop

In [57]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.78 ms per loop

In [58]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.93 ms per loop
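
Apart from timing, a non-unique index also changes what a label lookup returns: a label can match several rows, so df.loc[label] yields a DataFrame instead of a single row. A small sketch with made-up data (the frame dup is not from the answer):

```python
import pandas as pd

# Made-up data: label 1 appears twice in the index.
dup = pd.DataFrame({"a": [0.1, 0.2, 0.3]}, index=[1, 2, 1])

# Both rows labelled 1 are returned, in their original order.
rows_for_1 = dup.loc[1]
n_matches = len(rows_for_1)
```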

Index is NOT unique and is sorted:
In [64]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5)).sort_index()

In [65]: df.index.is_monotonic_increasing
Out[65]: True

In [66]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 9.70 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 478 µs per loop

In [67]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 µs per loop

In [68]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.81 ms per loop

In [69]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.95 ms per loop
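
The IPython timings above can also be reproduced as a plain script with the standard timeit module. This is a sketch rather than the answer's original code: the helper name time_loc, the fixed seed and the repeat count are assumptions, and absolute numbers will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption, for repeatability)
n_rows = 10**5
data = rng.random((n_rows, 6))
labels = rng.integers(0, 10_000, n_rows)  # 10**5 draws from 10**4 values: duplicates guaranteed
columns = list("abcdef")

# The three index layouts compared in the demo above.
frames = {
    "sorted, unique": pd.DataFrame(data, columns=columns),
    "unsorted, non-unique": pd.DataFrame(data, columns=columns, index=labels),
    "sorted, non-unique": pd.DataFrame(data, columns=columns, index=labels).sort_index(),
}


def time_loc(frame, label, number=50):
    """Hypothetical helper: average seconds per frame.loc[label] call."""
    return timeit.timeit(lambda: frame.loc[label], number=number) / number


label = int(labels[0])  # a label that exists in all three frames
results = {name: time_loc(frame, label) for name, frame in frames.items()}

for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.0f} us per lookup")
```

A single fixed label is used here so that every frame is asked for a key it actually contains; the original session drew a random label on each call.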

