在 Pandas DataFrame 中查找(仅)满足给定条件的第一行 [英] Find (only) the first row satisfying a given condition in pandas DataFrame
问题描述
我有一个数据框 df
,其中有一列很长的随机正整数:
I have a dataframe df
with a very long column of random positive integers:
df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})
我想确定列中第一个偶数的索引.一种方法是:
I want to determine the index of the first even number in the column. One way to do this is:
df[df.n % 2 == 0].iloc[0]
但这涉及很多操作(生成索引fn % 2 == 0
,对这些索引求值df
,最后取第一项)并且非常慢的.像这样的循环要快得多:
but this involves a lot of operations (generate the indices f.n % 2 == 0
, evaluate df
on those indices and finally take the first item) and is very slow. A loop like this is much quicker:
for j in range(len(df)):
if df.n.iloc[j] % 2 == 0:
break
还因为第一个结果可能在前几行.有没有类似性能的熊猫方法来做到这一点?谢谢.
also because the first result will be probably in the first few lines. Is there any pandas method for doing this with similar performance? Thank you.
注意:此条件(为偶数)只是一个示例.我正在寻找一种适用于任何类型的值条件的解决方案,即快速单行替代:
NOTE: This condition (to be an even number) is just an example. I'm looking for a solution that works for any kind of condition on the values, i.e., for a fast one-line alternative to:
df[ conditions on df.n ].iloc[0]
推荐答案
为了好玩,我决定尝试一些可能性.我拿了一个数据框:
I decided for fun to play with a few possibilities. I take a dataframe:
MAX = 10**7
df = pd.DataFrame({'n': range(MAX)})
(这次不是随机的.)对于 N
的某个值,我想找到 n >= N
的第一行.我计时了以下四个版本:
(not random this time.) I want to find the first row for which n >= N
for some value of N
. I have timed the following four versions:
def getfirst_pandas(condition, df):
return df[condition(df)].iloc[0]
def getfirst_iterrows_loop(condition, df):
for index, row in df.iterrows():
if condition(row):
return index, row
return None
def getfirst_for_loop(condition, df):
for j in range(len(df)):
if condition(df.iloc[j]):
break
return j
def getfirst_numpy_argmax(condition, df):
array = df.as_matrix()
imax = np.argmax(condition(array))
return df.index[imax]
with N
= 十的幂.当然,numpy(优化的 C)代码预计比 python 中的 for
循环更快,但我想看看 N
python 循环的哪些值仍然可以.
with N
= powers of ten. Of course the numpy (optimized C) code is expected to be faster than for
loops in python, but I wanted to see for which values of N
python loops are still okay.
我为台词计时:
getfirst_pandas(lambda x: x.n >= N, df)
getfirst_iterrows_loop(lambda x: x.n >= N, df)
getfirst_for_loop(lambda x: x.n >= N, df)
getfirst_numpy_argmax(lambda x: x >= N, df.n)
对于 N = 1, 10, 100, 1000, ...
.这是性能的日志对数图:
for N = 1, 10, 100, 1000, ...
. This is the log-log graph of the performance:
简单的 for
循环就可以了,只要第一个 True 位置"预期在开头,但随后就变得糟糕了.np.argmax
是最安全的解决方案.
The simple for
loop is ok as long as the "first True position" is expected to be at the beginning, but then it becomes bad. The np.argmax
is the safest solution.
从图中可以看出,pandas
和 argmax
的时间(几乎)保持不变,因为它们总是扫描整个数组.最好有一个没有的 np
或 pandas
方法.
As you can see from the graph, the time for pandas
and argmax
remain (almost) constant, because they always scan the whole array. It would be perfect to have a np
or pandas
method which doesn't.
这篇关于在 Pandas DataFrame 中查找(仅)满足给定条件的第一行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!