在 Pandas DataFrame 中查找(仅)满足给定条件的第一行 [英] Find (only) the first row satisfying a given condition in pandas DataFrame

查看:158
本文介绍了在 Pandas DataFrame 中查找(仅)满足给定条件的第一行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据框 df,其中有一列很长的随机正整数:

I have a dataframe df with a very long column of random positive integers:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})

我想确定列中第一个偶数的索引.一种方法是:

I want to determine the index of the first even number in the column. One way to do this is:

df[df.n % 2 == 0].iloc[0]

但这涉及很多操作(生成索引fn % 2 == 0,对这些索引求值df,最后取第一项)并且非常慢的.像这样的循环要快得多:

but this involves a lot of operations (generate the indices f.n % 2 == 0, evaluate df on those indices and finally take the first item) and is very slow. A loop like this is much quicker:

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break

还因为第一个结果可能在前几行.有没有类似性能的熊猫方法来做到这一点?谢谢.

also because the first result will be probably in the first few lines. Is there any pandas method for doing this with similar performance? Thank you.

注意:此条件(为偶数)只是一个示例.我正在寻找一种适用于任何类型的值条件的解决方案,即快速单行替代:

NOTE: This condition (to be an even number) is just an example. I'm looking for a solution that works for any kind of condition on the values, i.e., for a fast one-line alternative to:

df[ conditions on df.n ].iloc[0]

推荐答案

为了好玩,我决定尝试一些可能性.我拿了一个数据框:

I decided for fun to play with a few possibilities. I take a dataframe:

MAX = 10**7
df = pd.DataFrame({'n': range(MAX)})

(这次不是随机的.)对于 N 的某个值,我想找到 n >= N 的第一行.我计时了以下四个版本:

(not random this time.) I want to find the first row for which n >= N for some value of N. I have timed the following four versions:

def getfirst_pandas(condition, df):
    return df[condition(df)].iloc[0]

def getfirst_iterrows_loop(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None

def getfirst_for_loop(condition, df):
    for j in range(len(df)):
        if condition(df.iloc[j]):
            break
    return j

def getfirst_numpy_argmax(condition, df):
    array = df.as_matrix()
    imax  = np.argmax(condition(array))
    return df.index[imax]

with N = 十的幂.当然,numpy(优化的 C)代码预计比 python 中的 for 循环更快,但我想看看 N python 循环的哪些值仍然可以.

with N = powers of ten. Of course the numpy (optimized C) code is expected to be faster than for loops in python, but I wanted to see for which values of N python loops are still okay.

我为台词计时:

getfirst_pandas(lambda x: x.n >= N, df)
getfirst_iterrows_loop(lambda x: x.n >= N, df)
getfirst_for_loop(lambda x: x.n >= N, df)
getfirst_numpy_argmax(lambda x: x >= N, df.n)

对于 N = 1, 10, 100, 1000, ....这是性能的日志对数图:

for N = 1, 10, 100, 1000, .... This is the log-log graph of the performance:

图片

简单的 for 循环就可以了,只要第一个 True 位置"预期在开头,但随后就变得糟糕了.np.argmax 是最安全的解决方案.

The simple for loop is ok as long as the "first True position" is expected to be at the beginning, but then it becomes bad. The np.argmax is the safest solution.

从图中可以看出,pandasargmax 的时间(几乎)保持不变,因为它们总是扫描整个数组.最好有一个没有的 nppandas 方法.

As you can see from the graph, the time for pandas and argmax remain (almost) constant, because they always scan the whole array. It would be perfect to have a np or pandas method which doesn't.

这篇关于在 Pandas DataFrame 中查找(仅)满足给定条件的第一行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆