Efficiently find matching rows (based on content) in a pandas DataFrame


Question

I am writing some tests and I am using pandas DataFrames to house a large dataset of roughly 600,000 rows by 10 columns. I have extracted 10 random rows from the source data (using Stata), and now I want to write a test to see whether those rows are in the DataFrame in my test suite.

As a small example:

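import numpy as np
import pandas as pd

np.random.seed(2)
raw_data = pd.DataFrame(np.random.rand(5,3), columns=['one', 'two', 'three'])
random_sample = raw_data.iloc[1]  # .ix has been removed from pandas; .iloc selects by position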

Here raw_data is a 5 x 3 DataFrame of random floats, and random_sample is taken from it so that a match is guaranteed.

Currently I have written:

for idx, row in raw_data.iterrows():
    if random_sample.equals(row):
        print("match")
        break

This works, but it is very slow on the large dataset. Is there a more efficient way to check whether an entire row is contained in the DataFrame?

BTW: my example also needs to treat np.NaN values as equal to each other, which is why I am using the equals() method.
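To illustrate why a plain == comparison is not enough here, the following minimal sketch (the two small Series are made up for illustration) shows that element-wise comparison treats NaN as unequal to itself, while Series.equals treats NaN values in the same position as equal:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 3.0])

# Element-wise comparison: the NaN position compares as False,
# so the two Series do not look fully equal.
elementwise_all_equal = bool((a == b).all())

# Series.equals treats NaN values in the same position as equal.
equals_result = bool(a.equals(b))

print(elementwise_all_equal, equals_result)  # prints: False True
```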

Recommended answer

equals doesn't seem to broadcast, but we can always do the equality comparison manually:

>>> df = pd.DataFrame(np.random.rand(600000, 10))
>>> sample = df.iloc[-1]
>>> %timeit df[((df == sample) | (df.isnull() & sample.isnull())).all(axis=1)]
1 loops, best of 3: 231 ms per loop
>>> df[((df == sample) | (df.isnull() & sample.isnull())).all(axis=1)]
              0         1         2         3         4         5         6  \
599999  0.07832  0.064828  0.502513  0.851816  0.976464  0.761231  0.275242   

               7        8         9  
599999  0.426393  0.91632  0.569807  

For me this is much faster than the iterative version, which takes more than 30 s.
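As a small sketch of how the manual comparison behaves when NaN values are present, the mask can be wrapped in a helper and tried on a tiny frame. The name match_mask and the sample data are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

def match_mask(df, row):
    """Boolean Series marking the rows of df equal to row, counting NaN == NaN."""
    same_value = df == row                   # row's index is aligned with df's columns
    both_nan = df.isnull() & row.isnull()    # True where both sides are missing
    return (same_value | both_nan).all(axis=1)

small = pd.DataFrame({"one": [0.1, np.nan, 0.3],
                      "two": [1.0, 2.0, np.nan],
                      "three": [np.nan, 5.0, 6.0]})
target = small.iloc[1]                       # has a NaN in column "one"

mask = match_mask(small, target)
print(mask.tolist())                         # prints: [False, True, False]
```

In a test, mask.any() would then say whether the row is present at all.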

But since we have lots of rows and relatively few columns, we could instead loop over the columns and, in the typical case, cut down substantially on the number of rows left to look at after each column. For example, something like

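def finder(df, row):
    # Filter one column at a time; each pass keeps only the rows that still match,
    # so later columns are compared against far fewer rows.
    for col in df:
        # A value matches if it is equal, or if both sides are NaN.
        df = df.loc[(df[col] == row[col]) | (df[col].isnull() & pd.isnull(row[col]))]
    return df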

gives me

>>> %timeit finder(df, sample)
10 loops, best of 3: 35.2 ms per loop

That is roughly an order of magnitude faster, because only one row is left after the first column.
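For use in a test, the column-by-column filter can be turned into a simple presence check. This is only a sketch: contains_row is a hypothetical helper name, and the small frame is made-up data chosen to include a NaN.

```python
import numpy as np
import pandas as pd

def finder(df, row):
    # Same column-by-column filter as in the answer.
    for col in df:
        df = df.loc[(df[col] == row[col]) | (df[col].isnull() & pd.isnull(row[col]))]
    return df

def contains_row(df, row):
    # Hypothetical helper: True if at least one row of df matches row.
    return not finder(df, row).empty

data = pd.DataFrame({"one": [0.1, np.nan, 0.3], "two": [1.0, 2.0, np.nan]})
present = data.iloc[1]                          # contains a NaN in column "one"
absent = pd.Series({"one": 0.1, "two": 2.0})    # no row has this combination

result_present = contains_row(data, present)
result_absent = contains_row(data, absent)
print(result_present, result_absent)            # prints: True False
```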

(I think I once had a much slicker way to do this, but for the life of me I can't remember it now.)
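Since the question involves checking about 10 sampled rows, one alternative worth sketching is a merge on all columns, which checks every sample row in a single call. This is a different technique from the one in the answer, not necessarily the "slicker way" mentioned there. It relies on the documented pandas behaviour that null join keys are matched against each other; rows_present and the sample data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def rows_present(df, samples):
    """For each row of samples, report whether an identical row exists in df."""
    cols = list(df.columns)
    merged = samples.reset_index(drop=True).merge(
        df[cols].drop_duplicates(),   # avoid multiplying rows when df has repeats
        on=cols,
        how="left",
        indicator=True,               # adds a "_merge" column: "both" or "left_only"
    )
    return merged["_merge"] == "both"

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})
samples = pd.DataFrame({"a": [np.nan, 3.0, 9.0], "b": [5.0, np.nan, 4.0]})

found = rows_present(df, samples)
print(found.tolist())
```

A test could then assert found.all() for samples that are expected to come from the source data.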
