Pandas DataFrame: check intersection and fill in a new DataFrame


Question

I have two lists of protein sequences, and I have to check whether every entry exists in each of the two lists, for example:

A = [1, 2, 3, 4]
B = [3, 4, 5]

# Just an example; the result will be written to a CSV file.
# Each row is [value, in A, in B].
result = [
    [1, True, False],
    [2, True, False],   # 2 exists only in the first list
    [3, True, True],    # 3 exists in both lists
    [4, True, True],
    [5, False, True],
]

I load the two sequences into two different dataframes, but I can't figure out how to manipulate them within the dataframes. I ended up loading them into sets, building a list, and converting that back into a dataframe. I think the right way would be to do it natively within the dataframes.

import pandas as pd

def FindDifferences():
    # Read one column of names from each tab-separated annotation file
    df1 = pd.read_csv('Gmax_v6_annotation_info.txt', names=['name'], usecols=[0], delimiter='\t')
    df2 = pd.read_csv('Gmax_v9_annotation_info.txt', names=['name'], usecols=[2], delimiter='\t')
    v6_set = set(df1['name'])
    v9_set = set(df2['name'])
    result = []
    # Names in v6: present in v6, and possibly also in v9
    for val in v6_set:
        if val in v9_set:
            result.append([val, True, True])
        else:
            result.append([val, True, False])
    # Names that appear only in v9
    for val in v9_set:
        if val not in v6_set:
            result.append([val, False, True])
    result_df = pd.DataFrame(result, columns=['name', 'inv6', 'inv9'])
    result_df.to_csv('result_csv.csv', index=False, header=False)

I did try

new_dataframe.loc[new_dataframe.shape[0]] = [val, False, False]

instead of appending to a native list, but it was so slow that I had to abort the run. With the list implementation it takes less than a second.

Recommended answer

You can use merge with indicator=True, which adds a _merge column recording whether each value in the join column exists in the left data frame only, the right one only, or both. You can then derive the two indicator columns from it:

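import pandas as pd

A = [1, 2, 3, 4]   # example lists from the question
B = [3, 4, 5]

df1 = pd.DataFrame({'name': A})
df2 = pd.DataFrame({'name': B})

# The outer merge keeps every name; _merge records which side(s) it came from.
(df1.merge(df2, how='outer', indicator=True)
 .assign(inv6=lambda x: x._merge != "right_only",
         inv9=lambda x: x._merge != "left_only")
 .drop(columns="_merge"))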
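
As a rough end-to-end sketch of the merge approach, the snippet below wraps it in a helper and runs it on the example lists from the question. The function name membership_table is hypothetical. Calling drop_duplicates() first is an assumption that mirrors the set() calls in the question's code, since repeated names would otherwise be multiplied by the merge. The final sort is only there to make the row order predictable.

```python
import pandas as pd


def membership_table(left_names, right_names):
    """Return one row per distinct name with two boolean membership columns.

    Hypothetical helper, not part of pandas.
    """
    # Assumption: duplicates are dropped, as the question's set() calls do.
    left = pd.DataFrame({"name": left_names}).drop_duplicates()
    right = pd.DataFrame({"name": right_names}).drop_duplicates()

    # indicator=True adds a _merge column: left_only, right_only, or both.
    merged = left.merge(right, on="name", how="outer", indicator=True)

    return (
        merged.assign(
            inv6=merged["_merge"] != "right_only",  # present in the left input
            inv9=merged["_merge"] != "left_only",   # present in the right input
        )
        .drop(columns="_merge")
        .sort_values("name")
        .reset_index(drop=True)
    )


table = membership_table([1, 2, 3, 4], [3, 4, 5])
print(table)
```

To reproduce the CSV from the question, the returned frame could be written with table.to_csv('result_csv.csv', index=False, header=False), the same call the original function uses.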
