Using pandas, find the intersecting regions between two DataFrames?

Problem description

I have two pandas DataFrames, using Python 3.x:

import pandas as pd

dict1 = {0:['chr1','chr1','chr1','chr1','chr2'], 
    1:[1, 100, 150, 900, 1], 2:[100, 200, 500, 950, 100], 
    3:['feature1', 'feature2', 'feature3', 'feature4', 'feature4'], 
    4:[0, 0, 0, 0, 0], 5:['+','+','-','+','+']}

df1 = pd.DataFrame(dict1)

print(df1)

##       0    1    2         3  4  5
## 0  chr1    1  100  feature1  0  +
## 1  chr1  100  200  feature2  0  +
## 2  chr1  150  500  feature3  0  -
## 3  chr1  900  950  feature4  0  +
## 4  chr2    1  100  feature4  0  +

dict2 = {0:['chr1','chr1'], 1:[155, 800], 2:[200, 901], 
    3:['feature5', 'feature6'], 4:[0, 0], 5:['-','+']}

df2 = pd.DataFrame(dict2)
print(df2)
##       0    1    2         3  4  5
## 0  chr1  155  200  feature5  0  -
## 1  chr1  800  901  feature6  0  +

The columns to focus on in these DataFrames are the first three: location, start, and end. Each start:end pair describes an interval on its location (e.g. chr1, chr2, chr3).

I would like to output the intersection of df1 against df2, that is, the df1 features that overlap a df2 interval, with start and end trimmed to the overlapping region. Here is the correct output:

chr1    155 200 feature2    0   +
chr1    155 200 feature3    0   -
chr1    900 901 feature4    0   +

Explanation: We find the intersection of df1 against df2. feature2 and feature3 intersect df2 from 155 to 200, and feature4 overlaps df2 from 900 to 901.
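The rule behind this explanation can be written down directly: two intervals on the same location intersect when the larger of the two starts does not exceed the smaller of the two ends, and the intersection runs between those two values. Below is a minimal sketch, assuming inclusive ends (which is consistent with the 900 to 901 row above). `intersect_interval` is a hypothetical helper name, not a pandas function.

```python
def intersect_interval(start_a, end_a, start_b, end_b):
    """Return the (start, end) overlap of two inclusive intervals, or None."""
    start = max(start_a, start_b)  # overlap begins at the later start
    end = min(end_a, end_b)        # overlap stops at the earlier end
    return (start, end) if start <= end else None


print(intersect_interval(100, 200, 155, 200))  # feature2 vs feature5 -> (155, 200)
print(intersect_interval(900, 950, 800, 901))  # feature4 vs feature6 -> (900, 901)
print(intersect_interval(1, 100, 155, 200))    # feature1 vs feature5 -> None
```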

What is the most efficient way (in terms of runtime and RAM) to find the intersections?

There is a Python package which does something similar here: https://daler.github.io/pybedtools/intersections.html

Recommended answer

import pandas as pd

df1 = pd.DataFrame({0:['chr1','chr1','chr1','chr1','chr2'],
    1:[1, 100, 150, 900, 1], 2:[100, 200, 500, 950, 100],
    3:['feature1', 'feature2', 'feature3', 'feature4', 'feature4'],
    4:[0, 0, 0, 0, 0], 5:['+','+','-','+','+']})

df2 = pd.DataFrame({0:['chr1','chr1'], 1:[155, 800], 2:[200, 901],
    3:['feature5', 'feature6'], 4:[0, 0], 5:['-','+']})

You can use apply and a few logical tests to find the overlaps, but you will have to loop over the chromosome groups. You should be able to do something similar to find and fix the starts and ends that need adjusting. If I get time later I'll write something for it.

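new_dfs = []

# Handle one chromosome at a time so only same-chromosome intervals are compared.
for chr_name, chr_df in df1.groupby(0):
    chr_df2 = df2.loc[df2[0] == chr_name]
    # A df1 row overlaps a df2 row when df2's end >= df1's start and df2's start <= df1's end.
    overlapping = (chr_df[1].apply(lambda x: chr_df2[2] >= x) & chr_df[2].apply(lambda x: chr_df2[1] <= x)).any(axis=1)
    # Keep the df1 rows that overlap at least one df2 row (starts and ends are not adjusted here).
    new_dfs.append(chr_df.loc[overlapping, :])

new_dfs = pd.concat(new_dfs)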
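The loop above selects the overlapping df1 rows but leaves their starts and ends unchanged. One possible way to also trim them to the shared region is sketched below. This is not the answerer's code: it swaps in a `merge` on the chromosome column followed by `np.maximum` / `np.minimum`, assumes inclusive ends, and uses `start2` and `end2` as made-up helper column names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({0: ['chr1', 'chr1', 'chr1', 'chr1', 'chr2'],
                    1: [1, 100, 150, 900, 1], 2: [100, 200, 500, 950, 100],
                    3: ['feature1', 'feature2', 'feature3', 'feature4', 'feature4'],
                    4: [0, 0, 0, 0, 0], 5: ['+', '+', '-', '+', '+']})
df2 = pd.DataFrame({0: ['chr1', 'chr1'], 1: [155, 800], 2: [200, 901],
                    3: ['feature5', 'feature6'], 4: [0, 0], 5: ['-', '+']})

# Keep only df2's location/start/end, renamed so they do not clash with df1's columns.
other = df2[[0, 1, 2]].rename(columns={1: 'start2', 2: 'end2'})

# Pair every df1 row with every df2 row on the same chromosome (column 0).
pairs = df1.merge(other, on=0)

# Trim each pair to the shared region: larger start, smaller end.
pairs[1] = np.maximum(pairs[1], pairs['start2'])
pairs[2] = np.minimum(pairs[2], pairs['end2'])

# Pairs whose trimmed start exceeds the trimmed end never overlapped; drop them
# along with the helper columns.
result = pairs.loc[pairs[1] <= pairs[2], [0, 1, 2, 3, 4, 5]].reset_index(drop=True)
print(result)
```

On the sample data this yields the three rows listed in the question. The trade-off is memory: the merge materializes every same-chromosome pair before filtering, so on large inputs it can need far more RAM than the per-chromosome loop.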

Overall this will be memory efficient, but not super fast. If you wanted it fast, you would probably have to write something more complicated that indexes the intervals.
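As one illustration of what such indexing could look like, the sketch below replaces the pairwise comparison with two sorted arrays and `np.searchsorted`. It is an assumption-laden sketch and not the answerer's method: it only works if every interval has start <= end, it treats ends as inclusive, and `overlaps_any` is a hypothetical helper name. Like the loop above, it only selects overlapping df1 rows and does not trim them.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({0: ['chr1', 'chr1', 'chr1', 'chr1', 'chr2'],
                    1: [1, 100, 150, 900, 1], 2: [100, 200, 500, 950, 100],
                    3: ['feature1', 'feature2', 'feature3', 'feature4', 'feature4'],
                    4: [0, 0, 0, 0, 0], 5: ['+', '+', '-', '+', '+']})
df2 = pd.DataFrame({0: ['chr1', 'chr1'], 1: [155, 800], 2: [200, 901],
                    3: ['feature5', 'feature6'], 4: [0, 0], 5: ['-', '+']})


def overlaps_any(chr_df, chr_df2):
    """Boolean array: True where a chr_df row overlaps at least one chr_df2 row."""
    starts2 = np.sort(chr_df2[1].to_numpy())
    ends2 = np.sort(chr_df2[2].to_numpy())
    # Number of df2 intervals that start at or before this row's end ...
    started = np.searchsorted(starts2, chr_df[2].to_numpy(), side='right')
    # ... minus the number that have already ended before this row's start.
    finished = np.searchsorted(ends2, chr_df[1].to_numpy(), side='left')
    return (started - finished) > 0


kept = []
for chr_name, chr_df in df1.groupby(0):
    chr_df2 = df2.loc[df2[0] == chr_name]
    kept.append(chr_df.loc[overlaps_any(chr_df, chr_df2)])

fast = pd.concat(kept)
print(fast)
```

Each chromosome costs a sort of the df2 intervals plus one binary search per df1 row, so no matrix of all pairs is built. On the sample data it keeps feature2, feature3 and the chr1 feature4 row.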
