Using pandas, find the intersecting regions between two DataFrames?
Question
I have two pandas DataFrames, using Python 3.x:
import pandas as pd
dict1 = {0:['chr1','chr1','chr1','chr1','chr2'],
1:[1, 100, 150, 900, 1], 2:[100, 200, 500, 950, 100],
3:['feature1', 'feature2', 'feature3', 'feature4', 'feature4'],
4:[0, 0, 0, 0, 0], 5:['+','+','-','+','+']}
df1 = pd.DataFrame(dict1)
print(df1)
## 0 1 2 3 4 5
## 0 chr1 1 100 feature1 0 +
## 1 chr1 100 200 feature2 0 +
## 2 chr1 150 500 feature3 0 -
## 3 chr1 900 950 feature4 0 +
## 4 chr2 1 100 feature4 0 +
dict2 = {0:['chr1','chr1'], 1:[155, 800], 2:[200, 901],
3:['feature5', 'feature6'], 4:[0, 0], 5:['-','+']}
df2 = pd.DataFrame(dict2)
print(df2)
## 0 1 2 3 4 5
## 0 chr1 155 200 feature5 0 -
## 1 chr1 800 901 feature6 0 +
The columns to focus on in these DataFrames are the first three: location, start, and end. Each start:end pair represents an interval on a location (e.g. chr1, chr2, chr3).
I would like to output the intersection of df1 against df2. Here is the correct output:
chr1 155 200 feature2 0 +
chr1 155 200 feature3 0 -
chr1 900 901 feature4 0 +
Explanation: We find the intersection of df1 against df2. So, feature2 and feature3 intersect df2 at 155 to 200, and feature4 overlaps df2 at 900 to 901.
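To make the explanation concrete, here is a minimal sketch of how two intervals intersect: the overlap starts at the larger of the two starts and ends at the smaller of the two ends. The helper name `intersect` is hypothetical, and the sketch assumes both ends are inclusive, which the question does not state.

```python
def intersect(a_start, a_end, b_start, b_end):
    """Return the overlapping (start, end) of two closed intervals, or None."""
    start = max(a_start, b_start)  # overlap begins at the later start
    end = min(a_end, b_end)        # overlap stops at the earlier end
    return (start, end) if start <= end else None


print(intersect(100, 200, 155, 200))  # feature2 vs feature5
print(intersect(900, 950, 800, 901))  # feature4 vs feature6
print(intersect(1, 100, 155, 200))    # feature1 vs feature5: no overlap
```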
What is the most efficient way (in terms of runtime and RAM) to find the intersections?
There is a Python package that does something similar: https://daler.github.io/pybedtools/intersections.html
Recommended answer
import pandas as pd
df1 = pd.DataFrame({0:['chr1','chr1','chr1','chr1','chr2'],
1:[1, 100, 150, 900, 1], 2:[100, 200, 500, 950, 100],
3:['feature1', 'feature2', 'feature3', 'feature4', 'feature4'],
4:[0, 0, 0, 0, 0], 5:['+','+','-','+','+']})
df2 = pd.DataFrame({0:['chr1','chr1'], 1:[155, 800], 2:[200, 901],
3:['feature5', 'feature6'], 4:[0, 0], 5:['-','+']})
You can use apply and some logical tests to find overlaps. You'll have to loop over the chromosome groups, though. You should be able to do something similar to find and fix the starts and stops that need adjusting. If I get time later I'll write something for it.
new_dfs = []
for chr_name, chr_df in df1.groupby(0):
    # df2 intervals on the same chromosome
    chr_df2 = df2.loc[df2[0] == chr_name]
    # a df1 row overlaps if its start <= a df2 end and its end >= that df2 start
    overlapping = (chr_df[1].apply(lambda x: chr_df2[2] >= x) & chr_df[2].apply(lambda x: chr_df2[1] <= x)).any(axis=1)
    new_dfs.append(chr_df.loc[overlapping, :])
new_dfs = pd.concat(new_dfs)
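The loop above keeps the overlapping df1 rows but leaves their starts and ends unadjusted. One possible way to clip them is sketched below. It is not the answer author's method: it swaps in a merge on the chromosome column, which pairs every df1 row with every df2 row on the same chromosome and so trades memory for simplicity. The column names (`chrom`, `start`, `end`, `name`, `score`, `strand`) are assumptions added for readability, and the intervals are treated as inclusive at both ends.

```python
import pandas as pd

df1 = pd.DataFrame({0: ['chr1', 'chr1', 'chr1', 'chr1', 'chr2'],
                    1: [1, 100, 150, 900, 1], 2: [100, 200, 500, 950, 100],
                    3: ['feature1', 'feature2', 'feature3', 'feature4', 'feature4'],
                    4: [0, 0, 0, 0, 0], 5: ['+', '+', '-', '+', '+']})
df2 = pd.DataFrame({0: ['chr1', 'chr1'], 1: [155, 800], 2: [200, 901],
                    3: ['feature5', 'feature6'], 4: [0, 0], 5: ['-', '+']})

# Hypothetical, BED-like column names (not in the original post).
cols = ["chrom", "start", "end", "name", "score", "strand"]
a = df1.copy()
a.columns = cols
b = df2.copy()
b.columns = cols
b = b[["chrom", "start", "end"]]

# Pair each df1 row with each df2 row on the same chromosome.
pairs = a.merge(b, on="chrom", suffixes=("", "_b"))

# Clip: the overlap runs from the later start to the earlier end.
pairs["start"] = pairs[["start", "start_b"]].max(axis=1)
pairs["end"] = pairs[["end", "end_b"]].min(axis=1)

# Keep only pairs whose clipped interval is non-empty.
result = pairs.loc[pairs["start"] <= pairs["end"], cols].reset_index(drop=True)
print(result)
```

On the example data this yields the three rows the question lists as the correct output. For large inputs the intermediate `pairs` frame can grow to the product of the per-chromosome row counts, so it suits small or moderate tables.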
Overall this is memory efficient, but not especially fast. If you want speed, you would probably have to write something more involved that indexes the intervals.
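As one illustration of what such indexing might look like, the sketch below sorts df2 per chromosome and uses `numpy.searchsorted` to locate the candidate range for every df1 row at once. It rests on an assumption the original post does not make: that the df2 intervals on a given chromosome do not overlap each other, so sorting by start also sorts by end. Intervals are treated as inclusive, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({0: ['chr1', 'chr1', 'chr1', 'chr1', 'chr2'],
                    1: [1, 100, 150, 900, 1], 2: [100, 200, 500, 950, 100],
                    3: ['feature1', 'feature2', 'feature3', 'feature4', 'feature4'],
                    4: [0, 0, 0, 0, 0], 5: ['+', '+', '-', '+', '+']})
df2 = pd.DataFrame({0: ['chr1', 'chr1'], 1: [155, 800], 2: [200, 901],
                    3: ['feature5', 'feature6'], 4: [0, 0], 5: ['-', '+']})

pieces = []
for chrom, a in df1.groupby(0):
    # ASSUMPTION: df2 intervals on one chromosome are mutually non-overlapping,
    # so after sorting by start the end column is sorted as well.
    b = df2.loc[df2[0] == chrom].sort_values(1)
    b_start, b_end = b[1].to_numpy(), b[2].to_numpy()
    a_start, a_end = a[1].to_numpy(), a[2].to_numpy()

    # lo: number of df2 intervals ending before the df1 start.
    # hi: number of df2 intervals starting at or before the df1 end.
    # The df2 positions lo..hi-1 are exactly the overlapping ones.
    lo = np.searchsorted(b_end, a_start, side="left")
    hi = np.searchsorted(b_start, a_end, side="right")
    counts = hi - lo
    if counts.sum() == 0:
        continue

    # Repeat each df1 row once per overlapping df2 interval.
    a_idx = np.repeat(np.arange(len(a)), counts)
    b_idx = np.concatenate([np.arange(l, h) for l, h in zip(lo, hi)])

    out = a.iloc[a_idx].copy()
    out[1] = np.maximum(a_start[a_idx], b_start[b_idx])  # clipped start
    out[2] = np.minimum(a_end[a_idx], b_end[b_idx])      # clipped end
    pieces.append(out)

fast = pd.concat(pieces, ignore_index=True) if pieces else df1.iloc[0:0]
print(fast)
```

Under that assumption the binary searches cost roughly O(n log m) per chromosome plus the size of the output. If df2 intervals can overlap one another, the `lo..hi-1` range is no longer exact and a proper interval index would be needed instead.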