在多个条件下合并数据帧 - 不是特别针对相等的值 [英] Merging DataFrames on multiple conditions - not specifically on equal values
问题描述
首先,对不起,如果这有点冗长,但我想完整描述我遇到的问题以及我已经尝试过的内容.
Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I have having problems with and what I have tried already.
我正在尝试在多个条件下将两个数据帧对象连接(合并)在一起.如果要满足的条件都是等于"运算符,我知道如何执行此操作,但是,我需要使用 LESS THAN 和 MORE THAN.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.
数据框代表遗传信息:一个是基因组中的突变列表(称为 SNP),另一个提供有关基因在人类基因组上的位置的信息.对这些执行 df.head() 返回以下内容:
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
这会显示 SNP 参考 ID 及其位置.BP"代表Base-Pair"位置.
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
这个数据框显示了所有感兴趣基因的位置.
This dataframe shows the locations of all the genes of interest.
我想找出属于基因组中基因区域内的所有 SNP,并丢弃这些区域之外的 SNP.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
如果我想根据多个(等于)条件将两个数据帧合并在一起,我会执行以下操作:
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
但是,在这种情况下 - 我需要找到染色体值与 Gene 数据框中的值匹配的 SNP,并且 BP 值介于chr_start"和chr_stop"之间.具有挑战性的是这些数据框非常大.在当前数据集中,snp_df 有 6795021 行,gene_df 有 34362.
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
我试图通过分别查看染色体或基因来解决这个问题.由于未使用性染色体,因此有 22 个不同的染色体值(整数 1-22).这两种方法都需要很长时间.一种使用 pandasql
模块,而另一种方法是遍历单独的基因.
I have tried to tackle this by either looking at chromosomes or genes seperately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql
module, while the other approach is to loop through the separate genes.
import pandas as pd
import pandasql as psql
pysqldf = lambda q: psql.sqldf(q, globals())
q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
genic_snps = pysqldf(q)
all_dfs.append(genic_snps)
all_genic_snps = pd.concat(all_dfs)
基因迭代法
all_dfs = []
for line in gene_df.iterrows():
info = line[1] # Getting the Series object
this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
(snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
if this_snp.shape[0] != 0:
this_snp = this_snp[['SNP']]
this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
任何人都可以就更有效的方法提出任何建议吗?
Can anyone give any suggestions of a more effective way of doing this?
推荐答案
我刚刚想到了一个方法来解决这个问题——结合我的两种方法:
I've just thought of a way to solve this - by combining my two methods:
首先,关注单个染色体,然后遍历这些较小数据框中的基因.这也不必使用任何 SQL 查询.我还包括一个部分,用于立即识别没有任何 SNP 在其范围内的任何冗余基因.这利用了我通常会尽量避免的双 for 循环 - 但在这种情况下它工作得很好.
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs that fall within their range. This makes use of a double for-loop which I normally try to avoid - but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
# Getting rid of redundant genes
min_bp = this_chr_snp['BP'].min()
max_bp = this_chr_snp['BP'].max()
this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
~(this_genes['chr_stop'] <= min_bp)]
for line in this_genes.iterrows():
info = line[1]
this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
(this_chr_snp['BP'] <= info['chr_stop'])]
if this_snp.shape[0] != 0:
this_snp = this_snp[['SNP']]
this_snp.insert(1, 'feature_id', info['feature_id'])
all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
虽然这不会运行得非常快 - 它确实运行,所以我实际上可以得到一些答案.我仍然想知道是否有人有任何提示可以让它更有效地运行.
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
这篇关于在多个条件下合并数据帧 - 不是特别针对相等的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!