在多个条件下合并数据帧 - 不是特别针对相等的值 [英] Merging DataFrames on multiple conditions - not specifically on equal values

查看:15
本文介绍了在多个条件下合并数据帧 - 不是特别针对相等的值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

首先,对不起,如果这有点冗长,但我想完整描述我遇到的问题以及我已经尝试过的内容.

Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I have having problems with and what I have tried already.

我正在尝试在多个条件下将两个数据帧对象连接(合并)在一起.如果要满足的条件都是等于"运算符,我知道如何执行此操作,但是,我需要使用 LESS THAN 和 MORE THAN.

I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.

数据框代表遗传信息:一个是基因组中的突变列表(称为 SNP),另一个提供有关基因在人类基因组上的位置的信息.对这些执行 df.head() 返回以下内容:

The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:

   chromosome        SNP      BP
0           1  rs3094315  752566
1           1  rs3131972  752721
2           1  rs2073814  753474
3           1  rs3115859  754503
4           1  rs3131956  758144

这会显示 SNP 参考 ID 及其位置.BP"代表Base-Pair"位置.

This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.

   chromosome  chr_start  chr_stop        feature_id
0           1      10954     11507  GeneID:100506145
1           1      12190     13639  GeneID:100652771
2           1      14362     29370     GeneID:653635
3           1      30366     30503  GeneID:100302278
4           1      34611     36081     GeneID:645520

这个数据框显示了所有感兴趣基因的位置.

This dataframe shows the locations of all the genes of interest.

我想找出属于基因组中基因区域内的所有 SNP,并丢弃这些区域之外的 SNP.

What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.

如果我想根据多个(等于)条件将两个数据帧合并在一起,我会执行以下操作:

If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:

merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])

但是,在这种情况下 - 我需要找到染色体值与 Gene 数据框中的值匹配的 SNP,并且 BP 值介于chr_start"和chr_stop"之间.具有挑战性的是这些数据框非常大.在当前数据集中,snp_df 有 6795021 行,gene_df 有 34362.

However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.

我试图通过分别查看染色体或基因来解决这个问题.由于未使用性染色体,因此有 22 个不同的染色体值(整数 1-22).这两种方法都需要很长时间.一种使用 pandasql 模块,而另一种方法是遍历单独的基因.

I have tried to tackle this by either looking at chromosomes or genes seperately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other approach is to loop through the separate genes.

import pandas as pd
import pandasql as psql

pysqldf = lambda q: psql.sqldf(q, globals())

q           = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""

all_dfs = []

for chromosome in snp_df['chromosome'].unique():
    this_snp    = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes  = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps  = pysqldf(q)
    all_dfs.append(genic_snps)

all_genic_snps  = pd.concat(all_dfs)

基因迭代法

all_dfs = []
for line in gene_df.iterrows():
    info    = line[1] # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
            (snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)


all_genic_snps = pd.concat(all_dfs)

任何人都可以就更有效的方法提出任何建议吗?

Can anyone give any suggestions of a more effective way of doing this?

推荐答案

我刚刚想到了一个方法来解决这个问题——结合我的两种方法:

I've just thought of a way to solve this - by combining my two methods:

首先,关注单个染色体,然后遍历这些较小数据框中的基因.这也不必使用任何 SQL 查询.我还包括一个部分,用于立即识别没有任何 SNP 在其范围内的任何冗余基因.这利用了我通常会尽量避免的双 for 循环 - 但在这种情况下它工作得很好.

First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs that fall within their range. This makes use of a double for-loop which I normally try to avoid - but in this case it works quite well.

all_dfs = []

for chromosome in snp_df['chromosome'].unique():
    this_chr_snp    = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes      = gene_df.loc[gene_df['chromosome'] == chromosome]

    # Getting rid of redundant genes
    min_bp      = this_chr_snp['BP'].min()
    max_bp      = this_chr_snp['BP'].max()
    this_genes  = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
            ~(this_genes['chr_stop'] <= min_bp)]

    for line in this_genes.iterrows():
        info     = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp    = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps  = pd.concat(all_dfs)

虽然这不会运行得非常快 - 它确实运行,所以我实际上可以得到一些答案.我仍然想知道是否有人有任何提示可以让它更有效地运行.

While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.

这篇关于在多个条件下合并数据帧 - 不是特别针对相等的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆