Mapping multiple dataframes based on matching columns

Question

I have 25 data frames that I need to merge, and I want to find the rows that recur in all 25 of them. For example, my data frames look like this:

df1
chr start   end     name
1   12334   12334   AAA
1   2342    2342    SAP
2   3456    3456    SOS
3   4537    4537    ABR
df2
chr start   end     name
1   12334   12334   DSF
1   3421    3421    KSF
2   7689    7689    LUF
df3 
chr start   end     name
1   12334   12334   DSF
1   3421    3421    KSF
2   4537    4537    LUF
3   8976    8976    BAR
4   6789    6789    AIN

In the end, I am aiming for an output data frame like the following:

chr start   end     name    Sample
1   12334   12334   AAA df1
1   12334   12334   AAA df2
1   12334   12334   AAA df3

I can get there with the following solution. First, a dictionary collects the three data frames under their names:

dfs = {'df1': df1, 'df2': df2, 'df3': df3}

Then,

common_tups = set.intersection(*[set(df[['chr', 'start', 'end']].drop_duplicates().apply(tuple, axis=1).values) for df in dfs.values()])
pd.concat([df[df[['chr', 'start', 'end']].apply(tuple, axis=1).isin(common_tups)].assign(Sample=name) for (name, df) in dfs.items()])
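
The two lines above can be sketched as a small reusable helper. This is only an illustration of the same idea (intersect the key tuples, then filter each frame); the names `common_rows` and `KEYS` are made up for this example and are not part of pandas, and the sample frames are retyped from the tables in the question.

```python
import pandas as pd

# Columns that identify a row across frames (assumption: same names in every frame).
KEYS = ['chr', 'start', 'end']


def common_rows(dfs, keys=KEYS):
    """Return the rows whose key columns occur in every frame of `dfs`.

    `dfs` maps a sample name to a DataFrame; the name is written to a
    new 'Sample' column. Hypothetical helper, not a pandas function.
    """
    # One set of key tuples per frame, then keep only keys present in all of them.
    key_sets = [set(df[keys].itertuples(index=False, name=None)) for df in dfs.values()]
    common = set.intersection(*key_sets)

    parts = []
    for name, df in dfs.items():
        # Boolean mask: True where this row's key tuple is shared by all frames.
        mask = pd.Series(
            [t in common for t in df[keys].itertuples(index=False, name=None)],
            index=df.index,
            dtype=bool,
        )
        parts.append(df[mask].assign(Sample=name))
    return pd.concat(parts, ignore_index=True)


# Sample data retyped from the question.
df1 = pd.DataFrame({'chr': [1, 1, 2, 3],
                    'start': [12334, 2342, 3456, 4537],
                    'end': [12334, 2342, 3456, 4537],
                    'name': ['AAA', 'SAP', 'SOS', 'ABR']})
df2 = pd.DataFrame({'chr': [1, 1, 2],
                    'start': [12334, 3421, 7689],
                    'end': [12334, 3421, 7689],
                    'name': ['DSF', 'KSF', 'LUF']})
df3 = pd.DataFrame({'chr': [1, 1, 2, 3, 4],
                    'start': [12334, 3421, 4537, 8976, 6789],
                    'end': [12334, 3421, 4537, 8976, 6789],
                    'name': ['DSF', 'KSF', 'LUF', 'BAR', 'AIN']})

out = common_rows({'df1': df1, 'df2': df2, 'df3': df3})
print(out)
```

With this sample data only the key (1, 12334, 12334) is present in all three frames, so the result has one row per frame. Each row keeps its own frame's `name` value (AAA, DSF, DSF), since only `chr`, `start` and `end` are compared.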

This produces a data frame with the matching rows from all three data frames. However, I have 25 data frames, which I list from a directory as follows:

import os

path = 'Fltered_vcfs/'
files = os.listdir(path)
# Full paths of the files whose names start with 'vcf_filtered'
results = [os.path.join(path, i) for i in files if i.startswith('vcf_filtered')]

So how can I turn the list 'results' into such a dictionary and proceed to get the desired output? Any help or suggestions are greatly appreciated.

Thanks

Recommended answer

Using the glob module, you can collect the matching file names with

import os
from glob import glob

path = 'Fltered_vcfs' 
f_names = glob(os.path.join(path, 'vcf_filtered*.*')) 

Then your dictionary can be created with a dictionary comprehension, keyed by each file's base name without its extension:

import pandas as pd

dfs = {os.path.splitext(os.path.split(f_name)[1])[0]: pd.read_csv(f_name, sep='\t') for f_name in f_names}
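
For readability, the glob call and the comprehension can also be wrapped in one function. The following is a minimal sketch, assuming the files are tab-separated with a header row as the answer's `sep='\t'` suggests; the function name `load_frames` and its parameters are hypothetical and chosen only for this example.

```python
import os
from glob import glob

import pandas as pd


def load_frames(path, pattern='vcf_filtered*.*', sep='\t'):
    """Map each matching file's base name (extension stripped) to its DataFrame.

    Hypothetical helper; assumes every matching file is a delimited
    text file that pandas.read_csv can parse with the given separator.
    """
    frames = {}
    # sorted() only makes the dictionary order reproducible between runs.
    for f_name in sorted(glob(os.path.join(path, pattern))):
        key = os.path.splitext(os.path.basename(f_name))[0]
        frames[key] = pd.read_csv(f_name, sep=sep)
    return frames
```

Calling `dfs = load_frames('Fltered_vcfs')` would then give a dictionary that can be passed straight to the `set.intersection` and `pd.concat` lines from the question. Note that two files sharing a base name but differing in extension would map to the same key, so the later one would overwrite the earlier one.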
