Pandas Merging DataFrames Module
Problem Description
I am somewhat new to Python and trying to use the Pandas module. Below are my sample files (the first element of each line is the read_name; the second is the methylation_state; and the third is the position).
My goal is to first extract all lines with '+' in Input_Sample1.txt and Input_Sample2.txt, which I was able to do.
Second, merge the two data frames to extract positions that are in the first DF but not the second; and then extract positions that are in the second DF but not in the first.
This is what I have thus far; for both the m1 and m2 DFs I get the following warning:
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
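That warning stems from an index mismatch rather than the merge itself; a minimal sketch with made-up toy values shows the situation:

```python
import pandas as pd

# Toy illustration (values made up): after df2 is filtered on '+', its index
# keeps gaps from the dropped rows, while m1 produced by pd.merge gets a fresh
# 0..n-1 index. Indexing df2 with a mask built from m1 therefore forces pandas
# to reindex the mask, which is exactly what the UserWarning reports.
df2 = pd.DataFrame({'position': [10, 20, 30]}, index=[1, 2, 4])
mask = pd.Series([True, False, True])  # default index 0, 1, 2

indexes_match = df2.index.equals(mask.index)  # False: the source of the warning
```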
#!/usr/bin/env python
from __future__ import print_function
import sys
import pandas as pd
df1 = pd.read_csv('Input_Sample1.txt', names=['read_name', 'methylation_state', 'position'], usecols=['position', 'methylation_state'], delimiter=r'\s+')
df1 = df1[df1.methylation_state == '+']
# print('df1 %s' % ('-' * 50))
# print(df1)
df2 = pd.read_csv('Input_Sample2.txt', names=['read_name', 'methylation_state', 'position'], usecols=['position', 'methylation_state'], delimiter=r'\s+')
df2 = df2[df2.methylation_state == '+']
# print('df2 %s' % ('-' * 50))
# print(df2)
# I get a warning for the following merged dataframes m1 and m2:
m1 = pd.merge(df1, df2, how='left', on='position')
print('df2 - df1 %s' % ('-' * 50))
print(df2[m1['methylation_state_y'].isnull()])
m2 = pd.merge(df1, df2, how='left', on='position')
print('df1 - df2 %s' % ('-' * 50))
print(df1[m2['methylation_state_y'].isnull()])
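For reference, the two set differences the code above is after can also be obtained directly with pandas' merge indicator, which sidesteps the index mismatch entirely; a minimal sketch on toy data (positions made up):

```python
import pandas as pd

# Toy frames standing in for the filtered df1/df2 (positions are made up).
df1 = pd.DataFrame({'methylation_state': ['+'] * 3,
                    'position': [37151024, 23189251, 60644021]})
df2 = pd.DataFrame({'methylation_state': ['+'] * 3,
                    'position': [23189251, 18921911, 51460469]})

# An outer merge with indicator=True tags each row with which frame(s)
# the position came from: 'left_only', 'right_only', or 'both'.
m = pd.merge(df1, df2, how='outer', on='position', indicator=True)

only_in_df1 = m.loc[m['_merge'] == 'left_only', 'position']
only_in_df2 = m.loc[m['_merge'] == 'right_only', 'position']
```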
Input_Sample1.txt:
SRR1035452.114_CRIRUN_726:7:1101:3884:2095_length=36 + 37151024
SRR1035452.114_CRIRUN_726:7:1101:3884:2095_length=36 + 37151031
SRR1035452.114_CRIRUN_726:7:1101:3884:2095_length=36 + 37151189
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36 + 23189251
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36 + 23189248
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36 + 23189242
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36 + 23189086
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36 + 23189101
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644021
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644026
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644032
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644038
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644042
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644050
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644055
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644267
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644253
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644246
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644240
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644236
SRR1035452.336_CRIRUN_726:7:1101:8029:2240_length=36 + 26775201
SRR1035452.336_CRIRUN_726:7:1101:8029:2240_length=36 + 26775193
SRR1035452.336_CRIRUN_726:7:1101:8029:2240_length=36 + 26775178
SRR1035452.336_CRIRUN_726:7:1101:8029:2240_length=36 + 26775012
SRR1035452.377_CRIRUN_726:7:1101:9240:2160_length=36 + 27851064
SRR1035452.377_CRIRUN_726:7:1101:9240:2160_length=36 + 27851253
Input_Sample2.txt:
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 - 18921902
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 + 18921911
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 + 18921919
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 + 18921926
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 + 18922145
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460469
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460488
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460631
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460613
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460608
SRR1035454.215_CRIRUN_726:7:1101:7106:2100_length=36 - 30309836
SRR1035454.216_CRIRUN_726:7:1101:7129:2116_length=36 + 31856610
SRR1035454.216_CRIRUN_726:7:1101:7129:2116_length=36 + 31856602
SRR1035454.216_CRIRUN_726:7:1101:7129:2116_length=36 + 31856255
SRR1035454.270_CRIRUN_726:7:1101:8134:2171_length=36 + 26078372
SRR1035454.270_CRIRUN_726:7:1101:8134:2171_length=36 + 26078363
SRR1035454.306_CRIRUN_726:7:1101:9223:2098_length=36 + 55329938
SRR1035454.348_CRIRUN_726:7:1101:10157:2107_length=36 + 40179303
SRR1035454.348_CRIRUN_726:7:1101:10157:2107_length=36 + 40179299
SRR1035454.348_CRIRUN_726:7:1101:10157:2107_length=36 + 40179018
Part of DF1 input:
0 + 37151024
1 + 37151031
2 + 37151189
3 + 23189251
4 + 23189248
5 + 23189242
6 + 23189086
7 + 23189101
8 + 60644021
9 + 60644026
10 + 60644032
11 + 60644038
12 + 60644042
13 + 60644050
14 + 60644055
15 + 60644267
16 + 60644253
17 + 60644246
18 + 60644240
Part of DF2 output:
methylation_state position
1 + 18921911
2 + 18921919
3 + 18921926
4 + 18922145
5 + 51460469
6 + 51460488
7 + 51460631
8 + 51460613
9 + 51460608
11 + 31856610
12 + 31856602
13 + 31856255
14 + 26078372
PLEASE NOTE: each text file contains about 80k lines. Any help/advice is much appreciated!
Solution: Try this:

#!/usr/bin/env python
from __future__ import print_function
import sys
import pandas as pd
sys.stdout = open('CHG_comparison.txt', 'w')
ESfemale = pd.read_csv('Input_Sample1.txt', names=['read_name', 'methylation_state', 'position'], usecols=['position', 'methylation_state'], delimiter=r'\s+')
ESfemale = ESfemale[ESfemale.methylation_state == '+']
# print('ESfemale CHG context of all methylation sites %s' % ('-' * 50))
# print(ESfemale)
EpiSC = pd.read_csv('Input_Sample2.txt', names=['read_name', 'methylation_state', 'position'], usecols=['position', 'methylation_state'], delimiter=r'\s+')
EpiSC = EpiSC[EpiSC.methylation_state == '+']
# print('EpiSC CHG context of all methylation sites %s' % ('-' * 50))
# print(EpiSC)
# print(ESfemale[['methylation_state', 'position']].isin(EpiSC.to_dict(orient='list')))
# .loc used here; the .ix indexer has since been removed from pandas
diff1 = ESfemale.loc[~ESfemale[['methylation_state', 'position']].isin(EpiSC.to_dict(orient='list')).all(axis=1)]
print(diff1)
diff1.to_csv('diff1.csv')
diff2 = EpiSC.loc[~EpiSC[['methylation_state', 'position']].isin(ESfemale.to_dict(orient='list')).all(axis=1)]
print(diff2)
diff2.to_csv('diff2.csv')
PS: there were no "intersecting" sets in your sample files, so I had to add a few rows from file1 to file2 and vice versa in order to test it.
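The isin-based difference can be seen on toy data (values made up). One caveat worth knowing: isin() against a dict of lists tests each column's membership independently, not as (state, position) pairs; here that is harmless because every row's methylation_state is already '+' after filtering.

```python
import pandas as pd

# Toy frames (values made up); both already filtered to '+' rows.
ESfemale = pd.DataFrame({'methylation_state': ['+', '+'],
                         'position': [100, 200]})
EpiSC = pd.DataFrame({'methylation_state': ['+', '+'],
                      'position': [200, 300]})

# isin() checks each column's values against the corresponding list in the
# dict; .all(axis=1) keeps only rows where every column matched, and ~ then
# selects the rows NOT fully present in the other frame.
diff1 = ESfemale.loc[~ESfemale[['methylation_state', 'position']].isin(
    EpiSC.to_dict(orient='list')).all(axis=1)]
diff2 = EpiSC.loc[~EpiSC[['methylation_state', 'position']].isin(
    ESfemale.to_dict(orient='list')).all(axis=1)]
```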