Pandas Merging DataFrames Module

Views: 233 | Published: 2017/3/26 3:12:21 | Tags: python, pandas, module, dataframe


Problem description

I am somewhat new to Python and trying to use the Pandas module. Below are my sample files (first element of each line is the read_name; second element is the methylation_state; and third is the position). 

My goal is first to extract all lines with '+' from Input_Sample1.txt and Input_Sample2.txt, which I was able to do.

Second, I want to merge the two data frames to extract the positions that are in the first DF but not in the second one, and then the positions that are in the second DF but not in the first one.

This is what I have so far. Both the m1 and m2 steps produce the following warning:

    UserWarning: Boolean Series key will be reindexed to match DataFrame index.

My code:

    #!/usr/bin/env python
    from __future__ import print_function
    import sys
    import pandas as pd

    df1 = pd.read_csv('Input_Sample1.txt', names=['read_name', 'methylation_state', 'position'], usecols=['position', 'methylation_state'], delimiter=r'\s+')
    df1 = df1[(df1.methylation_state == '+')]
    # print('df1 %s' % ('-' * 50))
    # print(df1)

    df2 = pd.read_csv('Input_Sample2.txt', names=['read_name', 'methylation_state', 'position'], usecols=['position', 'methylation_state'], delimiter=r'\s+')
    df2 = df2[(df2.methylation_state == '+')]
    # print('df2 %s' % ('-' * 50))
    # print(df2)

    # I get the warning for the following merged dataframes m1 and m2:
    m1 = pd.merge(df1, df2, how='left', on='position')
    print('df2 - df1 %s' % ('-' * 50))
    print(df2[m1['methylation_state_y'].isnull()])

    m2 = pd.merge(df1, df2, how='left', on='position')
    print('df1 - df2 %s' % ('-' * 50))
    print(df1[m2['methylation_state_y'].isnull()])

Input_Sample1.txt:

SRR1035452.114_CRIRUN_726:7:1101:3884:2095_length=36    +   37151024
SRR1035452.114_CRIRUN_726:7:1101:3884:2095_length=36    +   37151031
SRR1035452.114_CRIRUN_726:7:1101:3884:2095_length=36    +   37151189
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36    +   23189251
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36    +   23189248
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36    +   23189242
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36    +   23189086
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36    +   23189101
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36    +   60644021
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36    +   60644026
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36    +   60644032
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36    +   60644038
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36    +   60644042
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36    +   60644050
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36    +   60644055
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36    +   60644267
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36    +   60644253
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36    +   60644246
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36    +   60644240
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36    +   60644236
SRR1035452.336_CRIRUN_726:7:1101:8029:2240_length=36    +   26775201
SRR1035452.336_CRIRUN_726:7:1101:8029:2240_length=36    +   26775193
SRR1035452.336_CRIRUN_726:7:1101:8029:2240_length=36    +   26775178
SRR1035452.336_CRIRUN_726:7:1101:8029:2240_length=36    +   26775012
SRR1035452.377_CRIRUN_726:7:1101:9240:2160_length=36    +   27851064
SRR1035452.377_CRIRUN_726:7:1101:9240:2160_length=36    +   27851253

Input_Sample2.txt:

SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 -   18921902
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 +   18921911
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 +   18921919
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 +   18921926
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 +   18922145
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36    +   51460469
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36    +   51460488
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36    +   51460631
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36    +   51460613
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36    +   51460608
SRR1035454.215_CRIRUN_726:7:1101:7106:2100_length=36    -   30309836
SRR1035454.216_CRIRUN_726:7:1101:7129:2116_length=36    +   31856610
SRR1035454.216_CRIRUN_726:7:1101:7129:2116_length=36    +   31856602
SRR1035454.216_CRIRUN_726:7:1101:7129:2116_length=36    +   31856255
SRR1035454.270_CRIRUN_726:7:1101:8134:2171_length=36    +   26078372
SRR1035454.270_CRIRUN_726:7:1101:8134:2171_length=36    +   26078363
SRR1035454.306_CRIRUN_726:7:1101:9223:2098_length=36    +   55329938
SRR1035454.348_CRIRUN_726:7:1101:10157:2107_length=36   +   40179303
SRR1035454.348_CRIRUN_726:7:1101:10157:2107_length=36   +   40179299
SRR1035454.348_CRIRUN_726:7:1101:10157:2107_length=36   +   40179018

Part of the df1 output:

0                     +  37151024
1                     +  37151031
2                     +  37151189
3                     +  23189251
4                     +  23189248
5                     +  23189242
6                     +  23189086
7                     +  23189101
8                     +  60644021
9                     +  60644026
10                    +  60644032
11                    +  60644038
12                    +  60644042
13                    +  60644050
14                    +  60644055
15                    +  60644267
16                    +  60644253
17                    +  60644246
18                    +  60644240

Part of the df2 output:

      methylation_state  position
1                     +  18921911
2                     +  18921919
3                     +  18921926
4                     +  18922145
5                     +  51460469
6                     +  51460488
7                     +  51460631
8                     +  51460613
9                     +  51460608
11                    +  31856610
12                    +  31856602
13                    +  31856255
14                    +  26078372

Please note: each text file contains about 80k lines. Any help or advice is much appreciated!

Solution

Try this:

    #!/usr/bin/env python
    from __future__ import print_function
    import sys
    import pandas as pd

    # Send everything printed below to a file instead of the console.
    sys.stdout = open('CHG_comparison.txt', 'w')

    ESfemale = pd.read_csv('Input_Sample1.txt', names=['read_name', 'methylation_state', 'position'], usecols=['position', 'methylation_state'], delimiter=r'\s+')
    ESfemale = ESfemale[(ESfemale.methylation_state == '+')]
    # print('ESfemale CHG context of all methylation sites %s' % ('-' * 50))
    # print(ESfemale)

    EpiSC = pd.read_csv('Input_Sample2.txt', names=['read_name', 'methylation_state', 'position'], usecols=['position', 'methylation_state'], delimiter=r'\s+')
    EpiSC = EpiSC[(EpiSC.methylation_state == '+')]
    # print('EpiSC CHG context of all methylation sites %s' % ('-' * 50))
    # print(EpiSC)
    # print(ESfemale[['methylation_state', 'position']].isin(EpiSC.to_dict(orient='list')))

    # Rows of ESfemale whose values do not all appear in EpiSC.
    diff1 = ESfemale.loc[~ESfemale[['methylation_state', 'position']].isin(EpiSC.to_dict(orient='list')).all(axis=1)]
    print(diff1)
    diff1.to_csv('diff1.csv')

    # Rows of EpiSC whose values do not all appear in ESfemale.
    diff2 = EpiSC.loc[~EpiSC[['methylation_state', 'position']].isin(ESfemale.to_dict(orient='list')).all(axis=1)]
    print(diff2)
    diff2.to_csv('diff2.csv')
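
To make the `isin` step easier to follow, here is a minimal sketch on toy data. The frames `left` and `right` and their values are made up for illustration; they stand in for the filtered ESfemale and EpiSC frames and are not taken from the sample files.

```python
import pandas as pd

# Hypothetical toy frames standing in for the filtered ESfemale and EpiSC data.
left = pd.DataFrame({"methylation_state": ["+", "+", "+"],
                     "position": [100, 200, 300]})
right = pd.DataFrame({"methylation_state": ["+", "+"],
                      "position": [200, 400]})

# isin() with a dict tests each column against that column's own list of values.
membership = left[["methylation_state", "position"]].isin(right.to_dict(orient="list"))

# A row counts as "present in right" only if every column matched.
in_right = membership.all(axis=1)

# Keep the rows of left whose position does not occur in right.
only_left = left.loc[~in_right]
print(only_left["position"].tolist())  # [100, 300]
```

Note that the check is done per column, not per row. It gives the intended result here because every remaining row has methylation_state '+', so only the position column can differ.
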

P.S. There were no intersecting positions in your sample files, so I had to add a few rows from file 1 to file 2 and vice versa in order to test it.
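
As an aside, the warning in the question most likely comes from building the boolean mask on the merged frame, whose index is a fresh 0..n-1 range, and then applying it to df1 or df2, which still carry their original row labels after filtering. If you would rather stay with `merge`, one alternative (not the approach used above) is an outer merge with `indicator=True`. The sketch below uses made-up toy frames with one shared position, and it assumes that only the distinct position values matter, so duplicates are dropped first.

```python
import pandas as pd

# Hypothetical toy data with one shared position (200),
# since the posted sample files do not overlap.
df1 = pd.DataFrame({"methylation_state": ["+", "+", "+"],
                    "position": [100, 200, 300]})
df2 = pd.DataFrame({"methylation_state": ["+", "+"],
                    "position": [200, 400]})

# An outer merge with indicator=True adds a _merge column whose values
# are 'left_only', 'right_only', or 'both'.
merged = pd.merge(df1[["position"]].drop_duplicates(),
                  df2[["position"]].drop_duplicates(),
                  on="position", how="outer", indicator=True)

# The mask is built from merged and applied to merged, so the indexes match
# and no reindexing warning is raised.
only_df1 = merged.loc[merged["_merge"] == "left_only", "position"]
only_df2 = merged.loc[merged["_merge"] == "right_only", "position"]

print(only_df1.tolist())  # [100, 300]
print(only_df2.tolist())  # [400]
```

Because the mask and the frame it filters are the same object, this avoids the index mismatch entirely.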
