Using Python to extract the records in one file that match IDs listed in another


Problem description

I have two files.

The first file, exemple_data.csv, contains 3 IDs, one per line:


ZINC04203483
ZINC26895155
ZINC03651026

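The second file, exemple.sdf, contains ten molecules. Each molecule has an ID followed by its spatial structure data, and each molecule ends with a line of four dollar signs, "$$$$":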


ZINC04203483


  7  6  0  0  0  0  0  0  0  0999 V2000
    1.7848   -1.3593   -0.0709 C   0  0  0  0  0
    1.2676   -3.5870    0.7267 C   0  0  0  0  0
    1.0097   -2.1011    0.9436 C   0  0  0  0  0
    1.6939   -0.0371   -0.0717 N   0  0  0  0  0
    2.5202   -2.0619   -0.9208 N   0  0  0  0  0
    2.4714   -3.9467    0.8577 O   0  0  0  0  0
    0.2468   -4.2712    0.4339 O   0  0  0  0  0
  1  4  1  0  0  0
  2  6  1  0  0  0
  3  1  1  0  0  0
  3  2  1  0  0  0
  1  5  2  0  0  0
  2  7  2  0  0  0
M  CHG  2   5   1   6  -1
M  END
> <rmsd>
0.238019541

$$$$
ZINC02034713


  7  6  0  0  0  0  0  0  0  0999 V2000
    1.4359   -3.6052    0.4738 C   0  0  0  0  0
    1.9307   -1.1052    0.7490 C   0  0  0  0  0
    1.5337   -2.2272   -0.1964 C   0  0  0  0  0
    1.5927    0.2012    0.1266 N   0  0  0  0  0
    2.4694   -4.0171    1.0694 O   0  0  0  0  0
    0.3107   -4.1689    0.3418 O   0  0  0  0  0
    2.5239   -2.3360   -1.2177 O   0  0  0  0  0
  1  5  1  0  0  0
  2  3  1  0  0  0
  2  4  1  0  0  0
  3  1  1  0  0  0
  3  7  1  0  0  0
  1  6  2  0  0  0
M  CHG  2   4   1   5  -1
M  END
> <rmsd>
0.0787463188

$$$$
ZINC02034711


  7  6  0  0  0  0  0  0  0  0999 V2000
    1.6225   -3.6225    0.5829 C   0  0  0  0  0
    1.0839   -1.1178    0.4821 C   0  0  0  0  0
    2.0739   -2.2211    0.1469 C   0  0  0  0  0
    1.6545    0.1920    0.0735 N   0  0  0  0  0
    0.5089   -4.0191    0.1414 O   0  0  0  0  0
    2.4376   -4.2168    1.3471 O   0  0  0  0  0
    2.2421   -2.2653   -1.2693 O   0  0  0  0  0
  1  5  1  0  0  0
  2  3  1  0  0  0
  2  4  1  0  0  0
  3  1  1  0  0  0
  3  7  1  0  0  0
  1  6  2  0  0  0
M  CHG  2   4   1   5  -1
M  END
> <rmsd>
0.279566735

$$$$
ZINC26895155


  8  7  0  0  0  0  0  0  0  0999 V2000
    2.1705   -1.5475   -0.5415 C   0  0  0  0  0
    1.3387   -3.5612    0.6628 C   0  0  0  0  0
    1.3018   -2.0375    0.6037 C   0  0  0  0  0
    2.2100   -0.2617   -0.7298 N   0  0  0  0  0
    2.8130   -2.5199   -1.2719 N   0  0  0  0  0
    2.4811   -4.0619    0.8624 O   0  0  0  0  0
    0.2238   -4.1310    0.4963 O   0  0  0  0  0
    1.4055    0.3868    0.2119 O   0  0  0  0  0
  1  5  1  0  0  0
  2  6  1  0  0  0
  3  1  1  0  0  0
  3  2  1  0  0  0
  4  8  1  0  0  0
  1  4  2  0  0  0
  2  7  2  0  0  0
M  CHG  1   6  -1
M  END
> <rmsd>
0.274481624

$$$$
ZINC01695856


  8  7  0  0  0  0  0  0  0  0999 V2000
    1.4057   -3.6199    0.4828 C   0  0  0  0  0
    0.6383   -0.9506    1.9111 C   0  0  0  0  0
    1.4135   -2.2167   -0.1491 C   0  0  0  0  0
    1.6928   -1.0605    0.8132 C   0  0  0  0  0
    2.4525   -3.9696    1.0940 O   0  0  0  0  0
    0.3286   -4.2614    0.3095 O   0  0  0  0  0
    2.4250   -2.2353   -1.1545 O   0  0  0  0  0
    1.6953    0.1565    0.0693 O   0  0  0  0  0
  1  5  1  0  0  0
  2  4  1  0  0  0
  3  1  1  0  0  0
  3  4  1  0  0  0
  3  7  1  0  0  0
  4  8  1  0  0  0
  1  6  2  0  0  0
M  CHG  1   5  -1
M  END
> <rmsd>
0.0781114399

$$$$
ZINC01695854


  8  7  0  0  0  0  0  0  0  0999 V2000
    1.6021   -3.5832    0.5544 C   0  0  0  0  0
   -0.1123   -1.0849   -0.8065 C   0  0  0  0  0
    2.0136   -2.1983    0.0239 C   0  0  0  0  0
    0.9936   -1.0796    0.2454 C   0  0  0  0  0
    0.5225   -4.0604    0.1088 O   0  0  0  0  0
    2.4141   -4.0828    1.3866 O   0  0  0  0  0
    2.2393   -2.3565   -1.3754 O   0  0  0  0  0
    1.6735    0.1723    0.1761 O   0  0  0  0  0
  1  5  1  0  0  0
  2  4  1  0  0  0
  3  1  1  0  0  0
  3  4  1  0  0  0
  3  7  1  0  0  0
  4  8  1  0  0  0
  1  6  2  0  0  0
M  CHG  1   5  -1
M  END
> <rmsd>
0.284852803

$$$$
ZINC13352867


  8  7  0  0  0  0  0  0  0  0999 V2000
    1.3740   -3.6291    0.4754 C   0  0  0  0  0
    0.5507   -0.9450    1.8830 C   0  0  0  0  0
    1.3678   -2.2326   -0.1626 C   0  0  0  0  0
    1.6446   -1.1066    0.8351 C   0  0  0  0  0
    1.7289    0.1781    0.0725 N   0  0  0  0  0
    2.4415   -3.9229    1.0855 O   0  0  0  0  0
    0.3299   -4.3189    0.3058 O   0  0  0  0  0
    2.4081   -2.2413   -1.1410 O   0  0  0  0  0
  1  6  1  0  0  0
  2  4  1  0  0  0
  3  1  1  0  0  0
  3  4  1  0  0  0
  3  8  1  0  0  0
  4  5  1  0  0  0
  1  7  2  0  0  0
M  CHG  2   5   1   6  -1
M  END
> <rmsd>
0.0959857255

$$$$
ZINC01695855


  8  7  0  0  0  0  0  0  0  0999 V2000
    1.6218   -3.6149    0.5878 C   0  0  0  0  0
    0.9014   -0.9485    2.0417 C   0  0  0  0  0
    2.0724   -2.2038    0.1703 C   0  0  0  0  0
    1.1102   -1.0715    0.5348 C   0  0  0  0  0
    0.5070   -4.0057    0.1448 O   0  0  0  0  0
    2.4420   -4.2214    1.3368 O   0  0  0  0  0
    2.2392   -2.2394   -1.2457 O   0  0  0  0  0
    1.6552    0.1562    0.0551 O   0  0  0  0  0
  1  5  1  0  0  0
  2  4  1  0  0  0
  3  1  1  0  0  0
  3  4  1  0  0  0
  3  7  1  0  0  0
  4  8  1  0  0  0
  1  6  2  0  0  0
M  CHG  1   5  -1
M  END
> <rmsd>
0.280759811

$$$$
ZINC03651026


  8  7  0  0  0  0  0  0  0  0999 V2000
    1.4934   -3.7154    0.5054 C   0  0  0  0  0
    2.4732   -1.3603    0.9745 C   0  0  0  0  0
    2.6877    0.0369    0.4066 C   0  0  0  0  0
    1.7876   -2.3003   -0.0110 C   0  0  0  0  0
    2.5054   -4.3269    0.9548 O   0  0  0  0  0
    0.2927   -4.0978    0.4363 O   0  0  0  0  0
    1.4341    0.6134    0.0639 O   0  0  0  0  0
    2.6547   -2.4571   -1.1350 O   0  0  0  0  0
  1  5  1  0  0  0
  2  3  1  0  0  0
  2  4  1  0  0  0
  3  7  1  0  0  0
  4  1  1  0  0  0
  4  8  1  0  0  0
  1  6  2  0  0  0
M  CHG  1   5  -1
M  END
> <rmsd>
0.315417558

$$$$
ZINC13352859


  8  7  0  0  0  0  0  0  0  0999 V2000
    1.6269   -3.5849    0.5524 C   0  0  0  0  0
   -0.0728   -1.1104   -0.9226 C   0  0  0  0  0
    2.0361   -2.2127   -0.0019 C   0  0  0  0  0
    0.9669   -1.1360    0.1908 C   0  0  0  0  0
    1.6474    0.1967    0.2000 N   0  0  0  0  0
    0.5319   -4.0279    0.1019 O   0  0  0  0  0
    2.4154   -4.0983    1.3946 O   0  0  0  0  0
    2.2503   -2.4010   -1.4013 O   0  0  0  0  0
  1  6  1  0  0  0
  2  4  1  0  0  0
  3  1  1  0  0  0
  3  4  1  0  0  0
  3  8  1  0  0  0
  4  5  1  0  0  0
  1  7  2  0  0  0
M  CHG  2   5   1   6  -1
M  END
> <rmsd>
0.302429646

$$$$

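I want to use the 3 IDs in the first file to look up the corresponding molecules in the second file and write them to a new file. Alternatively, they could be written to 6 new files, each containing all the information for one molecule, including the closing "$$$$".

I wrote a program myself, but I cannot get it to work. Could someone help me fix it or rewrite it?

My program is run like this: python tire_database_sdf.py exemple_data.csv exemple.sdf result.csv


The problem is that result.csv ends up with the same ten molecules as exemple.sdf, whereas I want it to contain only the six molecules that correspond to the 3 IDs in exemple_data.csv.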

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import re

filename = sys.argv[1]
inputfile = sys.argv[2]
outfile = sys.argv[3]

def liste_id(filename):
    list_id = []
    with open(filename,"r") as f:
        for i in f:
            i = i.strip("\n")
            list_id.append(i)
        return list_id


identifiant = liste_id(filename)

filout = open(outfile,"w")
with open(inputfile,"r") as filin:
    newmol = False
    element = []
    for line in filin:
        for ele in identifiant:   
            if re.search(ele,line):
                newmol = True   
        if line == "$$$$":  
             newmol = False
        if newmol == True:
             filout.write(line)

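Solution

A question first: shouldn't the output contain only 3 molecules? Where does 6 come from? As far as I can see, your ID list only matches 3 of them.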

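There is one mistake in the code: when Python reads a line from a file, the line still ends with '\n'. The comparison line == "$$$$" is therefore never true, newmol is never reset to False, and every line after the first matching ID gets written. Strip the newline before comparing.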
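The effect is easy to reproduce in isolation. The snippet below is only an illustration: it uses `io.StringIO` as an in-memory stand-in for a two-line file rather than the real exemple.sdf, and the variable names are made up for the demo.

```python
import io

# In-memory stand-in for a two-line file; not the real exemple.sdf.
fake_file = io.StringIO("M  END\n$$$$\n")

raw_matches = []       # comparing the line exactly as it was read
stripped_matches = []  # comparing after removing the trailing newline

for line in fake_file:
    raw_matches.append(line == "$$$$")
    stripped_matches.append(line.rstrip("\n") == "$$$$")

print(raw_matches)       # [False, False]
print(stripped_matches)  # [False, True]
```

The raw comparison never succeeds because the second line is actually "$$$$\n".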

Changing just two lines is enough. I have commented both of them:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import re

filename = sys.argv[1]
inputfile = sys.argv[2]
outfile = sys.argv[3]

def liste_id(filename):
    list_id = []
    with open(filename,"r") as f:
        for i in f:
            i = i.strip("\n")
            list_id.append(i)
        return list_id


identifiant = liste_id(filename)

filout = open(outfile,"w")
with open(inputfile,"r") as filin:
    newmol = False
    element = []
    for line in filin:
        line = line.strip()  # strip '\n' character
        for ele in identifiant:
            if re.search(ele,line):
                newmol = True
        if line == "$$$$":
             newmol = False
        if newmol == True:
             filout.write(line + '\n') # append '\n' character

Output:

ZINC04203483


7  6  0  0  0  0  0  0  0  0999 V2000
1.7848   -1.3593   -0.0709 C   0  0  0  0  0
1.2676   -3.5870    0.7267 C   0  0  0  0  0
1.0097   -2.1011    0.9436 C   0  0  0  0  0
1.6939   -0.0371   -0.0717 N   0  0  0  0  0
2.5202   -2.0619   -0.9208 N   0  0  0  0  0
2.4714   -3.9467    0.8577 O   0  0  0  0  0
0.2468   -4.2712    0.4339 O   0  0  0  0  0
1  4  1  0  0  0
2  6  1  0  0  0
3  1  1  0  0  0
3  2  1  0  0  0
1  5  2  0  0  0
2  7  2  0  0  0
M  CHG  2   5   1   6  -1
M  END
> <rmsd>
0.238019541

ZINC26895155


8  7  0  0  0  0  0  0  0  0999 V2000
2.1705   -1.5475   -0.5415 C   0  0  0  0  0
1.3387   -3.5612    0.6628 C   0  0  0  0  0
1.3018   -2.0375    0.6037 C   0  0  0  0  0
2.2100   -0.2617   -0.7298 N   0  0  0  0  0
2.8130   -2.5199   -1.2719 N   0  0  0  0  0
2.4811   -4.0619    0.8624 O   0  0  0  0  0
0.2238   -4.1310    0.4963 O   0  0  0  0  0
1.4055    0.3868    0.2119 O   0  0  0  0  0
1  5  1  0  0  0
2  6  1  0  0  0
3  1  1  0  0  0
3  2  1  0  0  0
4  8  1  0  0  0
1  4  2  0  0  0
2  7  2  0  0  0
M  CHG  1   6  -1
M  END
> <rmsd>
0.274481624

ZINC03651026


8  7  0  0  0  0  0  0  0  0999 V2000
1.4934   -3.7154    0.5054 C   0  0  0  0  0
2.4732   -1.3603    0.9745 C   0  0  0  0  0
2.6877    0.0369    0.4066 C   0  0  0  0  0
1.7876   -2.3003   -0.0110 C   0  0  0  0  0
2.5054   -4.3269    0.9548 O   0  0  0  0  0
0.2927   -4.0978    0.4363 O   0  0  0  0  0
1.4341    0.6134    0.0639 O   0  0  0  0  0
2.6547   -2.4571   -1.1350 O   0  0  0  0  0
1  5  1  0  0  0
2  3  1  0  0  0
2  4  1  0  0  0
3  7  1  0  0  0
4  1  1  0  0  0
4  8  1  0  0  0
1  6  2  0  0  0
M  CHG  1   5  -1
M  END
> <rmsd>
0.315417558
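Note that the output above has lost the leading spaces on each line and the "$$$$" terminators. `strip()` removes whitespace on both sides, and the terminator line is never written. The atom and bond blocks of a V2000 record are column-aligned, so this may matter to tools that read the result later. Below is a minimal sketch of an alternative that keeps every line exactly as it was read. It assumes that the ID is the first line of each record and should match exactly (a set lookup instead of `re.search`, which would also match an ID that is a substring of another line). The function names are hypothetical.

```python
def iter_sdf_records(lines):
    """Yield each record as a list of raw lines, ending with its '$$$$' line."""
    record = []
    for line in lines:
        record.append(line)
        if line.rstrip("\r\n") == "$$$$":
            yield record
            record = []
    if any(l.strip() for l in record):
        yield record  # trailing record that has no terminator


def read_ids(lines):
    """Collect the non-empty IDs, one per line, into a set for exact lookup."""
    return {line.strip() for line in lines if line.strip()}


def select_records(sdf_lines, wanted_ids):
    """Keep only records whose first line (assumed to be the ID) is wanted."""
    for record in iter_sdf_records(sdf_lines):
        if record[0].strip() in wanted_ids:
            yield record


def extract_to_file(id_path, sdf_path, out_path):
    """Write the matching records, unmodified, to out_path."""
    with open(id_path) as f:
        wanted = read_ids(f)
    with open(sdf_path) as sdf, open(out_path, "w") as out:
        for record in select_records(sdf, wanted):
            out.writelines(record)  # lines still carry their own '\n'
```

It could be called as `extract_to_file("exemple_data.csv", "exemple.sdf", "result.sdf")`. To get one file per molecule instead, the loop in `extract_to_file` could open a file named after `record[0].strip()` for each record.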
