Parsing 4 dataframes and a FASTA file


Question

I have four dataframes holding information on genes predicted with Augustus for two species. For each species I ran the prediction twice: once with that species' own training parameters, and once with the training parameters of the other species (the sp1 parameters for sp2, and the sp2 parameters for sp1).

Here is an example of the naming syntax to make this clearer.

0035: Lepidoptera
0042: WASP

g1.t1_0035_0035: this gene was predicted with the database of species 0035 and its own training parameters.

g1.t1_0035_0042: this gene was predicted with the database of species 0035 and the training parameters of species 0042.

g1.t1_0042_0042: this gene was predicted with the database of species 0042 and its own training parameters.

g1.t1_0042_0035: this gene was predicted with the database of species 0042 and the training parameters of species 0035.
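The naming convention above can be split mechanically. The following is a minimal sketch, assuming every ID ends with exactly two underscore-separated species codes; `split_gene_id` is a hypothetical helper name used only for illustration, not part of Augustus or any library.

```python
def split_gene_id(gene_id):
    """Split 'g1.t1_0035_0042' into ('g1.t1', '0035', '0042').

    Returns None when the ID does not have the <gene>_<db>_<params> form.
    """
    # Split from the right so only the last two fields are read as codes
    parts = gene_id.rsplit("_", 2)
    if len(parts) != 3:
        return None
    gene, db_species, param_species = parts
    return gene, db_species, param_species


print(split_gene_id("g1.t1_0035_0042"))  # → ('g1.t1', '0035', '0042')
```

Splitting from the right keeps the gene name intact even if it were to contain an underscore itself.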

Now I have 4 dataframes such as:

gene_name   scaf_name       scaf_length cov_depth       GC
g3.t1       scaffold 6      56786         79            0.39
g4.t1       scaffold 6      56786         79            0.39
g1.t1       scaffold 256    789765        86            0.42
g2.t1       scaffold 890    3456          85            0.40
g5.t1       scaffold 1234   590           90            0.41

As you can see, the gene names do not carry the _number1_number2 suffix, but each file corresponds to one specific case. Here are the file names:

ggf_0042_0042.csv for all the genex_0042_0042
ggf_0042_0035.csv for all the genex_0042_0035
ggf_0035_0035.csv for all the genex_0035_0035
ggf_0035_0042.csv for all the genex_0035_0042

What I would actually like is simply to parse a FASTA file, for example:

>g13600.t1_0042_0042
MERVINTQLLRYLEDHQLISDRQYGFR...
>g34744.t1_0042_0035
MSVPAHVAQIFEAIRRSGQQIDED...
>g28436.t1_0035_0042
WKKAKAENALDSYHHNHLMSEE...
>g14327.t1_0042_0042
MTYGAETWSLTVGLVRKLRVTQR...
>g30148.t1_0035_0042
MLRPVLSSKLPTNTKLRVYKTYIRSRLTY...
>g24481.t1_0035_0035
PCAGSNIKLKGTECFEKSFEVCLRNY...

The rule would be: if the gene name contains the suffix _0035_0035, go into the file ggf_0035_0035.csv, grab the row with the same gene name, and add that row to a new dataframe.

Here is a hypothetical example of the output:

gene_name               scaf_name       scaf_length   cov_depth       GC
g345.t1_0035_0035       scaffold 567      56778         78            0.39
g23.t1_0042_0035        scaffold 43       434           79            0.43
g46.t1_0042_0042        scaffold 276      785660        87            0.41
g2.t1_0042_0035         scaffold 845      345656        87            0.40

And so on...

Recommended answer

Use Biopython

import pandas as pd
from Bio import SeqIO

First, create a dictionary to cache the dataframes

ggf = {}

Now iterate over the records

for record in SeqIO.parse("example.fasta", "fasta"):
    id_ = record.id

Try to match the expected ID format

    parts = id_.split('_')
    if len(parts) != 3:
        continue

Check whether you have already loaded the matching file, and load it if not

    if (parts[1], parts[2]) not in ggf:
        f_name = '_'.join(['ggf', parts[1], parts[2]]) + '.csv'
        ggf[(parts[1], parts[2])] = pd.read_csv(f_name)

Now just use

    df = ggf[(parts[1], parts[2])]
    df[df.gene_name == parts[0]]
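Putting the steps above together, here is one way the matched rows could be collected into a single new dataframe. This is a sketch under stated assumptions, not a definitive implementation: it assumes the CSV files are comma-separated, have a `gene_name` column, and sit in one directory; `read_fasta_ids` and `collect_rows` are hypothetical helper names. It also reads the FASTA headers directly instead of going through `SeqIO.parse`, only to keep the sketch free of the Biopython dependency.

```python
import os

import pandas as pd


def read_fasta_ids(fasta_path):
    """Yield the ID (first word after '>') of each FASTA header line."""
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    yield fields[0]


def collect_rows(fasta_path, csv_dir="."):
    """Build one dataframe holding the CSV row matching each FASTA ID."""
    tables = {}  # cache: (db_species, param_species) -> dataframe
    rows = []
    for gene_id in read_fasta_ids(fasta_path):
        parts = gene_id.rsplit("_", 2)
        if len(parts) != 3:
            continue  # ID does not follow <gene>_<db>_<params>
        gene, key = parts[0], (parts[1], parts[2])
        if key not in tables:
            # Assumption: files are named ggf_<db>_<params>.csv
            f_name = "_".join(["ggf", key[0], key[1]]) + ".csv"
            tables[key] = pd.read_csv(os.path.join(csv_dir, f_name))
        df = tables[key]
        match = df[df["gene_name"] == gene].copy()
        # Restore the full ID so rows from different files stay distinguishable
        match["gene_name"] = gene_id
        rows.append(match)
    if not rows:
        return pd.DataFrame()
    return pd.concat(rows, ignore_index=True)
```

Calling `collect_rows("example.fasta")` would return the rows in the order the IDs appear in the FASTA file. IDs with no matching row contribute nothing, and a missing CSV file raises `FileNotFoundError` from `pd.read_csv`.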
