Parsing 4 dataframes and a FASTA file
Question
I have 4 different dataframes containing information on genes predicted with Augustus for 2 different species. Within these species, I also ran the database of sp2 with the training parameters of sp1, and the database of sp1 with the training parameters of sp2.
Here is an example of the naming syntax, to make this easier to follow.
0035: Lepidoptera
0042: WASP
g1.t1_0035_0035:
this gene was predicted with the database of species 0035 and its own training parameters.
g1.t1_0035_0042:
this gene was predicted with the database of species 0035 and the training parameters of species 0042.
g1.t1_0042_0042:
this gene was predicted with the database of species 0042 and its own training parameters.
g1.t1_0042_0035:
this gene was predicted with the database of species 0042 and the training parameters of species 0035.
I now have 4 dataframes that look like this:
gene_name scaf_name scaf_length cov_depth GC
g3.t1 scaffold 6 56786 79 0.39
g4.t1 scaffold 6 56786 79 0.39
g1.t1 scaffold 256 789765 86 0.42
g2.t1 scaffold 890 3456 85 0.40
g5.t1 scaffold 1234 590 90 0.41
As you can see, the gene names do not carry the _number1_number2 suffix, but each file corresponds to one specific case. Here are the file names:
ggf_0042_0042.csv for all the genex_0042_0042
ggf_0042_0035.csv for all the genex_0042_0035
ggf_0035_0035.csv for all the genex_0035_0035
ggf_0035_0042.csv for all the genex_0035_0042
What I would like to do is simply parse a FASTA file, for example:
>g13600.t1_0042_0042
MERVINTQLLRYLEDHQLISDRQYGFR...
>g34744.t1_0042_0035
MSVPAHVAQIFEAIRRSGQQIDED...
>g28436.t1_0035_0042
WKKAKAENALDSYHHNHLMSEE...
>g14327.t1_0042_0042
MTYGAETWSLTVGLVRKLRVTQR...
>g30148.t1_0035_0042
MLRPVLSSKLPTNTKLRVYKTYIRSRLTY...
>g24481.t1_0035_0035
PCAGSNIKLKGTECFEKSFEVCLRNY...
and say: if the gene name contains the suffix _0035_0035, then go into the file ggf_0035_0035.csv, grab the row with the same gene name, and add that row to a new dataframe.
Here is a hypothetical example of the output:
gene_name scaf_name scaf_length cov_depth GC
g345.t1_0035_0035 scaffold 567 56778 78 0.39
g23.t1_0042_0035 scaffold 43 434 79 0.43
g46.t1_0042_0042 scaffold 276 785660 87 0.41
g2.t1_0042_0035 scaffold 845 345656 87 0.40
And so on...
Recommended answer
Using Biopython (with pandas to read the CSV files):
from Bio import SeqIO
import pandas as pd
First, create a dictionary:
ggf = {}
Now loop over the records:
for record in SeqIO.parse("example.fasta", "fasta"):
    id_ = record.id
Try to match the expected form (skip IDs that do not split into three parts):
    parts = id_.split('_')
    if len(parts) != 3:
        continue
See if you have already parsed the matching file, and load it if not:
    if (parts[1], parts[2]) not in ggf:
        f_name = '_'.join(['ggf', parts[1], parts[2]]) + '.csv'
        ggf[(parts[1], parts[2])] = pd.read_csv(f_name)
Now simply use:
    df = ggf[(parts[1], parts[2])]
    df[df.gene_name == parts[0]]
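The steps in this answer can be combined into a single function that also collects the matching rows and restores the suffix in gene_name. The following is a minimal sketch, not the answer's own code: it swaps Biopython and pandas for the standard library (the csv module and plain header parsing) so that it runs without extra packages. It assumes the ggf_*.csv files are comma-separated and start with the header row shown in the question. The names fasta_ids, load_table, collect_rows and the csv_dir parameter are hypothetical.

```python
import csv
from pathlib import Path


def fasta_ids(path):
    """Yield each record ID: the text after '>' up to the first whitespace."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    yield fields[0]


def load_table(path):
    """Read one ggf_*.csv file into a dict keyed by gene_name.

    Assumption: comma-separated file with a 'gene_name' header column.
    """
    with open(path, newline="") as handle:
        return {row["gene_name"]: row for row in csv.DictReader(handle)}


def collect_rows(fasta_path, csv_dir="."):
    """Return one row per FASTA ID found in its matching ggf_*.csv file."""
    tables = {}  # cache: (database species, training species) -> table
    rows = []
    for full_id in fasta_ids(fasta_path):
        parts = full_id.split("_")
        if len(parts) != 3:
            continue  # ID does not look like gene_db_params
        gene, key = parts[0], (parts[1], parts[2])
        if key not in tables:
            file_name = f"ggf_{key[0]}_{key[1]}.csv"
            tables[key] = load_table(Path(csv_dir) / file_name)
        row = tables[key].get(gene)
        if row is not None:
            # Keep the full suffixed ID so rows from different files stay distinct.
            rows.append({**row, "gene_name": full_id})
    return rows
```

Calling collect_rows("example.fasta") returns a list of dicts whose values are strings. If pandas is available, pd.DataFrame(rows) turns that list into a dataframe with the layout requested in the question. Genes missing from their CSV file are skipped silently here; raising an error instead may be preferable when every ID is expected to match.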