从蛋白质数据库的蛋白质序列比对宇宙或UNIPROT [英] Protein Sequence Alignment from Protein Databank to Cosmic or Uniprot

查看:481
本文介绍了从蛋白质数据库的蛋白质序列比对宇宙或UNIPROT的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想PDB文件从蛋白质数据库中显示宇宙或UNIPROT匹配,以规范的AA序列的蛋白质。具体而言,我需要做的是从PDB文件,骨干层碳原子的α和他们的某某位置上拉。我还需要拉他们的实际顺序中的蛋白质序列。对于结构3GFT(KRAS - UniProt登录号P01116),这很简单,我可以只取ResSeq数量。然而,对于一些其他蛋白质,我想不通这是怎么可能的。

例如,对于结构(2ZHQ)(蛋白F 2 - UniProt登录号P00734),所述SEQRES具有重复数1和14的ResSeq号码和唯一的不同之处的I code条目。另外在I code项不在的词素文字顺序,因此很难说什么样的顺序来提取。

据得到的更加糟糕,如果你考虑结构3V5Q(UniProt登录号Q16288)。对于大多数的蛋白质,所述ResSeq号码相匹配的实际氨基酸从像COSMIC或UNIPROT源。 Howver位置711后,它跳转到位置730。当看到备注465(缺少原子),它表明了A链,726-729人失踪。然而匹配它的蛋白质后,这些AA实际上是712-715。

我已经把它贴code,它的工作原理来回回简单3GFT的例子,但如果有人在PDB文件的专家,可以帮我把剩下的就想通了,我将非常感激。

 库(GDATA)

#答案&所述;  -  get.positions(http://www.pdb.org/pdb/files/2ZHQ.pdb,L)
答案<  -  get.positions(http://www.pdb.org/pdb/files/3GFT.pdb,A)


#这个函数读取PDB文件,并返回相应的数据结构
get.positions<  - 功能(的资源文件,chain_required =A){
N'LT;  -  10 ^ 5
AACount<  -  0
位置= data.frame(渣=代表(NA,N),AtomCount =代表(0,N),侧链=代表(NA,N),XCOORD =代表(0,N),YCOORD =代表(0,N) ,ZCoord =代表(0,N),stringsAsFactors = FALSE)

走访=名单()
Filedata上<  -  readlines方法(的资源文件中,n = -1)
对于(i的1:长度(Filedata上)){
输入= Filedata上[我]
的id = SUBSTR(输入,1,4)
如果(ID ==ATOM){
  类型= SUBSTR(输入,14,15)
  如果(类型==CA){
    resSerial = strtoi(SUBSTR(输入,7,11))
    残余物= SUBSTR(输入,18,20)
    type_of_chain = SUBSTR(输入,22,22)
    resSeq = strtoi(SUBSTR(输入,23,26))
    altLoc = SUBSTR(输入,17,17)

    如果(resSeq> = 1){#does不包括负面的残留物
      如果(type_of_chain == chain_required和放大器;&安培;!(resSerial%2%访问)及和放大器; altLoc(==|| altLoc ==A)){
        参观<  -  C(参观,resSerial)
        AACount<  -  AACount + 1
        position_string =名单()
        position_string [[1]] = as.numeric(SUBSTR(输入,31,38))
        position_string [[2] = as.numeric(SUBSTR(输入,39,46))
        position_string [[3]] = as.numeric(SUBSTR(输入,47,54))
        #PRINT(输入)
        位置[AACount,]其中;  -  C(残基,resSeq,type_of_chain,position_string [[1]],position_string [[2]],position_string [[3]])
      }
    }
  }
}

  }
  位置与所述;  - 位[1:AACount,]
  位置[2]其中;  -  as.numeric(位置[2])
  位置[,4]其中;  -  as.numeric(位置[1,4])
  位置[1,5]&其中;  -  as.numeric(位置[1,5])
  位置[1,6]&其中;  -  as.numeric(位置[1,6])
  返回(位置)
}
 

解决方案

下面是一种方法。它要求bio3dř包( http://thegrantlab.org/bio3d/ 的)和肌肉对准可执行是安装。我跟着指示为其他公用事业这里 http://thegrantlab.org/bio3d/tutorials/installing -bio3d

这个例子code出现下面来为您列出的三种情况。

 库(bio3d)

##下载PDB文件与给定的'身份证'(也可以从网上阅读)
ID<  - 3GFT#3V5Q
file.name&所述;  -  get.pdb(id)的
PDB<  -  read.pdb(file.name)
pdb.seq<  -  pdbseq(PDB,atom.select(PDB,链=A,elety =CA))

##获取的UniProt标识符和序列
pdb.ano&所述;  -  pdb.annotate(ID,DB_ID)
uni.seq<  -  get.seq(pdb.ano)

##对齐序列定义corespondences
氮化铝<  -  seqaln(seqbind(pdb.seq,uni.seq),ID = C(file.name,不公开(pdb.ano)))

##阅读对齐坐标数据(所有你想要的信息在这里)
PDBS<  -  read.fasta.pdb(ALN)

answer2<  -  cbind(1:NCOL(PDBS $阿里),T(PDBS $阿里)
            PDBS $ resno [1],矩阵(PDBS $ XYZ [1],NCOL = 3,byrow = T))

头(answer2)

[1,]1的M的M,1,62.935,97.579,30.223
[2,]2的T的T,2,63.155,95.525,27.079
[3]3EE365.28996.89524.308
[4,]4的Y,Y464.899,96.22,20.615
[5,1]5的K,K567.593,96.715,18.023
[6,]6,L,L,6,65.898,97.863,14.816
 

如果你希望你的氨基酸的3字母code列有在bio3d一个aa321()函数。

I would like to match up PDB files from the Protein Databank to canonical AA sequences for the protein as displayed in Cosmic or Uniprot. Specifically, what I need to do is pull from the pdb file, the carbon alpha atoms in the backbone and their xyz positions. I also need to pull their actual order in the proteins sequence. For structure 3GFT (Kras - Uniprot Accession Number P01116), this is easy, I can just take the ResSeq number. However, for some other proteins, I can't figure out how this is possible.

For example, for structure (2ZHQ) (protein F2 - Uniprot Accession Number P00734), the Seqres has the ResSeq numbers repeated for numbers "1" and "14" and only differs in the Icode entry. Further the icode entries are not in lexographic order so it's hard to tell what order to extract.

It get's even worse if you consider structure 3V5Q (Uniprot Accession Number Q16288). For most of the protein, the ResSeq number matches the actual amino acid from a source like COSMIC or UNIPROT. Howver after Position 711, it jumps to position 730. When looking at REMARK 465 (the missing atoms), it shows that for chain A , 726-729 are missing. However after matching it up to the protein, those AA actually are 712-715.

I've attached code that works fro the simple 3GFT example but if someone is an expert in pdb files and can help me get the rest of it figured out, I would be much obliged.

library(gdata)

#answer<- get.positions("http://www.pdb.org/pdb/files/2ZHQ.pdb","L")
answer<- get.positions("http://www.pdb.org/pdb/files/3GFT.pdb","A")


#This function reads a pdb file and returns the appropriate data structure
get.positions <- function(sourcefile, chain_required = "A"){
N <- 10^5
AACount <- 0
positions = data.frame(Residue=rep(NA, N),AtomCount=rep(0, N),SideChain=rep(NA, N),XCoord=rep(0, N),YCoord=rep(0, N),ZCoord=rep(0, N),stringsAsFactors=FALSE)     

visited = list()
filedata <- readLines(sourcefile, n= -1)
for(i in 1: length(filedata)){
input = filedata[i]
id = substr(input,1,4)
if(id == "ATOM"){
  type = substr(input,14,15)
  if(type == "CA"){
    resSerial = strtoi(substr(input, 7,11))
    residue = substr(input,18,20)
    type_of_chain = substr(input,22,22)
    resSeq = strtoi(substr(input, 23,26))
    altLoc = substr(input,17,17)

    if(resSeq >=1){ #does not include negative residues
      if(type_of_chain == chain_required && !(resSerial %in% visited)  && (altLoc == " " || altLoc == "A") ){
        visited <- c(visited, resSerial)
        AACount <- AACount + 1
        position_string =list()
        position_string[[1]]= as.numeric(substr(input,31,38))
        position_string[[2]] =as.numeric(substr(input,39,46))
        position_string[[3]] =as.numeric(substr(input,47,54))
        #print(input)
        positions[AACount,]<- c(residue, resSeq, type_of_chain, position_string[[1]], position_string[[2]], position_string[[3]])
      }
    }
  }
}

  } 
  positions<-positions[1:AACount,]
  positions[,2]<- as.numeric(positions[,2])
  positions[,4]<- as.numeric(positions[,4])
  positions[,5]<- as.numeric(positions[,5])
  positions[,6]<- as.numeric(positions[,6])
  return (positions)
}

解决方案

Here is one way. It requires the bio3d R package ( http://thegrantlab.org/bio3d/ ) and the muscle alignment executable be installed. I followed the instructions for 'Additional utilities' here http://thegrantlab.org/bio3d/tutorials/installing-bio3d

The example code below appears to work for the three cases you listed.

library(bio3d) 

## Download PDB file with given 'id' (can also read from online)
id <-  "3GFT" #"3V5Q"
file.name <- get.pdb(id)
pdb <- read.pdb(file.name)
pdb.seq <- pdbseq(pdb, atom.select(pdb, chain="A", elety="CA"))

## Get UniProt identifier and sequence
pdb.ano <- pdb.annotate(id, "db_id")
uni.seq <- get.seq(pdb.ano)

## Align sequences to define corespondences
aln <- seqaln( seqbind( pdb.seq, uni.seq), id=c(file.name, unlist(pdb.ano)) )

## Read aligned coordinate data (all the info you want is in here)
pdbs <- read.fasta.pdb(aln)

answer2 <- cbind( 1:ncol(pdbs$ali), t(pdbs$ali), 
            pdbs$resno[1,], matrix(pdbs$xyz[1,], ncol=3, byrow=T) ) 

head(answer2)

[1,] "1" "M"        "M"    "1" "62.935" "97.579" "30.223"
[2,] "2" "T"        "T"    "2" "63.155" "95.525" "27.079"
[3,] "3" "E"        "E"    "3" "65.289" "96.895" "24.308"
[4,] "4" "Y"        "Y"    "4" "64.899" "96.22"  "20.615"
[5,] "5" "K"        "K"    "5" "67.593" "96.715" "18.023"
[6,] "6" "L"        "L"    "6" "65.898" "97.863" "14.816"

There is a aa321() function in bio3d if you want your amino acids listed in 3 letter code.

这篇关于从蛋白质数据库的蛋白质序列比对宇宙或UNIPROT的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆