Generating DNA codon combinations in R

Problem description

I am generating random DNA sequences in R, where each sequence has a fixed length and follows a user-specified distribution of nucleotides.

I want to ensure that certain runs of nucleotides are NOT generated in a given sequence. The disallowed runs are: "aga", "agg", "taa", "tag" and "tga".

Here is my code, which simply generates sequences in which the above runs MAY occur. I am unsure how best to modify it to account for the "tabu" runs specified above.

library(ape)

length.seqs <- 100 # length of DNA sequence
nucl.freqs <- rep(1/4, 4) # nucleotide frequencies

# DNA alphabet
nucl <- as.DNAbin(c('a', 'c', 'g', 't')) # A, C, G, T

# Randomly sample nucleotides
seqs <- sample(nucl, size = length.seqs, replace = TRUE, prob = nucl.freqs) 

I am thinking of simply listing all the allowed runs, using that list in place of 'nucl', and specifying 'size' = length.seqs / 3 within the sample() function, but this seems cumbersome, even with shortcuts like 'expand.grid()'.
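The codon-listing idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses plain character vectors rather than 'DNAbin' objects, it assumes the sequence length is a multiple of 3 (99 here instead of 100), and the names 'grid', 'codons', 'allowed' and 's_codons' are invented for this example. Each codon is weighted by the product of its nucleotide frequencies.

```r
# Sketch only: base R, no ape dependency; variable names are illustrative.
set.seed(1)  # for a reproducible example

length.seqs <- 99                       # assumed to be a multiple of 3
nucl <- c("a", "c", "g", "t")
nucl.freqs <- rep(1/4, 4)
names(nucl.freqs) <- nucl               # allow lookup of a frequency by nucleotide
bad_codons <- c("aga", "agg", "taa", "tag", "tga")

# All 64 triplets, one row per codon
grid <- expand.grid(p1 = nucl, p2 = nucl, p3 = nucl, stringsAsFactors = FALSE)
codons <- paste0(grid$p1, grid$p2, grid$p3)

# Weight each codon by the product of its nucleotide frequencies
codon.probs <- nucl.freqs[grid$p1] * nucl.freqs[grid$p2] * nucl.freqs[grid$p3]

# Drop the disallowed codons; sample() rescales the remaining weights
keep <- !(codons %in% bad_codons)
allowed <- codons[keep]
allowed.probs <- codon.probs[keep]

codon.seq <- sample(allowed, size = length.seqs / 3, replace = TRUE,
                    prob = allowed.probs)
s_codons <- paste(codon.seq, collapse = "")
```

Note that this only keeps the disallowed triplets out of the reading frame that starts at position 1. A run such as "tag" can still appear across a codon boundary (for example "cta" followed by "gcc"), so this sketch fits only if the restriction applies to in-frame codons.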

Recommended answer

You could regex your way to it like this:

length.seqs <- 100 # length of DNA sequence
nucl.freqs <- rep(1/4, 4) # nucleotide frequencies
nucl <- c('a', 'c', 'g', 't') # A, C, G, T

seqs <- sample(nucl, size = length.seqs, replace = TRUE, prob = nucl.freqs)

bad_codons <- c("aga", "agg", "taa", "tag", "tga")

regx <- paste0("(", paste(bad_codons, collapse = ")|("), ")")

s <- paste(seqs, collapse = "")

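# Keep replacing bad codons until none remain.
# Note: within one pass, gsub() substitutes the same freshly sampled
# triplet for every match, because the replacement is evaluated once.
while (grepl(regx, s)) {
  s <- gsub(regx,
            paste(sample(nucl, size = 3, replace = TRUE, prob = nucl.freqs), collapse = ""),
            s)
}

s
grepl(regx, s)  # should be FALSE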

The idea is to replace the bad codons with freshly sampled triplets until no bad codons remain. If you need performance on long sequences or on a large number of sequences, this might not be a good route.
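If performance does become a concern, one alternative (not part of the answer above, and only a sketch) is to build the sequence left to right and exclude, at each position, any nucleotide that would complete a disallowed triplet. The helper name 'generate_clean_seq' and its arguments are hypothetical, and the function assumes every disallowed run has length 3. As with the replacement loop, the resulting nucleotide frequencies will deviate somewhat from the requested ones.

```r
# Hypothetical helper, not from the original answer.
# Assumes every element of 'bad' is exactly 3 characters long.
generate_clean_seq <- function(n, nucl, probs, bad) {
  out <- character(n)
  for (i in seq_len(n)) {
    ok <- rep(TRUE, length(nucl))
    if (i >= 3) {
      # Exclude nucleotides that would complete a disallowed triplet
      prefix <- paste0(out[i - 2], out[i - 1])
      ok <- !(paste0(prefix, nucl) %in% bad)
      if (!any(ok)) stop("no nucleotide can follow '", prefix, "'")
    }
    cand <- nucl[ok]
    # Sampling an index avoids sample()'s special case for length-1 input
    out[i] <- cand[sample.int(length(cand), size = 1, prob = probs[ok])]
  }
  paste(out, collapse = "")
}

set.seed(42)  # for a reproducible example
s_clean <- generate_clean_seq(
  n     = 100,
  nucl  = c("a", "c", "g", "t"),
  probs = rep(1/4, 4),
  bad   = c("aga", "agg", "taa", "tag", "tga")
)
grepl("aga|agg|taa|tag|tga", s_clean)  # FALSE by construction
```

This makes a single pass over the sequence and checks all three reading frames, since every newly added nucleotide is tested against the two preceding ones. For the five runs listed in the question, at least two nucleotides always remain available after any two-letter prefix, so the stop() branch is not reached.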
