查找字符串序列之间的汉明距离 [英] Find the Hamming distance between string sequences
问题描述
我有一个3156个DNA序列的数据集,每个序列都有98290个字符(SNP),包括(通常)5个符号: A,C,G,T,N (缺口).
I have a dataset of 3156 DNA sequences, each of which has 98290 characters (SNPs), comprising the (usual) 5 symbols : A, C, G, T, N (gap).
找到这些序列之间的成对汉明距离的最佳方法是什么?
请注意,对于每个序列,我实际上想查找序列数量(包括其自身)的倒数,其中每个站点的汉明距离小于某个阈值(在此示例中为0.1).
到目前为止,我已经尝试了以下操作:
So far, I have attempted the following:
library(doParallel)
registerDoParallel(cores=8)
result <- foreach(i = 1:3156) %dopar% {
temp <- 1/sum(sapply(snpdat, function(x) sum(x != snpdat[[i]])/98290 < 0.1))
}
snpdat
是list
变量,其中snpdat[[i]]
包含第i
个DNA序列.
snpdat
is a list
variable where snpdat[[i]]
contains the i
th DNA sequence.
在具有16GB内存的i7-4790核心上运行大约需要36分钟.
我还尝试使用stringdist
程序包,这会花费更多时间来生成相同的结果.
This takes around 36 minutes to run on a core i7 - 4790 with 16GB ram.
I also tried using the stringdist
package, which takes more time to generate the same result.
我们非常感谢您的帮助!
Any help is highly appreciated!
推荐答案
我不确定这是否是最佳解决方案,但是使用Rcpp,我能够将运行时间缩短到15分钟左右.我将在此处编写代码,以防某天有人觉得它有用...
I am not sure if this is the most optimal solution, but I was able to bring the run time down to around 15 minutes using Rcpp. I'll write the code here in case someone might find it useful someday...
这是C ++代码(我使用了 Sugar 运算符这里)...
This is the C++ code (I have used Sugar operators here)...
#include <Rcpp.h>
using namespace Rcpp;
double test5(const List& C, const int& x){
double HD;
for(int i = 0; i < 3156; i++) if(sum(CharacterVector(C[x])!=CharacterVector(C[i])) < 9829) HD++;
return HD;
}
编译后:
library(Rcpp)
sourceCpp("hd_code.cpp")
我只是从R中调用此函数:
I simply call this function from R:
library(foreach)
library(doParallel)
registerDoParallel(cores = 8)
t =Sys.time()
bla = foreach(i = 1:3156, .combine = "c") %dopar% test5(snpdat,i-1)
Sys.time() - t
有人能想到一种更快的方法吗?
Can anyone think of an even quicker way to do this?
这篇关于查找字符串序列之间的汉明距离的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!