查找字符串序列之间的汉明距离 [英] Find the Hamming distance between string sequences

查看:128
本文介绍了查找字符串序列之间的汉明距离的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个3156个DNA序列的数据集,每个序列都有98290个字符(SNP),包括(通常)5个符号: A,C,G,T,N (缺口).

I have a dataset of 3156 DNA sequences, each of which has 98290 characters (SNPs), comprising the (usual) 5 symbols : A, C, G, T, N (gap).

找到这些序列之间的成对汉明距离的最佳方法是什么?

请注意,对于每个序列,我实际上想查找序列数量(包括其自身)的倒数,其中每个站点的汉明距离小于某个阈值(在此示例中为0.1).

到目前为止,我已经尝试了以下操作:

So far, I have attempted the following:

library(doParallel)
registerDoParallel(cores=8)
result <- foreach(i = 1:3156) %dopar% {
 temp <- 1/sum(sapply(snpdat, function(x) sum(x != snpdat[[i]])/98290 < 0.1))
}

snpdatlist变量,其中snpdat[[i]]包含第i个DNA序列.

snpdat is a list variable where snpdat[[i]] contains the ith DNA sequence.

在具有16GB内存的i7-4790核心上运行大约需要36分钟. 我还尝试使用stringdist程序包,这会花费更多时间来生成相同的结果.

This takes around 36 minutes to run on a core i7 - 4790 with 16GB ram. I also tried using the stringdist package, which takes more time to generate the same result.

我们非常感谢您的帮助!

Any help is highly appreciated!

推荐答案

我不确定这是否是最佳解决方案,但是使用Rcpp,我能够将运行时间缩短到15分钟左右.我将在此处编写代码,以防某天有人觉得它有用...

I am not sure if this is the most optimal solution, but I was able to bring the run time down to around 15 minutes using Rcpp. I'll write the code here in case someone might find it useful someday...

这是C ++代码(我使用了 Sugar 运算符这里)...

This is the C++ code (I have used Sugar operators here)...

#include <Rcpp.h>
using namespace Rcpp;

double test5(const List& C, const int& x){
  double HD;
  for(int i = 0; i < 3156; i++) if(sum(CharacterVector(C[x])!=CharacterVector(C[i])) < 9829) HD++;
  return HD;
}

编译后:

library(Rcpp)
sourceCpp("hd_code.cpp")

我只是从R中调用此函数:

I simply call this function from R:

library(foreach)
library(doParallel)
registerDoParallel(cores = 8)
t =Sys.time()
bla = foreach(i = 1:3156, .combine = "c") %dopar% test5(snpdat,i-1)
Sys.time() - t

有人能想到一种更快的方法吗?

Can anyone think of an even quicker way to do this?

这篇关于查找字符串序列之间的汉明距离的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆