Chaos game for DNA sequences


Question

I have tried the Mathematica code for the chaos game representation of DNA sequences posted at this address: http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html

It goes like this:

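(* Read the sequence; backslashes inside a string literal must be escaped *)
genome = Import["c:\\data\\sequence.fasta", "Sequence"];
genome = StringReplace[ToString[genome], {"{" -> "", "}" -> ""}];
(* Keep only the four nucleotide letters *)
chars = StringCases[genome, "G" | "C" | "T" | "A"];
(* Each letter moves the current point halfway toward its corner of the unit square *)
f[x_, "A"] := x/2;
f[x_, "T"] := x/2 + {1/2, 0};
f[x_, "G"] := x/2 + {1/2, 1/2};
f[x_, "C"] := x/2 + {0, 1/2};
(* Iterate from the centre of the square and plot every visited point *)
pts = FoldList[f, {0.5, 0.5}, chars];
Graphics[{PointSize[Tiny], Point[pts]}]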

The FASTA sequence that I have is just a sequence of letters like AACCTTTGATCAAA, and the generated graph looks like this:

The code works fine with small sequences, but when I feed it a huge sequence, for example almost 40 MB of a chromosome, the program takes a long time and displays only a black square, which is impossible to analyze. Is it possible to improve the code so that the square in which the plot is displayed is bigger? Note that the plot must cover only the unit square. Thanks in advance for your help.

Recommended answer

Summary of the incremental edits below:

Using compiled code gives a considerable speedup in computing the point coordinates (about 50x, excluding the time to compute shifts):

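(* Map each letter to its shift vector; every entry is a machine real *)
shifts = chars /. {"A" -> {0., 0.}, "T" -> {.5, 0.}, "G" -> {.5, .5}, "C" -> {0., .5}};
(* The x and y coordinates evolve independently, so each is folded separately *)
fun1d = Compile[{{a, _Real, 1}}, FoldList[#/2 + #2 &, .5, a], CompilationTarget -> "C"]
pts = Transpose[fun1d /@ Transpose[shifts]];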

The real bottleneck in your code is rendering the graphic. Instead of plotting each point, we'll visualize the density of points:

threshold = 1;
With[{size = 300}, 
 Image[1 - UnitStep[BinCounts[pts, 1/size, 1/size] - threshold]]
]

A region is coloured black if it contains at least threshold points; size is the image dimension in pixels. Choosing either a large size or a large threshold avoids the "black square problem".
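To pick size and threshold with less trial and error, one can measure what fraction of the pixels would end up black. The following is only a sketch: blackFraction is a hypothetical helper name (it is not part of the code above), and it assumes explicit {min, max, width} bin specifications so that the grid always covers exactly the unit square.

```mathematica
(* Hypothetical helper (not from the answer's code): fraction of pixels
   that would be painted black for a given image size and threshold. *)
blackFraction[pts_, size_, threshold_] :=
 N@Mean@Flatten@UnitStep[
     BinCounts[pts, {0, 1, 1/size}, {0, 1, 1/size}] - threshold]

(* Toy data: two points share the lower-left cell, one lies in the upper-right cell *)
toy = {{0.1, 0.1}, {0.1, 0.1}, {0.9, 0.9}};
blackFraction[toy, 2, 1]  (* 0.5: two of the four cells hold at least 1 point *)
blackFraction[toy, 2, 2]  (* 0.25: only one cell holds at least 2 points *)
```

A value close to 1 means the image will be almost entirely black, so size or threshold should be increased.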

My original answer, with more details:

On my rather dated machine, the code is not very slow.

chars = RandomChoice[{"A", "T", "C", "G"}, 800000];

f[x_, "A"] := x/2;
f[x_, "T"] := x/2 + {1/2, 0};
f[x_, "G"] := x/2 + {1/2, 1/2};
f[x_, "C"] := x/2 + {0, 1/2};
Timing[pts = FoldList[f, {0.5, 0.5}, chars];]
Graphics[{PointSize[Tiny], Point[pts]}]

I get a timing of 6.8 seconds, which is usable unless you need to run it many times in a loop (if it's not fast enough for your use case and machine, please add a comment, and we'll try to speed it up).

Unfortunately, rendering the graphic takes much longer than this (36 seconds), and I don't know if there's anything you can do about it. Disabling antialiasing may help a little, depending on your platform, but not much: Style[Graphics[{PointSize[Tiny], Point[pts]}], Antialiasing -> False] (for me it makes no difference). This is a long-standing annoyance for many of us.

Regarding the whole graphic being black: you can resize it with your mouse to make it bigger. The next time you evaluate your expression, the output graphic will remember its size. Or just use ImageSize -> 800 as a Graphics option. Given the pixel density of screens, the only other solution I can think of (one that doesn't involve resizing the graphic) is to plot the density of points, representing it with shades of grey.

This is how you can plot the density (it is also much, much faster to compute and render than the point plot):

With[{resolution = 0.01}, 
 ArrayPlot@BinCounts[pts, resolution, resolution]
]

Play with the resolution until the plot looks good.

For my random-sequence example, this gives only a uniform grey plot. For your genome data it will probably give a more interesting pattern.
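When a few cells hold far more points than the rest, raw counts can wash out the fainter structure. Here is a minimal sketch, assuming a logarithmic rescaling of the counts is acceptable for your analysis. logDensityPlot is a hypothetical name, the random points are only a stand-in for real chaos game coordinates, and the Reverse/Transpose step is there so that the orientation matches the Graphics point plot (x to the right, y upward).

```mathematica
(* Hypothetical helper (not from the answer's code): density plot with
   logarithmically rescaled counts on a size x size grid over the unit square. *)
logDensityPlot[pts_, size_] := Module[{counts},
  counts = BinCounts[pts, {0, 1, 1/size}, {0, 1, 1/size}];
  (* counts[[i, j]] has i along x and j along y; ArrayPlot draws the first row
     at the top, so transpose and reverse the rows to put y = 0 at the bottom *)
  ArrayPlot[Reverse[Transpose[Log[1. + counts]]], Frame -> False]
  ]

(* Stand-in data; replace with the pts computed from your sequence *)
samplePts = RandomReal[1, {10^5, 2}];
logDensityPlot[samplePts, 200]
```

Log[1. + counts] keeps empty cells at zero, so they stay white under the default ArrayPlot colouring.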

Edit 2:

Here's a simple way to speed up the function using compilation:

First, replace the characters with the shift vectors (this has to be done only once per dataset, and you can then save the result):

arr = chars /. {"A" -> {0., 0.}, "T" -> {.5, 0.}, "G" -> {.5, .5}, "C" -> {0., .5}};

Then let's compile our function:

fun = Compile[{{a, _Real, 2}}, FoldList[#/2 + #2 &, {.5, .5}, a], 
 CompilationTarget -> "C"]

Remove CompilationTarget if your version of Mathematica is earlier than 8 or you don't have a C compiler installed.

fun[arr]; // Timing

This gives me 0.6 seconds, which is an instant 10x speedup.
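Before relying on the faster version, it may be worth checking that it reproduces the uncompiled recursion. The sketch below does this on a short random sequence. The names fRef, funCheck, testChars and testArr are hypothetical, chosen so that nothing defined earlier is overwritten, and CompilationTarget is left out so that the check also runs without a C compiler.

```mathematica
(* Uncompiled reference with the same rules as the question's f *)
fRef[x_, "A"] := x/2;
fRef[x_, "T"] := x/2 + {1/2, 0};
fRef[x_, "G"] := x/2 + {1/2, 1/2};
fRef[x_, "C"] := x/2 + {0, 1/2};

(* Same body as the compiled function, without CompilationTarget *)
funCheck = Compile[{{a, _Real, 2}}, FoldList[#/2 + #2 &, {.5, .5}, a]];

testChars = RandomChoice[{"A", "T", "C", "G"}, 1000];
testArr = testChars /. {"A" -> {0., 0.}, "T" -> {.5, 0.},
    "G" -> {.5, .5}, "C" -> {0., .5}};

(* Largest coordinate difference between the two results;
   it should be zero or at the level of rounding error *)
maxDiff = Max[Abs[funCheck[testArr] - FoldList[fRef, {0.5, 0.5}, testChars]]]
```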

Edit 3:

Another ~5x speedup over the compiled version above is possible by avoiding some kernel callbacks in the compiled function (I checked the compilation output using CompilePrint to come up with this version; otherwise it's not obvious why it's faster):

fun1d = Compile[{{a, _Real, 1}}, FoldList[#/2 + #2 &, .5, a], 
  CompilationTarget -> "C"]

arrt = Transpose[arr];
Timing[result = fun1d /@ arrt;]
pts = Transpose[result];

This runs in 0.11 seconds on my machine. On a more modern machine it should finish in a few seconds even for a 40 MB dataset.

I split off the transpositions into separate inputs because at this point the running time of fun1d becomes comparable to the running time of Transpose.
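The CompilePrint check mentioned above can be sketched as follows. This is an illustration under stated assumptions: callsKernelQ is a hypothetical helper name, the two functions are recompiled here without CompilationTarget so the snippet runs without a C compiler, and the presence of "MainEvaluate" in the printed code is taken as the sign of a callback to the kernel. The result may differ between Mathematica versions, so no output is shown.

```mathematica
Needs["CompiledFunctionTools`"]

(* The two variants discussed above, recompiled for inspection only *)
fun2dCheck = Compile[{{a, _Real, 2}}, FoldList[#/2 + #2 &, {.5, .5}, a]];
fun1dCheck = Compile[{{a, _Real, 1}}, FoldList[#/2 + #2 &, .5, a]];

(* Hypothetical helper: True if the printed compiled code
   contains a call back to the main evaluator *)
callsKernelQ[cf_CompiledFunction] :=
 ! StringFreeQ[CompilePrint[cf], "MainEvaluate"]

callsKernelQ /@ {fun2dCheck, fun1dCheck}
```

Reading the full CompilePrint output for each function shows where any callback occurs, which is more informative than the yes/no answer.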
