Need help designing a search algorithm in a more efficient way


Problem description

I have a problem from the biology domain. Right now I have 4 VERY LARGE files (each with 0.1 billion lines), but the structure is rather simple: each line of these files has only 2 fields, each representing a type of gene.

My goal is to design an efficient algorithm that achieves the following: find a circle within the contents of these 4 files. The circle is defined as:

field #1 in a line in file 1 == field #1 in a line in file 2 and
field #2 in a line in file 2 == field #1 in a line in file 3 and
field #2 in a line in file 3 == field #1 in a line in file 4 and
field #2 in a line in file 4 == field #2 in a line in file 1

I cannot think of a decent way to solve this, so I just wrote a brute-force, stupid, 4-layer nested loop for now. I'm thinking about sorting them in alphabetical order; even if that might help a little, it's also obvious that the computer's memory would not allow me to load everything at once. Can anybody tell me a good way to solve this problem that is both time and space efficient? Thanks!!

Answer

First of all, I note that you can sort a file without holding it in memory all at once, and that most operating systems have a program that does this, often called just "sort". Usually you can get it to sort on a field within a file, but if not, you can rewrite each line to get it to sort the way you want.
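To make the idea concrete, here is a minimal sketch of such an external sort in Python (the function name and the chunk size are illustrative assumptions; on Unix systems the standard `sort -t$'\t' -k1` utility does the same job). It sorts fixed-size chunks in memory, spills them to temporary files, and streams a merge of the chunks, so memory use stays bounded:

```python
# External sort sketch: sort a tab-separated file on one field with bounded
# memory, assuming each line looks like "field1<TAB>field2\n".
import heapq
import itertools
import os
import tempfile

def external_sort(in_path, out_path, key_field=0, chunk_lines=100_000):
    """Sort `in_path` by the given tab-separated field into `out_path`."""
    key = lambda line: line.split("\t")[key_field]
    chunk_paths = []
    with open(in_path) as f:
        while True:
            # Read at most chunk_lines lines; this is all we hold in memory.
            chunk = list(itertools.islice(f, chunk_lines))
            if not chunk:
                break
            chunk.sort(key=key)
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)
    # Stream-merge the sorted chunks; heapq.merge never loads them whole.
    files = [open(p) for p in chunk_paths]
    with open(out_path, "w") as out:
        out.writelines(heapq.merge(*files, key=key))
    for fobj, path in zip(files, chunk_paths):
        fobj.close()
        os.remove(path)
```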

Given this, you can connect two files by sorting them so that the first is sorted on field #1 and the second on field #2. You can then create one record for each match, combining all the fields, and only holding in memory a chunk from each file where all the fields you have sorted on have the same value. This will allow you to connect the result with another file - four such connections should solve your problem.
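The connection step above is a sort-merge join. A sketch, under the assumption that both inputs are already sorted on their join key (the last field of the left rows, the first field of the right rows); only one group of equal-key rows per input is ever held in memory:

```python
# Sort-merge join sketch: yield combined rows where the left row's last
# field equals the right row's first field. Inputs are iterables of tuples
# of strings, pre-sorted on the join key.
import itertools

def merge_join(left_rows, right_rows):
    left_groups = itertools.groupby(left_rows, key=lambda r: r[-1])
    right_groups = itertools.groupby(right_rows, key=lambda r: r[0])
    lk, lg = next(left_groups, (None, None))
    rk, rg = next(right_groups, (None, None))
    while lk is not None and rk is not None:
        if lk < rk:
            lk, lg = next(left_groups, (None, None))
        elif rk < lk:
            rk, rg = next(right_groups, (None, None))
        else:
            # Materialise only the two matching groups, not the whole files.
            lg, rg = list(lg), list(rg)
            for lrow in lg:
                for rrow in rg:
                    yield lrow + rrow
            lk, lg = next(left_groups, (None, None))
            rk, rg = next(right_groups, (None, None))
```

Chaining four such joins (re-sorting the intermediate result on the next key between steps) implements the full circle search.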

Depending on your data, the time it takes to solve your problem may depend on the order in which you make the connections. One rather naive way to make use of this is, at each stage, to take a small random sample from each file, use it to estimate how many results would follow from each possible connection, and choose the connection that produces the fewest results. One way to take a random sample of N items from a large file is to keep the first N lines of the file; then, when you have read m lines so far, read the next line and, with probability N/(m + 1), swap it for one of the N lines held, else throw it away. Keep on until you have read through the whole file.
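The sampling scheme described above is reservoir sampling; a minimal sketch (function name is illustrative):

```python
# Reservoir sampling: pick n items uniformly at random from an iterable of
# unknown length while holding only n items in memory.
import random

def reservoir_sample(lines, n):
    sample = []
    for m, line in enumerate(lines):
        if m < n:
            sample.append(line)          # fill the reservoir with the first n
        else:
            j = random.randrange(m + 1)  # keep with probability n / (m + 1)
            if j < n:
                sample[j] = line
    return sample
```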
