longest common substring between 2 HUGE files - out of memory: java heap space


Question

I'm completely brain-fried after this. I need to find the longest common substring between 2 files, a small one and a HUGE one. I don't even know where to start the search; here's what I have so far:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class MyString
{
    public static void main (String[] args) throws IOException
    {
        String md, wp;
        StringBuilder s = new StringBuilder();

        // try-with-resources closes the reader even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new FileReader("MobyDick.txt")))
        {
            String line;
            while ((line = br.readLine()) != null)
            {
                s.append(line).append(" ");
            }
        }
        md = s.toString();
        s.setLength(0);

        try (BufferedReader br2 = new BufferedReader(new FileReader("WarAndPeace.txt")))
        {
            String line;
            while ((line = br2.readLine()) != null)
            {
                s.append(line).append(" ");
            }
        }
        wp = s.toString();

        md = md.replaceAll("\\s+", " "); // collapse runs of whitespace into single spaces
        wp = wp.replaceAll("\\s+", " ");
    }
}

What I did so far was to read each file into a StringBuilder, and then into a String to get rid of double spaces (they came up a lot in MobyDick.txt). I found this code:

public static String longestSubstring(String str1, String str2) {

    if (str1 == null || str1.isEmpty() || str2 == null || str2.isEmpty())
        return "";

    StringBuilder sb = new StringBuilder();

    // ignore case
    str1 = str1.toLowerCase();
    str2 = str2.toLowerCase();

    // Java initializes the array with 0 already
    int[][] num = new int[str1.length()][str2.length()];
    int maxlen = 0;
    int lastSubsBegin = 0;

    for (int i = 0; i < str1.length(); i++) {
        for (int j = 0; j < str2.length(); j++) {
            if (str1.charAt(i) == str2.charAt(j)) {
                if ((i == 0) || (j == 0))
                    num[i][j] = 1;
                else
                    num[i][j] = 1 + num[i - 1][j - 1];

                if (num[i][j] > maxlen) {
                    maxlen = num[i][j];
                    // generate substring from str1 => i
                    int thisSubsBegin = i - num[i][j] + 1;
                    if (lastSubsBegin == thisSubsBegin) {
                        // the current LCS extends the one found last time this block ran
                        sb.append(str1.charAt(i));
                    } else {
                        // reset the string builder when a different LCS is found
                        lastSubsBegin = thisSubsBegin;
                        sb = new StringBuilder();
                        sb.append(str1.substring(lastSubsBegin, i + 1));
                    }
                }
            }
        }
    }

    return sb.toString();
}

This code helps, but only on small files; every time I run it with the big files I get an "out of memory: java heap space" error. I need the right algorithm to get around the heap-space issue, and no, I can't increase the Java heap. Can anyone help or point me in the right direction?

Answer

First you need to identify exactly why this is such a memory hog, and then you can start to work around it.

This declaration jumps out as a potential problem:

int[][] num = new int[str1.length()][str2.length()];

War and Peace is over 3 million characters long, and Moby Dick is about half that length, so we will conservatively say it's a million characters long.

You are trying to allocate space for 3,000,000 × 1,000,000 = 3,000,000,000,000 integers, each of which is 4 bytes, which works out to 12,000,000,000,000 bytes, or a bit under 11 TB.
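That estimate can be checked in a few lines (the character counts here are the rough figures above, not exact file sizes):

```java
public class DpMemoryEstimate {
    public static void main(String[] args) {
        long rows = 3_000_000L;   // ~length of War and Peace in characters
        long cols = 1_000_000L;   // ~length of Moby Dick, conservatively
        long bytesPerInt = 4L;    // a Java int is 4 bytes

        long totalBytes = rows * cols * bytesPerInt;
        double tebibytes = totalBytes / Math.pow(1024, 4);

        System.out.println(totalBytes + " bytes ~= " + tebibytes + " TiB");
        // roughly 12,000,000,000,000 bytes, just under 11 TiB
    }
}
```

Note the `long` literals: doing this multiplication in `int` would silently overflow, which is another easy way to fool yourself about allocation sizes.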

Hopefully it's clear why the algorithm is not suited for strings of this length.

Thankfully, one of the fundamental principles of computer science is that you can almost always trade time for memory and vice versa.

Instead you want to try a generalized suffix tree. This has a memory cost of Θ(n + m) and can be constructed in Θ(n + m) time, which is much more manageable.

Ukkonen's algorithm is the standard O(n) way to generate such trees, and there are excellent guides to it online.

Once you have the suffix tree in place, finding the LCS can be done in linear time by finding the deepest node in the tree whose subtree contains suffixes of both input strings. A typical strategy is to mark each node v with a flag i if it satisfies the property:

the subtree rooted at v contains a suffix of string S_i

and then find the deepest node v where v is marked with i for all i in the range (in this case, just 0 and 1).
