Out of Memory Exception in C#
Question
I'm attempting to construct a suffix trie, and due to strict requirements it must be indexed in memory.
The problem is not the tree itself, but actually the way I was reading the file.
Recommended answer
If you pass the entire text file as a single string, you could easily run into an out-of-memory exception in your first loop:
// imagine if s.Length was 100k or so
for (int i = 0; i < s.Length; i++)
{
    AddString(s.Substring(i, s.Length-i));
}
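To put a number on why that loop is risky, here is a minimal sketch of the arithmetic. The class and method names (`SuffixCost`, `TotalSuffixChars`, `ApproxBytes`) are made up for illustration and are not part of the original code. It counts the characters copied when every suffix of a string of length n is materialized with `Substring`.

```csharp
using System;

public static class SuffixCost
{
    // Characters copied by taking every suffix of a string of length n:
    // n + (n - 1) + ... + 1 = n * (n + 1) / 2.
    // long is used because the result overflows int for n around 100k.
    public static long TotalSuffixChars(long n)
    {
        return n * (n + 1) / 2;
    }

    // Rough size of those copies in bytes, assuming 2 bytes per UTF-16 char
    // and ignoring per-string object overhead.
    public static long ApproxBytes(long n)
    {
        return TotalSuffixChars(n) * 2;
    }
}
```

For n = 100,000, `SuffixCost.TotalSuffixChars(100000)` returns 5,000,050,000, so roughly 10 GB of character data passes through the loop. The substrings themselves are short-lived, but an uncompressed trie that creates one node per character can grow on the same quadratic order when few suffixes share prefixes.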
When reading the file to construct the trie, you'll need to split each line and probably normalize the characters:
string line;
while (null != (line = reader.ReadLine()))
{
    string[] parts = line.Split(' ', ',', '.', '!', '\t', '?'); // naive
    foreach (string part in parts)
    {
        if (part.Length > 0)
        {
            // make each string uppercase so as to avoid Hello and hello being
            // two trie entries
            trie.AddSuffix(part.ToUpperInvariant());
        }
    }
}
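The reading loop above depends on a `reader` and a `trie` defined elsewhere. As a self-contained sketch of the same idea, the hypothetical `Tokenizer.ReadTokens` helper below (a name introduced here, not in the original) streams a `TextReader` one line at a time and yields normalized tokens, so the whole file never has to sit in a single string. It uses `StringSplitOptions.RemoveEmptyEntries` in place of the explicit length check.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class Tokenizer
{
    // Same naive separator set as the loop above.
    private static readonly char[] Separators = { ' ', ',', '.', '!', '\t', '?' };

    // Lazily yields upper-cased tokens, reading one line at a time.
    public static IEnumerable<string> ReadTokens(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] parts = line.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
            foreach (string part in parts)
            {
                // Upper-case so that Hello and hello map to the same entry.
                yield return part.ToUpperInvariant();
            }
        }
    }
}
```

Each token from `Tokenizer.ReadTokens(new StringReader("Hello, hello world!\nBye."))` could then be passed to the trie; that input yields HELLO, HELLO, WORLD, BYE.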
For example (on the output of dir /b c:\windows):
A
D
D
I
N
S
E
D
P
P
C
O
M
P
A
T
P
A
T
C
H
...
To handle larger files appropriately, a more compact trie structure would be desirable. I would just store unshared suffixes in a separate dictionary:
// If you add a character, but there is no entry in m_children
// just park the tail end of it here
Dictionary<char, string> m_tails;
You would then move the per-character logic into the AddString method of SuffixNode:
public void AddString(string s)
{
    if (s.Length == 0) return;
    char c = s[0];
    if (m_children.ContainsKey(c))
    {
        if (s.Length > 1) m_children[c].AddString(s.Substring(1));
    }
    else if (m_tails.ContainsKey(c))
    {
        SuffixNode node = new SuffixNode();
        node.AddString(m_tails[c]);
        if (s.Length > 1) node.AddString(s.Substring(1));
        m_children.Add(c, node);
        m_tails.Remove(c);
    }
    else
    {
        m_tails.Add(c, s.Length > 1 ? s.Substring(1) : "");
    }
}
Now you have a much more compact version of the trie, which will greatly decrease the number of child SuffixNodes created for any given corpus. Returning to the dir /b c:\windows example, we can see a practical reduction in nodes:
A
P
P
COMPAT
PATCH
I
T
I
O
N
S
...
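As a rough check of that reduction, the sketch below compares node counts for two layouts. `NaiveNode` is an assumed one-node-per-character trie (the original naive implementation is not shown), and `CompactNode` mirrors the tail-parking AddString logic described above. Both class names and the `CountNodes` helper are hypothetical.

```csharp
using System.Collections.Generic;

// Assumed baseline: allocates one node for every character.
public class NaiveNode
{
    private readonly Dictionary<char, NaiveNode> m_children = new Dictionary<char, NaiveNode>();

    public void AddString(string s)
    {
        NaiveNode node = this;
        foreach (char c in s)
        {
            NaiveNode child;
            if (!node.m_children.TryGetValue(c, out child))
            {
                child = new NaiveNode();
                node.m_children.Add(c, child);
            }
            node = child;
        }
    }

    // Counts this node plus all descendants.
    public int CountNodes()
    {
        int count = 1;
        foreach (NaiveNode child in m_children.Values) count += child.CountNodes();
        return count;
    }
}

// Compact variant: unshared tails are parked as plain strings.
public class CompactNode
{
    private readonly Dictionary<char, CompactNode> m_children = new Dictionary<char, CompactNode>();
    private readonly Dictionary<char, string> m_tails = new Dictionary<char, string>();

    public void AddString(string s)
    {
        if (s.Length == 0) return;
        char c = s[0];
        if (m_children.ContainsKey(c))
        {
            if (s.Length > 1) m_children[c].AddString(s.Substring(1));
        }
        else if (m_tails.ContainsKey(c))
        {
            // Second string through this character: expand the parked tail.
            CompactNode node = new CompactNode();
            node.AddString(m_tails[c]);
            if (s.Length > 1) node.AddString(s.Substring(1));
            m_children.Add(c, node);
            m_tails.Remove(c);
        }
        else
        {
            m_tails.Add(c, s.Length > 1 ? s.Substring(1) : "");
        }
    }

    // Counts this node plus all descendants; parked tails are not nodes.
    public int CountNodes()
    {
        int count = 1;
        foreach (CompactNode child in m_children.Values) count += child.CountNodes();
        return count;
    }
}
```

Adding just the two words APPCOMPAT and APPPATCH gives 15 nodes in the naive layout (the root plus 14 characters after sharing the APP prefix) and 4 in the compact one (the root plus A, P, P), with COMPAT and PATCH held as tails.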
At this point your trie has a more efficient representation. You're left with determining how to deal with terminal node representations in order to ensure lookups are accurate.
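One possible way to handle terminals, offered as a sketch under assumptions and not as the answer's own design: give each node a boolean flag that is set when a string ends exactly there, and treat a parked tail as ending at its last character. The `TerminalAwareNode` class, its `m_isTerminal` field, and the `Contains` method are all hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;

public class TerminalAwareNode
{
    private readonly Dictionary<char, TerminalAwareNode> m_children = new Dictionary<char, TerminalAwareNode>();
    private readonly Dictionary<char, string> m_tails = new Dictionary<char, string>();

    // Assumption: true when some added string ends exactly at this node.
    private bool m_isTerminal;

    public void AddString(string s)
    {
        if (s.Length == 0) { m_isTerminal = true; return; }
        char c = s[0];
        string rest = s.Substring(1);
        TerminalAwareNode child;
        string tail;
        if (m_children.TryGetValue(c, out child))
        {
            // Always recurse, even with an empty rest, so the child gets marked.
            child.AddString(rest);
        }
        else if (m_tails.TryGetValue(c, out tail))
        {
            if (tail == rest) return; // identical string already stored
            child = new TerminalAwareNode();
            child.AddString(tail);    // an empty tail marks the child terminal
            child.AddString(rest);
            m_children.Add(c, child);
            m_tails.Remove(c);
        }
        else
        {
            m_tails.Add(c, rest);
        }
    }

    // Exact-match lookup: true only if s was added as a complete string.
    public bool Contains(string s)
    {
        if (s.Length == 0) return m_isTerminal;
        char c = s[0];
        string rest = s.Substring(1);
        TerminalAwareNode child;
        string tail;
        if (m_children.TryGetValue(c, out child)) return child.Contains(rest);
        if (m_tails.TryGetValue(c, out tail)) return tail == rest;
        return false;
    }
}
```

With A, AB, and APPLE added, `Contains("A")` and `Contains("APPLE")` return true while `Contains("AP")` returns false. A substring-style search over suffixes would need a prefix-matching variant instead of this exact match.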