Find byte sequence within a byte array
Question
I have a byte array and want to find the occurrences of a particular byte sequence in it.
For example, 00 69 73 6F 6D in a very large byte array (larger than 50 to 100 MB).
OR
Even better, the reverse operation: finding the most common pattern without knowing it in advance. The code should be able to read the file and discover the pattern itself.
Recommended answer
You can use the Boyer-Moore algorithm to efficiently search for a sequence of bytes in an array of bytes.
Here's a C# version that I converted from the Java version in the Wikipedia entry on Boyer-Moore.
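using System;
using System.Collections.Generic;

public sealed class BoyerMoore
{
    readonly byte[] needle;
    readonly int[] charTable;    // bad-character shifts, indexed by byte value
    readonly int[] offsetTable;  // good-suffix shifts, indexed by number of matched bytes

    public BoyerMoore(byte[] needle)
    {
        this.needle = needle;
        this.charTable = makeByteTable(needle);
        this.offsetTable = makeOffsetTable(needle);
    }

    // Lazily yields the start index of every occurrence of needle in haystack,
    // including overlapping occurrences.
    public IEnumerable<int> Search(byte[] haystack)
    {
        if (needle.Length == 0)
            yield break;

        int last = needle.Length - 1;

        for (int i = last; i < haystack.Length;)
        {
            // Compare right to left; stop at the first mismatch or at needle[0].
            int j = last;
            while (j > 0 && needle[j] == haystack[i])
            {
                --i;
                --j;
            }

            if (j == 0 && needle[0] == haystack[i])
            {
                yield return i;
                // Full match: apply only the good-suffix shift, measured from the
                // match start. Moving i to the end of the match before shifting
                // would skip adjacent matches, such as the second "ab" in "abab".
                i += offsetTable[last];
            }
            else
            {
                i += Math.Max(offsetTable[last - j], charTable[haystack[i]]);
            }
        }
    }

    // Bad-character table: for each byte value, the distance from its last
    // occurrence in the needle (excluding the final position) to the needle's end.
    static int[] makeByteTable(byte[] needle)
    {
        const int ALPHABET_SIZE = 256;
        int[] table = new int[ALPHABET_SIZE];

        for (int i = 0; i < table.Length; ++i)
            table[i] = needle.Length;

        for (int i = 0; i < needle.Length - 1; ++i)
            table[needle[i]] = needle.Length - 1 - i;

        return table;
    }

    // Good-suffix table: how far to move i after a given number of trailing
    // needle bytes have matched.
    static int[] makeOffsetTable(byte[] needle)
    {
        int[] table = new int[needle.Length];
        int lastPrefixPosition = needle.Length;

        for (int i = needle.Length - 1; i >= 0; --i)
        {
            if (isPrefix(needle, i + 1))
                lastPrefixPosition = i + 1;

            table[needle.Length - 1 - i] = lastPrefixPosition - i + needle.Length - 1;
        }

        for (int i = 0; i < needle.Length - 1; ++i)
        {
            int slen = suffixLength(needle, i);
            table[slen] = needle.Length - 1 - i + slen;
        }

        return table;
    }

    // True if needle[p..] is also a prefix of needle.
    static bool isPrefix(byte[] needle, int p)
    {
        for (int i = p, j = 0; i < needle.Length; ++i, ++j)
            if (needle[i] != needle[j])
                return false;

        return true;
    }

    // Length of the longest run ending at index p that matches a suffix of needle.
    static int suffixLength(byte[] needle, int p)
    {
        int len = 0;

        for (int i = p, j = needle.Length - 1; i >= 0 && needle[i] == needle[j]; --i, --j)
            ++len;

        return len;
    }
}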
Here's some console app test code for it:
public static void Main()
{
    byte[] haystack = new byte[10000];
    byte[] needle = { 0x00, 0x69, 0x73, 0x6F, 0x6D };

    // Put a few copies of the needle into the haystack.
    for (int i = 1000; i <= 9000; i += 1000)
        Array.Copy(needle, 0, haystack, i, needle.Length);

    var searcher = new BoyerMoore(needle);

    foreach (int index in searcher.Search(haystack))
        Console.WriteLine(index);
}
Note how the Search() method returns the indices of all the locations where needle starts inside haystack.
If you just want the count, you can do this (Count() requires using System.Linq):
int count = new BoyerMoore(needle).Search(haystack).Count();
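When testing any fast searcher, it can help to compare its results against a brute-force scan on small inputs, especially inputs where matches sit next to each other or overlap. The following is only a minimal sketch: the NaiveSearch class and its IndicesOf method are hypothetical names introduced here for illustration and are not part of the original answer.

```csharp
using System.Collections.Generic;

public static class NaiveSearch
{
    // Hypothetical reference helper: yields every start index at which needle
    // occurs in haystack, overlapping occurrences included.
    // Worst case O(haystack.Length * needle.Length), so use it for tests only.
    public static IEnumerable<int> IndicesOf(byte[] haystack, byte[] needle)
    {
        if (needle.Length == 0)
            yield break;

        for (int start = 0; start + needle.Length <= haystack.Length; ++start)
        {
            // Count how many leading bytes agree at this alignment.
            int k = 0;
            while (k < needle.Length && haystack[start + k] == needle[k])
                ++k;

            if (k == needle.Length)
                yield return start;
        }
    }
}
```

Comparing the index sequence from this helper with the one from a faster searcher on inputs such as "abab" with needle "ab" is a quick way to catch shift errors.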
For your second question: I assume you are asking about finding the longest repeated sequence of bytes?
That's a much more complicated, and very different, question. If you want an answer to it, you should ask a separate question, but you should also read the Wikipedia entry on the "longest repeated substring problem".
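If a fixed pattern length is acceptable, a much simpler variant of "most common pattern" is to count every window of that length and keep the most frequent one. This is not a solution to the longest repeated substring problem, only a rough sketch under that assumption. The NGramCounter class, its MostCommon method and the 1 to 8 byte limit (so that a window fits in a ulong) are all choices made here for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class NGramCounter
{
    // Hypothetical helper: returns the most frequent window of `length` bytes
    // (1..8), packed big-endian into a ulong, together with its count.
    // Ties are resolved in favour of the window that reached the count first.
    public static (ulong Pattern, int Count) MostCommon(byte[] data, int length)
    {
        if (length < 1 || length > 8)
            throw new ArgumentOutOfRangeException(nameof(length));

        var counts = new Dictionary<ulong, int>();
        ulong mask = length == 8 ? ulong.MaxValue : (1UL << (8 * length)) - 1;
        ulong window = 0;
        ulong best = 0;
        int bestCount = 0;

        for (int i = 0; i < data.Length; ++i)
        {
            // Slide the window: shift in the next byte, drop the oldest one.
            window = ((window << 8) | data[i]) & mask;

            if (i < length - 1)
                continue; // the window is not full yet

            counts.TryGetValue(window, out int c); // c stays 0 for a new window
            counts[window] = ++c;

            if (c > bestCount)
            {
                bestCount = c;
                best = window;
            }
        }

        return (best, bestCount);
    }
}
```

For example, NGramCounter.MostCommon(data, 5) would report the most frequent 5-byte sequence. On a very large, high-entropy file the dictionary can grow to roughly one entry per byte of input, so memory use is the main limit of this approach.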