.NET Regular expressions on bytes instead of chars

Question

I'm trying to do some parsing that would be easier with regular expressions.

The input is an array (or enumeration) of bytes.

I don't want to convert the bytes to chars for the following reasons:

  1. Computation efficiency
  2. Memory consumption efficiency
  3. Some non-printable bytes might be complex to convert to chars. Not all bytes are printable.

So I can't use Regex (http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex%28v=VS.71%29.aspx).

The only solution I know of is Boost.Regex (which works on bytes, i.e. C chars), but it is a C++ library, and wrapping it with C++/CLI would take considerable work.

How can I use regular expressions on bytes in .NET directly, without working with .NET strings and chars?

Thank you.

Recommended answer

There is a bit of an impedance mismatch going on here. You want to work with regular expressions in .NET, which operate on strings (sequences of two-byte chars), but you also want to work with single-byte characters. You can't have both at the same time using .NET in the usual way.
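For contrast, the usual route that the question wants to avoid might look like the sketch below. It is not part of the original answer. It assumes the ISO-8859-1 (Latin-1) encoding, which maps each byte value 0 to 255 onto the char with the same numeric value, so non-printable bytes survive the round trip. `ConventionalDecode` and `ToText` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text;

public static class ConventionalDecode
{
    // Assumption: ISO-8859-1 maps byte values 0..255 to chars U+0000..U+00FF one-to-one.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Decodes the bytes into a brand-new string; this allocates on every call.
    public static string ToText(byte[] input) => Latin1.GetString(input);
}
```

Each call allocates a new string that takes two bytes per input byte, which is exactly the cost the question is concerned about.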

However, to break this mismatch down, you could treat a string in a byte-oriented fashion and mutate it. The mutated string can then act as a reusable buffer. This way you do not have to convert your bytes to chars, or convert your input buffer into a new string (as per your question).

An example:

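// Requires compiling with unsafe code enabled (AllowUnsafeBlocks).
using System;
using System.Text.RegularExpressions;

// The bytes spell "BLING" in ASCII.
byte[] inputBuffer = { 66, 76, 73, 78, 71 };

// Reusable buffer: 1000 chars, all initially '\0'.
string stringBuffer = new string('\0', 1000);

Regex regex = new Regex("ING", RegexOptions.Compiled);

unsafe
{
    fixed (char* charArray = stringBuffer)
    {
        byte* buffer = (byte*)(charArray);

        // Hard-coded example of string mutation; in practice you would
        // loop over your input buffers and regex/match so that the string
        // buffer is re-used. Each char is two bytes, so input byte i goes
        // to offset 2 * i (the low byte of the char, assuming little-endian).

        buffer[0] = inputBuffer[0];
        buffer[2] = inputBuffer[1];
        buffer[4] = inputBuffer[2];
        buffer[6] = inputBuffer[3];
        buffer[8] = inputBuffer[4];

        // Substring allocates a new string; it is used here only for display.
        Console.WriteLine("Mutated string:'{0}'.",
             stringBuffer.Substring(0, inputBuffer.Length));

        // Match only over the part of the buffer that holds the input.
        Match match = regex.Match(stringBuffer, 0, inputBuffer.Length);

        Console.WriteLine("Position:{0} Length:{1}.", match.Index, match.Length);
    }
}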

Using this technique you can allocate a string "buffer" that is re-used as the input to Regex and mutated with your bytes each time. This avoids the overhead of converting/encoding your byte array into a new .NET string each time you want to do a match. This could prove to be very significant: I have seen many an algorithm in .NET try to go at a million miles an hour, only to be brought to its knees by string generation and the subsequent heap spamming and time spent in GC.
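The re-use described above could be wrapped up roughly as follows. This is a minimal sketch rather than the answer's own code: `ByteRegexBuffer` and `Find` are hypothetical names, and it writes through the `char*` directly (widening each byte to a char in place) instead of writing to every second byte, which avoids depending on byte order. It needs unsafe code enabled, and it mutates a string in place, which .NET does not officially support, so the buffer must never be shared or interned.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: one string buffer and one Regex, re-used for many inputs.
public sealed class ByteRegexBuffer
{
    private readonly string _buffer;   // mutated in place; keep it private
    private readonly Regex _regex;

    public ByteRegexBuffer(string pattern, int capacity)
    {
        _buffer = new string('\0', capacity);
        _regex = new Regex(pattern, RegexOptions.Compiled);
    }

    // Copies the bytes into the buffer (one byte per char) and matches only
    // over the copied range. No new input string is allocated.
    public unsafe Match Find(byte[] input)
    {
        if (input.Length > _buffer.Length)
            throw new ArgumentException("Input is larger than the buffer.", nameof(input));

        fixed (char* chars = _buffer)
        {
            for (int i = 0; i < input.Length; i++)
                chars[i] = (char)input[i];   // high byte of each char becomes zero
        }

        return _regex.Match(_buffer, 0, input.Length);
    }
}
```

A returned Match refers back to the shared buffer, so read what you need from it before calling `Find` again with the next input.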

Obviously this is unsafe code, but it is .NET.

The results of the Regex will generate strings though, so you have an issue here. I'm not sure whether there is a way of using Regex that does not generate new strings. You can certainly get at the match index and length information, but the string generation violates your requirement for memory efficiency.

Update

Actually, after disassembling Regex/Match/Group/Capture, it looks like the captured string is only generated when you access the Value property, so you may at least avoid generating strings if you only access the Index and Length properties. However, you will still be generating all the supporting Regex objects.
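Building on that observation, the matched bytes can be read from the original array using only Index and Length. The sketch below is an illustration, not part of the original answer: `MatchBytes` and `Slice` are hypothetical names, and it assumes the string buffer holds one char per input byte starting at position 0, as in the example above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchBytes
{
    // Returns a view over the matched bytes in the original array.
    // Only Match.Index and Match.Length are read; Match.Value is never touched.
    public static ArraySegment<byte> Slice(byte[] input, Match match)
    {
        if (!match.Success)
            return new ArraySegment<byte>(input, 0, 0);

        return new ArraySegment<byte>(input, match.Index, match.Length);
    }
}
```

With the earlier example, `MatchBytes.Slice(inputBuffer, match)` would give a three-byte view over `inputBuffer` starting at offset 2. `ArraySegment<byte>` is a struct, so the slice itself adds no heap allocation.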
