Unicode character range not being consumed by Regex


Question

Note

A similar question, C# Regular Expressions with \Uxxxxxxxx characters in the pattern, has already been asked. This question differs in that it is not about how surrogate pairs are calculated, but about how to express Unicode planes higher than 0 in a regex. It should be clear from my question that I already understand why these code points are expressed as 2 characters: they are surrogate pairs (which is what the other question asks about). My question is how to convert them generically (since I have no control over what the regex being fed to the program looks like) so they can be consumed by the .NET Regex engine.

Note I now have a way to do this and would like to add my answer to my question, but since this is now marked as a duplicate I cannot add my answer.

I have some test data that is being passed to a Java library that I am porting to C#. I have isolated a specific problem case as an example. The character class in the original was in UTF-32 = \U0001BCA0-\U0001BCA3, which is not readily consumable by .NET: we get an "Unrecognized escape sequence \U" error.

I attempted to convert to UTF-16 and I have confirmed the results for \U0001BCA0 and \U0001BCA3 are what should be expected.

UTF-32      | Codepoint   | High Surrogate  | Low Surrogate  | UTF-16
---------------------------------------------------------------------------
0x0001BCA0  | 113824      | 55343           | 56480          | \uD82F\uDCA0
0x0001BCA3  | 113827      | 55343           | 56483          | \uD82F\uDCA3
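
The values in the table can be reproduced with the standard UTF-16 arithmetic. The following is a minimal sketch, not part of the original post: the class name `SurrogateDemo` and the helper `ToSurrogates` are hypothetical, and `char.ConvertFromUtf32` is used only to cross-check the manual calculation.

```csharp
using System;

public static class SurrogateDemo
{
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogate values. Hypothetical helper.
    public static (int High, int Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;     // 20-bit value
        int high = 0xD800 + (offset >> 10);   // top 10 bits
        int low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
        return (high, low);
    }

    public static void Main()
    {
        foreach (int cp in new[] { 0x1BCA0, 0x1BCA3 })
        {
            var (high, low) = ToSurrogates(cp);
            // Cross-check against the framework's own conversion.
            string builtIn = char.ConvertFromUtf32(cp);
            bool agrees = builtIn[0] == high && builtIn[1] == low;
            Console.WriteLine($"U+{cp:X} -> \\u{high:X4}\\u{low:X4} (matches ConvertFromUtf32: {agrees})");
        }
    }
}
```

For 0x1BCA0 the offset is 0xBCA0, which gives the high surrogate 0xD82F (55343) and the low surrogate 0xDCA0 (56480), in line with the first table row.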

However, when I pass the string "([\uD82F\uDCA0-\uD82F\uDCA3])" to the constructor of the Regex class, I get an exception "[x-y] range in reverse order".

Although it is pretty clear the characters are specified in the right order (it works in Java), I tried in reverse and got the same error message.
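
A likely explanation for the error is that .NET treats each `\uXXXX` escape inside a character class as a separate UTF-16 code unit, so the class is read as `\uD82F`, then the range `\uDCA0-\uD82F`, then `\uDCA3`, and 0xDCA0 is greater than 0xD82F. The sketch below illustrates this under that assumption; the class `ReverseRangeDemo` and the helper `TryCompile` are hypothetical names introduced for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReverseRangeDemo
{
    // Returns true if the Regex constructor accepts the pattern,
    // false if it rejects it with an ArgumentException.
    public static bool TryCompile(string pattern)
    {
        try
        {
            _ = new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Read per code unit: \uD82F, range \uDCA0-\uD82F (reversed), \uDCA3.
        Console.WriteLine(TryCompile(@"[\uD82F\uDCA0-\uD82F\uDCA3]"));

        // Workaround: keep the shared high surrogate outside the class and
        // express only the low surrogates as a range.
        string workaround = @"\uD82F[\uDCA0-\uDCA3]";
        Console.WriteLine(TryCompile(workaround));
        Console.WriteLine(Regex.IsMatch(char.ConvertFromUtf32(0x1BCA1), workaround));
    }
}
```

The workaround only covers ranges whose endpoints share a high surrogate; wider ranges need more than one alternative.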

I also tried changing the UTF-32 characters from \U0001BCA0-\U0001BCA3 to \x01BCA0-\x01BCA3, but still get the exception "[x-y] range in reverse order".

So, how do I get the .NET Regex class to parse this character range successfully?

NOTE: I tried changing the code to generate a regex character class that includes all of the characters instead of a range and it seems to work, but that is going to turn my regexes that are a few dozen characters into several thousand characters, which surely isn't going to do wonders for performance.

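Actual Regex Example

Again, the above is an isolated example taken from a larger string that fails. What I am looking for is a generic way to convert regexes like the following one so that they can be parsed by the .NET Regex class.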

"([\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD" +
"\\u061C\\u180E\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-" +
"\\uDFFF\\uFEFF\\uFFF0-\\uFFFB\\U0001BCA0-\\U0001BCA3\\U0001D173-" +
"\\U0001D17A\\U000E0000-\\U000E001F\\U000E0080-\\U000E00FF\\U000E01F0-\\U000E0FFF] " +
"| [\\u000D] | [\\u000A]) ()"

Recommended answer

While the other contributors to this question provided some clues, I needed an answer. My test is a rules engine that is driven by a regex that is built up from file input, so hard coding the logic into C# is not an option.

However, I did learn here that

  1. the .NET Regex class does not support surrogate pairs, and
  2. you can fake support for surrogate pair ranges by using regex alternation.
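
As a rough illustration of the second point, the sketch below hand-writes the alternation for one supplementary range, U+11DEF to U+13E07: the tail of the first high surrogate, the full low-surrogate range for the high surrogates in between, and the head of the last high surrogate. The class name `AlternationCheck`, the `Matches` helper, and the `^`/`$` anchors are assumptions added for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationCheck
{
    // Hand-written equivalent of [abc\U00011DEF-\U00013E07], anchored so
    // that only a single character (or surrogate pair) is tested.
    public const string Pattern =
        @"^(?:[abc]" +
        @"|\uD807[\uDDEF-\uDFFF]" +              // U+11DEF..U+11FFF
        @"|[\uD808-\uD80E][\uDC00-\uDFFF]" +     // U+12000..U+13BFF
        @"|\uD80F[\uDC00-\uDE07])$";             // U+13C00..U+13E07

    // Tests one code point against the hand-written pattern.
    public static bool Matches(int codePoint) =>
        Regex.IsMatch(char.ConvertFromUtf32(codePoint), Pattern);

    public static void Main()
    {
        int[] samples = { 'a', 0x11DEE, 0x11DEF, 0x12000, 0x13E07, 0x13E08 };
        foreach (int cp in samples)
        {
            Console.WriteLine($"U+{cp:X4}: {Matches(cp)}");
        }
    }
}
```

The two samples just outside the range (U+11DEE and U+13E08) are there to check that the boundaries of the first and last alternatives are not off by one.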

But of course, in my data-driven case I can't manually change the regexes to a format that .NET will accept - I need to automate it. So, I created the below Utf32Regex class that accepts UTF32 characters directly in the constructor and internally converts them to regexes that .NET understands.

For example, it will convert the regex

"[abc\\U00011DEF-\\U00013E07]"

to

"(?:[abc]|\\uD807[\\uDDEF-\\uDFFF]|[\\uD808-\\uD80E][\\uDC00-\\uDFFF]|\\uD80F[\\uDC00-\\uDE07])"

and the regex

"([\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD" +
"\\u061C\\u180E\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-" +
"\\uDFFF\\uFEFF\\uFFF0-\\uFFFB\\U0001BCA0-\\U0001BCA3\\U0001D173-" +
"\\U0001D17A\\U000E0000-\\U000E001F\\U000E0080-\\U000E00FF\\U000E01F0-\\U000E0FFF] " +
"| [\\u000D] | [\\u000A]) ()"

to

"((?:[\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD\\u061C\\u180E" + 
"\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-\\uDFFF\\uFEFF\\uFFF0-\\uFFFB]|" + 
"\\uD82F[\\uDCA0-\\uDCA3]|\\uD834[\\uDD73-\\uDD7A]|\\uDB40[\\uDC00-\\uDC1F]|" + 
"\\uDB40[\\uDC80-\\uDCFF]|\\uDB40[\\uDDF0-\\uDFFF]|[\\uDB41-\\uDB42][\\uDC00-\\uDFFF]|" + 
"\\uDB43[\\uDC00-\\uDFFF]) | [\\u000D] | [\\u000A]) ()"

Utf32Regex.cs

using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

/// <summary>
/// Patches the <see cref="Regex"/> class so it will automatically convert and interpret
/// UTF32 characters expressed like <c>\U00010000</c> or UTF32 ranges expressed
/// like <c>\U00010000-\U00010001</c>.
/// </summary>
public class Utf32Regex : Regex
{
    private const char MinLowSurrogate = '\uDC00';
    private const char MaxLowSurrogate = '\uDFFF';

    private const char MinHighSurrogate = '\uD800';
    private const char MaxHighSurrogate = '\uDBFF';

    // Match any character class such as [A-z]
    private static readonly Regex characterClass = new Regex(
        "(?<!\\\\)(\\[.*?(?<!\\\\)\\])",
        RegexOptions.Compiled);

    // Match a UTF32 range such as \U000E01F0-\U000E0FFF
    // or an individual character such as \U000E0FFF
    private static readonly Regex utf32Range = new Regex(
        "(?<begin>\\\\U(?:00)?[0-9A-Fa-f]{6})-(?<end>\\\\U(?:00)?[0-9A-Fa-f]{6})|(?<begin>\\\\U(?:00)?[0-9A-Fa-f]{6})",
        RegexOptions.Compiled);

    public Utf32Regex()
        : base()
    {
    }

    public Utf32Regex(string pattern)
        : base(ConvertUTF32Characters(pattern))
    {
    }

    public Utf32Regex(string pattern, RegexOptions options)
        : base(ConvertUTF32Characters(pattern), options)
    {
    }

    public Utf32Regex(string pattern, RegexOptions options, TimeSpan matchTimeout)
        : base(ConvertUTF32Characters(pattern), options, matchTimeout)
    {
    }

    private static string ConvertUTF32Characters(string regexString)
    {
        StringBuilder result = new StringBuilder();
        // Convert any UTF32 character ranges inside character classes
        // (valid code points go up to \U0010FFFF) to their equivalent
        // UTF16 surrogate pairs
        ConvertUTF32CharacterClassesToUTF16Characters(regexString, result);
        // Now find all of the individual characters that were not in ranges and
        // fix those as well.
        ConvertUTF32CharactersToUTF16(result);

        return result.ToString();
    }

    private static void ConvertUTF32CharacterClassesToUTF16Characters(string regexString, StringBuilder result)
    {
        Match match = characterClass.Match(regexString); // Reset
        int lastEnd = 0;
        if (match.Success)
        {
            do
            {
                string characterClass = match.Groups[1].Value;
                string convertedCharacterClass = ConvertUTF32CharacterRangesToUTF16Characters(characterClass);

                result.Append(regexString.Substring(lastEnd, match.Index - lastEnd)); // Remove the match
                result.Append(convertedCharacterClass); // Append replacement 

                lastEnd = match.Index + match.Length;
            } while ((match = match.NextMatch()).Success);
        }
        result.Append(regexString.Substring(lastEnd)); // Append tail
    }

    private static string ConvertUTF32CharacterRangesToUTF16Characters(string characterClass)
    {
        StringBuilder result = new StringBuilder();
        StringBuilder chars = new StringBuilder();

        Match match = utf32Range.Match(characterClass); // Reset
        int lastEnd = 0;
        if (match.Success)
        {
            do
            {
                string utf16Chars;
                string rangeBegin = match.Groups["begin"].Value.Substring(2);

                if (!string.IsNullOrEmpty(match.Groups["end"].Value))
                {
                    string rangeEnd = match.Groups["end"].Value.Substring(2);
                    utf16Chars = UTF32RangeToUTF16Chars(rangeBegin, rangeEnd);
                }
                else
                {
                    utf16Chars = UTF32ToUTF16Chars(rangeBegin);
                }

                result.Append(characterClass.Substring(lastEnd, match.Index - lastEnd)); // Remove the match
                chars.Append(utf16Chars); // Append replacement 

                lastEnd = match.Index + match.Length;
            } while ((match = match.NextMatch()).Success);
        }
        result.Append(characterClass.Substring(lastEnd)); // Append tail of character class

        // Special case - if we have removed all of the contents of the
        // character class, we need to remove the square brackets and the
        // alternation character |
        int emptyCharClass = result.IndexOf("[]");
        if (emptyCharClass >= 0)
        {
            result.Remove(emptyCharClass, 2);
            // Append replacement ranges (exclude beginning |)
            result.Append(chars.ToString(1, chars.Length - 1));
        }
        else
        {
            // Append replacement ranges
            result.Append(chars.ToString());
        }

        if (chars.Length > 0)
        {
            // Wrap both the character class and any UTF16 character alternation into
            // a non-capturing group.
            return "(?:" + result.ToString() + ")";
        }
        return result.ToString();
    }

    private static void ConvertUTF32CharactersToUTF16(StringBuilder result)
    {
        while (true)
        {
            int where = result.IndexOf("\\U00");
            if (where < 0)
            {
                break;
            }
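            // Build the UTF16 form directly: UTF32ToUTF16Chars prepends a '|',
            // which is only wanted for the alternation generated from character
            // classes. The non-capturing group keeps a following quantifier
            // applied to the whole surrogate pair.
            int codePoint = int.Parse(result.ToString(where + 2, 8), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            var cp = new StringBuilder("(?:");
            AppendUTF16CodePoint(cp, codePoint);
            cp.Append(')');
            result.Replace(where, where + 10, cp.ToString());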
        }
    }

    private static string UTF32RangeToUTF16Chars(string hexBegin, string hexEnd)
    {
        var result = new StringBuilder();
        int beginCodePoint = int.Parse(hexBegin, NumberStyles.HexNumber);
        int endCodePoint = int.Parse(hexEnd, NumberStyles.HexNumber);

        var beginChars = char.ConvertFromUtf32(beginCodePoint);
        var endChars = char.ConvertFromUtf32(endCodePoint);
        int beginDiff = endChars[0] - beginChars[0];

        if (beginDiff == 0)
        {
            // If the begin character is the same, we can just use the syntax \uD807[\uDDEF-\uDFFF]
            result.Append("|");
            AppendUTF16Character(result, beginChars[0]);
            result.Append('[');
            AppendUTF16Character(result, beginChars[1]);
            result.Append('-');
            AppendUTF16Character(result, endChars[1]);
            result.Append(']');
        }
        else
        {
            // If the begin character is not the same, create 3 ranges
            // 1. The remainder of the first
            // 2. A range of all of the middle characters
            // 3. The beginning of the last

            result.Append("|");
            AppendUTF16Character(result, beginChars[0]);
            result.Append('[');
            AppendUTF16Character(result, beginChars[1]);
            result.Append('-');
            AppendUTF16Character(result, MaxLowSurrogate);
            result.Append(']');

            // We only need a middle range if the ranges are not adjacent
            if (beginDiff > 1)
            {
                result.Append("|");
                // We only need a character class if there are more than 1
                // characters in the middle range
                if (beginDiff > 2)
                {
                    result.Append('[');
                }
                AppendUTF16Character(result, (char)(Math.Min(beginChars[0] + 1, MaxHighSurrogate)));
                if (beginDiff > 2)
                {
                    result.Append('-');
                    AppendUTF16Character(result, (char)(Math.Max(endChars[0] - 1, MinHighSurrogate)));
                    result.Append(']');
                }
                result.Append('[');
                AppendUTF16Character(result, MinLowSurrogate);
                result.Append('-');
                AppendUTF16Character(result, MaxLowSurrogate);
                result.Append(']');
            }

            result.Append("|");
            AppendUTF16Character(result, endChars[0]);
            result.Append('[');
            AppendUTF16Character(result, MinLowSurrogate);
            result.Append('-');
            AppendUTF16Character(result, endChars[1]);
            result.Append(']');
        }
        return result.ToString();
    }

    private static string UTF32ToUTF16Chars(string hex)
    {
        int codePoint = int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return UTF32ToUTF16Chars(codePoint);
    }

    private static string UTF32ToUTF16Chars(int codePoint)
    {
        StringBuilder result = new StringBuilder();
        UTF32ToUTF16Chars(codePoint, result);
        return result.ToString();
    }

    private static void UTF32ToUTF16Chars(int codePoint, StringBuilder result)
    {
        // Use regex alternation so that each UTF32 code point is treated
        // as a single unit (its whole surrogate pair) rather than as two
        // independent characters.
        result.Append("|");
        AppendUTF16CodePoint(result, codePoint);
    }

    private static void AppendUTF16CodePoint(StringBuilder text, int cp)
    {
        var chars = char.ConvertFromUtf32(cp);
        AppendUTF16Character(text, chars[0]);
        if (chars.Length == 2)
        {
            AppendUTF16Character(text, chars[1]);
        }
    }

    private static void AppendUTF16Character(StringBuilder text, char c)
    {
        text.Append(@"\u");
        // Always emit four hex digits; a \u escape with fewer digits is invalid.
        text.Append(((int)c).ToString("X4", CultureInfo.InvariantCulture));
    }
}

StringBuilderExtensions.cs

using System;
using System.Text;

public static class StringBuilderExtensions
{
    /// <summary>
    /// Searches for the first index of the specified string. The search
    /// starts at the beginning and moves towards the end.
    /// </summary>
    /// <param name="text">This <see cref="StringBuilder"/>.</param>
    /// <param name="value">The string to find.</param>
    /// <returns>The index of the specified character, or -1 if the character isn't found.</returns>
    public static int IndexOf(this StringBuilder text, string value)
    {
        return IndexOf(text, value, 0);
    }

    /// <summary>
    /// Searches for the index of the specified string. The search
    /// starts at the specified offset and moves towards the end.
    /// </summary>
    /// <param name="text">This <see cref="StringBuilder"/>.</param>
    /// <param name="value">The string to find.</param>
    /// <param name="startIndex">The starting offset.</param>
    /// <returns>The index of the specified character, or -1 if the character isn't found.</returns>
    public static int IndexOf(this StringBuilder text, string value, int startIndex)
    {
        if (text == null)
            throw new ArgumentNullException("text");
        if (value == null)
            throw new ArgumentNullException("value");

        int index;
        int length = value.Length;
        int maxSearchLength = (text.Length - length) + 1;

        for (int i = startIndex; i < maxSearchLength; ++i)
        {
            if (text[i] == value[0])
            {
                index = 1;
                while ((index < length) && (text[i + index] == value[index]))
                    ++index;

                if (index == length)
                    return i;
            }
        }

        return -1;
    }

    /// <summary>
    /// Replaces the specified subsequence in this builder with the specified
    /// string.
    /// </summary>
    /// <param name="text">this builder.</param>
    /// <param name="start">the inclusive begin index.</param>
    /// <param name="end">the exclusive end index.</param>
    /// <param name="str">the replacement string.</param>
    /// <returns>this builder.</returns>
    /// <exception cref="IndexOutOfRangeException">
    /// if <paramref name="start"/> is negative, greater than the current
    /// <see cref="StringBuilder.Length"/> or greater than <paramref name="end"/>.
    /// </exception>
    /// <exception cref="ArgumentNullException">if <paramref name="str"/> is <c>null</c>.</exception>
    public static StringBuilder Replace(this StringBuilder text, int start, int end, string str)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str));
        }
        if (start >= 0)
        {
            if (end > text.Length)
            {
                end = text.Length;
            }
            if (end > start)
            {
                int stringLength = str.Length;
                int diff = end - start - stringLength;
                if (diff > 0)
                { // replacing with fewer characters
                    text.Remove(start, diff);
                }
                else if (diff < 0)
                {
                    // replacing with more characters...need some room
                    text.Insert(start, new char[-diff]);
                }
                // copy the chars based on the new length
                for (int i = 0; i < stringLength; i++)
                {
                    text[i + start] = str[i];
                }
                return text;
            }
            if (start == end)
            {

                text.Insert(start, str);
                return text;
            }
        }
        throw new IndexOutOfRangeException();
    }
}

Do note this is not very well tested and probably not very robust, but for testing purposes it should be fine.
