Unicode Regex; Invalid XML characters


Question

The list of valid XML characters is well known. The spec defines it as:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
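As a point of reference, the production above can be written out as a plain range check. This is only a sketch: the class and method names (`XmlCharCheck`, `IsValidXmlCodePoint`) are hypothetical and not part of the question, and it hard-codes the code points, which is exactly what the question hopes to avoid.

```csharp
public static class XmlCharCheck
{
    // Hypothetical helper: true when the code point is allowed by the
    // Char production quoted above.
    public static bool IsValidXmlCodePoint(int cp) =>
        cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

For example, `XmlCharCheck.IsValidXmlCodePoint(0x0B)` is false (vertical tab), while `XmlCharCheck.IsValidXmlCodePoint(0x1F600)` is true (a supplementary-plane character).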

My question is whether it's possible to write a PCRE regular expression for this (or its inverse) using Unicode general categories, without actually hard-coding the code points. An inverse might be something like [\p{Cc}\p{Cs}\p{Cn}], except that it improperly covers linefeeds and tabs and misses some other invalid characters.
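The over-matching described here is easy to reproduce. The sketch below uses the .NET regex engine rather than PCRE (an assumption made to stay in one language; .NET also understands the `\p{Cc}`, `\p{Cs}` and `\p{Cn}` categories), and the class name `CategoryClassDemo` is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CategoryClassDemo
{
    // The candidate inverse class from the question, unchanged.
    private static readonly Regex Candidate = new Regex(@"[\p{Cc}\p{Cs}\p{Cn}]");

    // True when the candidate class would flag something in s as invalid.
    public static bool Flags(string s) => Candidate.IsMatch(s);

    public static void Main()
    {
        Console.WriteLine(Flags("\t"));     // True: tab is in Cc, yet valid in XML
        Console.WriteLine(Flags("\n"));     // True: linefeed is in Cc, yet valid in XML
        Console.WriteLine(Flags("\u0085")); // True: NEL is in Cc, yet allowed by the Char production
        Console.WriteLine(Flags("A"));      // False
    }
}
```

So the category-based class flags tab, linefeed and NEL even though the Char production allows all three, which is the mismatch the question points out.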

Recommended answer

I know this isn't exactly an answer to your question, but it's helpful to have it here:

Regular expression to match valid XML characters (as written, it covers only the Basic Multilingual Plane; the [#x10000-#x10FFFF] range is not included):

[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]
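In a UTF-16 engine such as .NET, characters in [#x10000-#x10FFFF] appear as surrogate pairs, which the class above does not match on its own. One possible way to validate a whole string is to add a surrogate-pair alternative and anchor the pattern. This is a sketch, not part of the original answer; `XmlTextCheck` and `IsValidXmlText` are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class XmlTextCheck
{
    // The character class from the answer, plus a well-formed surrogate pair
    // (high surrogate followed by low surrogate), anchored to the whole input.
    private static readonly Regex ValidXmlText = new Regex(
        @"\A(?:[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*\z",
        RegexOptions.CultureInvariant);

    // True when every UTF-16 unit in text is part of a valid XML character.
    public static bool IsValidXmlText(string text) => ValidXmlText.IsMatch(text);
}
```

With this, `XmlTextCheck.IsValidXmlText("a\tb")` and `XmlTextCheck.IsValidXmlText("\uD83D\uDE00")` are true, while `"a\u0001b"` and a lone `"\uD83D"` are rejected. `\z` is used instead of `$` so that a trailing newline gets no special treatment.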

So to remove invalid characters from XML, you'd do something like:

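using System.Text.RegularExpressions;

// Matches the characters to strip: unpaired surrogates (well-formed surrogate
// pairs are kept), C0 control characters other than tab, LF and CR, and U+FFFE / U+FFFF.
// Note: it also strips U+007F-U+009F and U+FEFF, which the Char production quoted
// in the question permits; drop \x7F-\x9F and \uFEFF from the class to keep them.
private static readonly Regex _invalidXMLChars = new Regex(
    @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
    RegexOptions.Compiled);

/// <summary>
/// Removes characters that cannot appear in XML; returns "" for null or empty input.
/// </summary>
public static string RemoveInvalidXMLChars(string text)
{
    if (string.IsNullOrEmpty(text)) return "";
    return _invalidXMLChars.Replace(text, "");
}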
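A short usage sketch may help show what the filter keeps and drops. The wrapper class `XmlSanitizerDemo` and the sample input are hypothetical; the pattern and method are copied from the snippet above so the example compiles on its own.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlSanitizerDemo
{
    // Same pattern as in the answer.
    private static readonly Regex _invalidXMLChars = new Regex(
        @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
        RegexOptions.Compiled);

    public static string RemoveInvalidXMLChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return "";
        return _invalidXMLChars.Replace(text, "");
    }

    public static void Main()
    {
        // NUL, vertical tab and the trailing lone high surrogate are removed;
        // the tab and the well-formed pair \uD83D\uDE00 are kept.
        string dirty = "ok\u0000\u000B\ttext \uD83D\uDE00 \uD83D!";
        string clean = RemoveInvalidXMLChars(dirty);
        Console.WriteLine(clean == "ok\ttext \uD83D\uDE00 !"); // True
    }
}
```

The lookbehind and lookahead are what let a high surrogate followed by a low surrogate pass through untouched while either half on its own is removed.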

I had our resident regex / XML genius, he of the 4,400+ upvoted post, check this, and he signed off on it.
