C# regex to remove non-printable characters and control characters in text that mixes many different languages and Unicode letters

Views: 364 | Posted: 2020/9/25 21:17:32 | Tags: c# regex unicode

Question

I would appreciate your help with this: I do not know which range of characters to use, or whether there is a character class like [[:cntrl:]], which I found in Ruby.

By non-printable, I mean all characters that are not shown in the output when the input string is printed; those are the ones I want to delete. Please note that I am looking for a C# regex; I do not have a problem with my code.

Recommended answer

You can remove all control and other non-printable characters with:

s = Regex.Replace(s, @"\p{C}+", string.Empty);

The \p{C} Unicode category class matches every character in the Unicode "Other" category: control (Cc), format (Cf), private use (Co), surrogate (Cs) and unassigned (Cn). That includes control characters outside the ASCII table, because in .NET Unicode category classes are Unicode-aware by default.
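
As a minimal sketch of how the one-liner above might be wrapped for reuse, the helper below caches the compiled pattern and applies it to a mixed-language string. The class name TextCleaner and the method name RemoveNonPrintable are hypothetical and do not come from the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (names are assumptions, not from the original answer).
public static class TextCleaner
{
    // One or more UTF-16 code units whose Unicode general category is "Other" (C).
    private static readonly Regex OtherChars =
        new Regex(@"\p{C}+", RegexOptions.Compiled);

    // Returns the input with every run of category-C characters removed.
    public static string RemoveNonPrintable(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        return OtherChars.Replace(input, string.Empty);
    }
}
```

For example, TextCleaner.RemoveNonPrintable("Hello\u0000 мир\u200B 世界\u0007!") should give "Hello мир 世界!": the NUL, zero-width space and BEL characters are dropped, while ordinary spaces (category Zs) and the Cyrillic and CJK letters are kept. Note that tab, carriage return and line feed are also in category Cc, so this pattern removes them as well.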

Breaking it down into subcategories

To match only the basic control characters, use \p{Cc}+ (the 65 characters in the Other, Control Unicode category). It is equivalent to [\u0000-\u0008\u000E-\u001F\u007F-\u0084\u0086-\u009F\u0009-\u000D\u0085]+, which simplifies to [\u0000-\u001F\u007F-\u009F]+.
To match only the Other, Format characters (161 when the answer was written; the exact count depends on the Unicode version), including the well-known soft hyphen (\u00AD), zero-width space (\u200B), zero-width non-joiner (\u200C), zero-width joiner (\u200D), left-to-right mark (\u200E) and right-to-left mark (\u200F), use \p{Cf}+. The equivalent that also covers astral-plane code points is the (?:[\xAD\u0600-\u0605\u061C\u06DD\u070F\u08E2\u180E\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]|\uD804[\uDCBD\uDCCD]|\uD80D[\uDC30-\uDC38]|\uD82F[\uDCA0-\uDCA3]|\uD834[\uDD73-\uDD7A]|\uDB40[\uDC01\uDC20-\uDC7F])+ regex.
To match the 137,468 Other, Private Use code points, use \p{Co}+ or its equivalent that also covers astral-plane code points, (?:[\uE000-\uF8FF]|[\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uDBBF\uDBFF][\uDC00-\uDFFD])+.
To match the 2,048 Other, Surrogate code points, which UTF-16 uses in pairs to encode astral characters such as many emojis, use \p{Cs}+ or the [\uD800-\uDFFF]+ regex.
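
The subcategories above can be combined when \p{C}+ is too broad. .NET regexes operate on UTF-16 code units, so each half of a surrogate pair is in category Cs and \p{C}+ also strips emojis and other astral characters. The sketch below restricts removal to Cc and Cf and keeps the broad pattern next to it for comparison. The class and method names are hypothetical, and choosing Cc plus Cf is an assumption about what "non-printable" should cover, not something the original answer prescribes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (names are assumptions, not from the original answer).
public static class SelectiveCleaner
{
    // Control (Cc) and format (Cf) characters only; surrogate (Cs) and
    // private-use (Co) code units are deliberately left untouched.
    private static readonly Regex ControlAndFormat =
        new Regex(@"[\p{Cc}\p{Cf}]+", RegexOptions.Compiled);

    // The broad pattern from the answer, kept for comparison.
    private static readonly Regex AllOther =
        new Regex(@"\p{C}+", RegexOptions.Compiled);

    // Removes control and format characters, preserving surrogate pairs.
    public static string RemoveControlAndFormat(string input) =>
        ControlAndFormat.Replace(input, string.Empty);

    // Removes every category-C code unit, including both halves of surrogate pairs.
    public static string RemoveAllOther(string input) =>
        AllOther.Replace(input, string.Empty);
}
```

With the input "a\u00ADb\U0001F600\u0001c" (a soft hyphen, the U+1F600 emoji and a control character between letters), RemoveControlAndFormat should keep the emoji and return "ab\U0001F600c", while RemoveAllOther should return "abc" because both surrogate halves of the emoji are removed.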
