How to fix UTF encoding for whitespaces?
Problem description
In my C# code, I am extracting text from a PDF document. When I do that, I get a string that's in UTF-8 or Unicode encoding (I'm not sure which). When I use Encoding.UTF8.GetBytes(src); to convert it into a byte array, I notice that the whitespace is actually two bytes with values 194 and 160.
For example, the string "CLE Action" looks like
[67, 76, 69, 194, 160, 65, 99, 116, 105, 111, 110]
in a byte array, where the whitespace is 194 and 160. Because of this, src.IndexOf("CLE Action"); is returning -1 when I need it to return 1.
How can I fix the encoding of the string?
194 160 is the UTF-8 encoding of the NO-BREAK SPACE code point, U+00A0 (the same code point that HTML calls &nbsp;).
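As a quick check of that claim, the following minimal sketch prints the UTF-8 bytes of a NO-BREAK SPACE next to those of an ordinary space. The class and method names (NbspBytes, Utf8Decimal) are made up for illustration, and the sample string is a stand-in for the text extracted from the PDF.

```csharp
using System;
using System.Linq;
using System.Text;

public static class NbspBytes
{
    // Hypothetical helper: UTF-8 bytes of a string as a comma-separated decimal list.
    public static string Utf8Decimal(string s) =>
        string.Join(",", Encoding.UTF8.GetBytes(s).Select(b => b.ToString()));

    public static void Main()
    {
        Console.WriteLine(Utf8Decimal("\u00A0"));          // NO-BREAK SPACE: 194,160
        Console.WriteLine(Utf8Decimal(" "));               // ordinary space: 32
        Console.WriteLine(Utf8Decimal("CLE\u00A0Action")); // 67,76,69,194,160,65,99,116,105,111,110
    }
}
```

The string itself is fine; it simply contains U+00A0 where an ordinary space (U+0020) was expected, so there is no encoding to repair.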
So it's really not a space, even though it looks like one. (You'll see it won't word-wrap, for instance.) A regular expression match for \s would match it, but a plain comparison with a space won't.
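The difference between the two kinds of matching can be seen in a small sketch like the one below. The names NbspMatching, RegexSeesWhitespace and ContainsPlainSpace are hypothetical, and the sample string again stands in for the extracted text. It assumes the default .NET regex options (not RegexOptions.ECMAScript, which narrows \s to ASCII whitespace).

```csharp
using System;
using System.Text.RegularExpressions;

public static class NbspMatching
{
    // True if the regex class \s finds any whitespace in the string.
    public static bool RegexSeesWhitespace(string s) => Regex.IsMatch(s, @"\s");

    // True only if the string contains an ordinary space (U+0020).
    public static bool ContainsPlainSpace(string s) => s.IndexOf(' ') >= 0;

    public static void Main()
    {
        string sample = "CLE\u00A0Action"; // NO-BREAK SPACE between the words

        Console.WriteLine(RegexSeesWhitespace(sample));   // True
        Console.WriteLine(ContainsPlainSpace(sample));    // False
        Console.WriteLine(char.IsWhiteSpace('\u00A0'));   // True
    }
}
```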
To simply replace NO-BREAK spaces you can do the following:
src = src.Replace('\u00A0', ' ');
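To see the effect of the replacement on the search from the question, here is a short sketch. The wrapper ReplaceNoBreakSpaces and the sample value of src are assumptions for illustration, and StringComparison.Ordinal is used so the result does not depend on the current culture.

```csharp
using System;

public static class NbspFix
{
    // Hypothetical wrapper around the one-line fix: swap U+00A0 for an ordinary space.
    public static string ReplaceNoBreakSpaces(string s) => s.Replace('\u00A0', ' ');

    public static void Main()
    {
        string src = "CLE\u00A0Action"; // stand-in for the text extracted from the PDF

        // Before the fix the search term, which uses an ordinary space, is not found.
        Console.WriteLine(src.IndexOf("CLE Action", StringComparison.Ordinal)); // -1

        src = ReplaceNoBreakSpaces(src);

        // After the fix the term is found at the start of this sample string.
        Console.WriteLine(src.IndexOf("CLE Action", StringComparison.Ordinal)); // 0
    }
}
```

If the no-break spaces need to be preserved in src, an alternative is to leave the string alone and put '\u00A0' into the search term instead.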