How to fix UTF encoding for whitespaces?


Problem description


In my C# code, I am extracting text from a PDF document. When I do that, I get a string that's in UTF-8 or Unicode encoding (I'm not sure which). When I use Encoding.UTF8.GetBytes(src); to convert it into a byte array, I notice that the whitespace is actually two bytes with the values 194 and 160.

For example, the string "CLE Action" looks like

[67, 76, 69, 194, 160, 65, 99, 116, 105, 111, 110]

in a byte array, where the whitespace is 194 and 160. Because of this, src.IndexOf("CLE Action"); is returning -1 when I need it to return 1.

How can I fix the encoding of the string?

Solution

194 160 (0xC2 0xA0) is the UTF-8 encoding of the NO-BREAK SPACE code point, U+00A0 (the same code point that HTML calls &nbsp;).

So it's really not a space, even though it looks like one. (You'll see it won't word-wrap, for instance.) A regular expression match for \s would match it, but a plain comparison with a space won't.
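The difference can be seen in a small test program. This is only a sketch: the class name NbspCheck and the helper ContainsNoBreakSpace are made up for illustration, the sample string is built by hand rather than extracted from a PDF, and StringComparison.Ordinal is passed explicitly so the search does not depend on the current culture.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class NbspCheck
{
    // Hypothetical helper: true if the text contains U+00A0 (NO-BREAK SPACE).
    public static bool ContainsNoBreakSpace(string text)
    {
        return text.IndexOf('\u00A0') >= 0;
    }

    public static void Main()
    {
        // Hand-built sample: "CLE" + NO-BREAK SPACE + "Action".
        string src = "CLE\u00A0Action";

        // UTF-8 bytes of the sample; the NO-BREAK SPACE becomes 194,160.
        byte[] bytes = Encoding.UTF8.GetBytes(src);
        Console.WriteLine(string.Join(",", bytes));

        // \s matches Unicode whitespace, including U+00A0.
        Console.WriteLine(Regex.IsMatch(src, @"CLE\sAction"));

        // A plain search with an ordinary space (U+0020) does not match.
        Console.WriteLine(src.IndexOf("CLE Action", StringComparison.Ordinal));

        Console.WriteLine(ContainsNoBreakSpace(src));
    }
}
```

The regular expression reports a match, while the ordinal search for the version with an ordinary space returns -1.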

To simply replace NO-BREAK spaces you can do the following:

src = src.Replace('\u00A0', ' ');
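To see the effect on IndexOf, the replacement can be wrapped in a small helper and tried on a hand-built string. This is a sketch under assumptions: NbspFix and ReplaceNoBreakSpaces are illustrative names, not part of any library, and the sample string stands in for the text extracted from the PDF.

```csharp
using System;

public static class NbspFix
{
    // Hypothetical helper: replace every NO-BREAK SPACE (U+00A0)
    // with an ordinary space (U+0020).
    public static string ReplaceNoBreakSpaces(string text)
    {
        return text.Replace('\u00A0', ' ');
    }

    public static void Main()
    {
        // Hand-built sample: "CLE" + NO-BREAK SPACE + "Action".
        string src = "CLE\u00A0Action";

        // Before the replacement the search fails (-1).
        Console.WriteLine(src.IndexOf("CLE Action", StringComparison.Ordinal));

        src = ReplaceNoBreakSpaces(src);

        // After the replacement the match is found at index 0,
        // because the sample string starts with "CLE Action".
        Console.WriteLine(src.IndexOf("CLE Action", StringComparison.Ordinal));
    }
}
```

In a longer string the second call returns the position where "CLE Action" starts, so the exact index depends on the surrounding text.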
