How to fix UTF encoding for whitespaces?

Views: 129 Posted: 2016/9/18 13:29:53 Tags: c# unicode encoding utf-8 ascii

Problem description

In my C# code, I am extracting text from a PDF document. When I do that, I get a string that's in UTF-8 or Unicode encoding (I'm not sure which). When I use Encoding.UTF8.GetBytes(src); to convert it into a byte array, I notice that the whitespace is actually two bytes with values 194 and 160.

For example the string "CLE Action" looks like

[67, 76, 69, 194, 160, 65, 99, 116, 105, 111, 110]

in a byte array, where the whitespace is 194 and 160... And because of this src.IndexOf("CLE Action"); is returning -1 when I need it to return 1.

How can I fix the encoding of the string?

Solution

194 160 is the UTF-8 encoding of a NO-BREAK SPACE codepoint (U+00A0, the same codepoint that HTML calls &nbsp;).
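
To see where those two bytes come from, here is a minimal sketch that prints the UTF-8 bytes of a string as decimal values. The class and method names (NbspBytes, Utf8Bytes) are made up for illustration and are not part of the original question or answer.

```csharp
using System.Text;

public static class NbspBytes
{
    // Hypothetical helper: returns the UTF-8 bytes of s as a
    // comma-separated list of decimal values.
    public static string Utf8Bytes(string s)
    {
        return string.Join(", ", Encoding.UTF8.GetBytes(s));
    }
}
```

Under these assumptions, NbspBytes.Utf8Bytes("\u00A0") gives "194, 160", while an ordinary space gives "32". Encoding.UTF8.GetBytes is therefore reporting the character correctly; the string simply contains a different character than the one being searched for.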

So it's really not a space, even though it looks like one. (You'll see it won't word-wrap, for instance.) A regular expression match for \s would match it, but a plain comparison with a space won't.
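
The difference between the two kinds of check can be sketched as follows. The class and member names are hypothetical; the calls themselves (Regex.IsMatch, char.IsWhiteSpace) are standard .NET, and the \s behaviour shown assumes the default regex options rather than RegexOptions.ECMAScript.

```csharp
using System.Text.RegularExpressions;

public static class NbspMatching
{
    // NO-BREAK SPACE, the character behind the 194 160 byte pair.
    public const char Nbsp = '\u00A0';

    // Plain comparison with an ASCII space (U+0020).
    public static bool EqualsPlainSpace(char c) => c == ' ';

    // Regex \s check; with default options .NET treats Unicode
    // separator characters such as U+00A0 as whitespace.
    public static bool MatchesRegexWhitespace(char c) => Regex.IsMatch(c.ToString(), @"\s");

    // char.IsWhiteSpace also classifies U+00A0 as whitespace.
    public static bool IsUnicodeWhitespace(char c) => char.IsWhiteSpace(c);
}
```

For NbspMatching.Nbsp, EqualsPlainSpace returns false while the other two checks return true, which is why IndexOf with a literal space fails to find the text.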

To simply replace NO-BREAK spaces you can do the following:

src = src.Replace('\u00A0', ' ');
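
If the extracted PDF text might contain other space-like characters as well, one possible generalization is sketched below. It goes beyond what the answer proposes: it replaces every character in the Unicode "space separator" category (which includes U+00A0) with an ASCII space. The names SpaceNormalizer and NormalizeSpaces are hypothetical.

```csharp
using System.Globalization;
using System.Text;

public static class SpaceNormalizer
{
    // Hypothetical helper: replaces every Unicode space separator
    // (category Zs, e.g. U+00A0 NO-BREAK SPACE) with an ASCII space.
    // Tabs and line breaks are not in that category and are left alone.
    public static string NormalizeSpaces(string src)
    {
        var sb = new StringBuilder(src.Length);
        foreach (char c in src)
        {
            bool isSpaceSeparator =
                char.GetUnicodeCategory(c) == UnicodeCategory.SpaceSeparator;
            sb.Append(isSpaceSeparator ? ' ' : c);
        }
        return sb.ToString();
    }
}
```

After src = SpaceNormalizer.NormalizeSpaces(src);, a search such as src.IndexOf("CLE Action", StringComparison.Ordinal) can find the text, because both strings now use the same space character.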
