How to get exact UTF8 string in C#

Problem description

I have a UTF-8 string ('Sarandë').
I want to get it exactly as it is, but I only get this string: 'SarandÃ«'.
I tried using Encoding, following some examples from the Internet, but they did not work.
Please help.

Details: I want to read the string value ('Sarandë') from a website and add it to an SQLite database.
My field type is NVARCHAR(100).
But the value that gets inserted is 'SarandÃ«'.
I converted it to byte[] and used Text.Encoding, but that did not work.
The page I want to get data from: http://www.infodriveindia.com/traderesources/port.aspx?&GridInfo=Ports010

Recommended answer

There is no such thing as a "UTF string" per se. A string in .NET is always a Unicode string, never anything else.

If you have a string and encode it as UTF (any UTF), the result is an array of bytes. Different encodings can give you different bytes from the same string. Unicode defines a one-to-one correspondence between characters (understood as cultural entities, abstracted from their graphical glyphs and other details) and integers called "code points" (understood in their abstract mathematical sense, without any concern for how they are represented by computers). UTFs define how code points are represented as bytes.
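
To make the distinction between code points and bytes concrete, here is a minimal sketch (not part of the original answer) that encodes the same string with three UTFs. The variable names are illustrative, and the 'ë' is written as an escape so the result does not depend on how the source file itself is saved.

```csharp
using System;
using System.Text;

// One string, one sequence of code points.
string name = "Sarand\u00EB"; // "Sarandë"

// The last character is the single code point U+00EB (decimal 235).
int codePoint = char.ConvertToUtf32(name, name.Length - 1);
Console.WriteLine($"U+{codePoint:X4}"); // prints U+00EB

// Different UTFs turn the same code points into different bytes.
byte[] utf8 = Encoding.UTF8.GetBytes(name);     // 8 bytes: 'ë' becomes C3 AB
byte[] utf16 = Encoding.Unicode.GetBytes(name); // 14 bytes: UTF-16LE, 2 bytes per character here
byte[] utf32 = Encoding.UTF32.GetBytes(name);   // 28 bytes: 4 bytes per code point

Console.WriteLine(BitConverter.ToString(utf8)); // prints 53-61-72-61-6E-64-C3-AB
Console.WriteLine($"{utf8.Length} / {utf16.Length} / {utf32.Length} bytes");
```

The string never changes; only the byte representation chosen at encoding time does.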

Now, .NET's internal representation is UTF-16LE, but the API is fully abstracted from this detail. In other words, I would put it this way: "any program that relies on an assumption about a particular representation of the string inside a System.String object is incorrect".

You need to use System.Text.Encoding. Perhaps you are using it incorrectly. All you need is an understanding of what Unicode is. I hope my explanation helps you sort out your problem.
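
As a rough illustration of using System.Text.Encoding in the direction the question needs (bytes to string), the sketch below decodes a hand-written byte array that stands in for data downloaded from a UTF-8 page. The byte array and variable names are assumptions for illustration; the real page and the way it was fetched are not shown in the question.

```csharp
using System;
using System.Text;

// Hypothetical raw bytes, as they might arrive from a UTF-8 encoded web page.
byte[] raw = { 0x53, 0x61, 0x72, 0x61, 0x6E, 0x64, 0xC3, 0xAB };

// Decode with the encoding the bytes were actually written in.
string decoded = Encoding.UTF8.GetString(raw);

Console.WriteLine(decoded.Length);            // prints 7: seven characters from eight bytes
Console.WriteLine(decoded == "Sarand\u00EB"); // prints True
```

Once the text is a correctly decoded string, it can normally be handed to the database layer as a parameter value without any further byte-level conversion.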



I think you wanted to see something like "SarandÃ«". But why would you need it?
This happens if you have this word as UTF-8 data, save it as a plain text file without a BOM and then mistakenly open it as ANSI/ASCII text.

As the letter 'ë' is represented in UTF-8 by two bytes (0xC3 followed by 0xAB), you can view those two bytes in an ANSI, ASCII or non-standard "extended ASCII" representation. All of this:
1) makes no sense;
2) is not always possible.

But you can do it by calling Encoding.UTF8.GetBytes on a string (which I think you have): http://msdn.microsoft.com/en-us/library/ds4kkd55%28v=vs.110%29.aspx.

And then interpret each byte the way you want, as ASCII or anything else. But why? :-)
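
The GetBytes-then-reinterpret step described above can be sketched as follows. ISO-8859-1 is used here as a stand-in for "ANSI"; that choice is an assumption, since the actual code page depends on the machine (Windows-1252 would give the same two characters for these particular bytes). The last line shows the reverse trip, which is often what is needed to repair text that was decoded with the wrong encoding.

```csharp
using System;
using System.Text;

string original = "Sarand\u00EB"; // "Sarandë"

// Stand-in for an "ANSI" code page (an assumption for this sketch).
Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

// Encode as UTF-8, then deliberately read the bytes back with the wrong encoding.
byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
string garbled = latin1.GetString(utf8Bytes); // 0xC3 -> 'Ã', 0xAB -> '«'

Console.WriteLine(garbled == "Sarand\u00C3\u00AB"); // prints True
Console.WriteLine(garbled.Length);                  // prints 8: one character per byte

// Reverse the mistake: recover the original bytes, then decode them as UTF-8.
string repaired = Encoding.UTF8.GetString(latin1.GetBytes(garbled));
Console.WriteLine(repaired == original);            // prints True
```

The repair works here only because ISO-8859-1 maps every byte value to a character and back without loss; with other code pages some bytes may not survive the round trip, which matches the "not always possible" caveat above.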

See also: http://www.unicode.org/faq/utf_bom.html.

—SA


See http://www.w3schools.com/jsref/jsref_decodeuri.asp.


Your next question was recently removed automatically because of some abuse reports. I'll answer it again here.

In that question, you asked about HTML representation using HTML entities. One of the answers explained the API for entities. I added an explanation of the background: in my Solution 1 (on this page) I explained what Unicode standardizes and what a "code point" is.

Now, an HTML character entity has nothing to do with UTFs. Instead of encoding a code point as bytes (UTF-8, for example, uses a variable number of bytes per character, following an intricate algorithm which you don't have to know), an HTML character entity encodes the code point number itself, in this case &#235;. If you run CharMap.EXE ("Character Map", the application bundled with all versions of Windows) and select code point U+00EB (decimal 235), you will see the character 'ë', "Latin Small Letter E With Diaeresis".
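
A small sketch of the entity-to-code-point relationship, using System.Net.WebUtility.HtmlDecode from the .NET base class library. The input strings are made up for illustration; the answer referred to above only says that an API for entities exists and does not name one.

```csharp
using System;
using System.Net;

// A numeric character entity carries the code point number itself (decimal 235).
string fromDecimal = WebUtility.HtmlDecode("Sarand&#235;");

// The same code point written as a hexadecimal entity.
string fromHex = WebUtility.HtmlDecode("Sarand&#xEB;");

// Read the number back from the decoded character.
int codePoint = fromDecimal[fromDecimal.Length - 1];

Console.WriteLine(fromDecimal == "Sarand\u00EB"); // prints True
Console.WriteLine(fromHex == fromDecimal);        // prints True
Console.WriteLine($"{codePoint} = U+{codePoint:X4}"); // prints 235 = U+00EB
```

No byte encoding is involved at any point: the entity names the code point directly, and the decoder produces the corresponding character.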

I hope this explains things.

Let's see: I explained the basics of how Unicode works, how UTFs work and how HTML works with character entities. You need to put it all together in your mind, and probably read up on the topic, maybe starting from http://www.Unicode.org.

You need to reach some understanding first, instead of trying to solve a largely imaginary problem.

—SA

