Is this a valid UTF-8 character in this XML file?

Question

I have received some XML from an upstream data source.

I'm not sure whether these weird characters are valid UTF-8, or whether the upstream source has messed things up, i.e. bad data in => bad data out.

I'm guessing the following is what was passed down:

Value in XML file | Unicode Value | UTF-8 Value  | English Description
-----------------------------------------------------------------------------
’                 | U+2019        | \xe2\x80\x99 | RIGHT SINGLE QUOTATION MARK
•                 | U+2022        | \xe2\x80\xa2 | BULLET
&                 | -not unicode- | --           | Ampersand, HTML encoded

I feel like the \ at the start of the UTF-8 value is sort of... encoded, but done wrong?

Can someone please explain what I'm looking at, so I know how to decode it correctly? What's also frustrating is that I feel like this could be a mix of encodings, which would make things awful :(

Reference: http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string-literal

Recommended answer

It's not a matter of UTF-8 in the XML you receive, because character escapes of the form &#xXX; encode characters by code point, so there's no question of what the encoding is. [Actually, it could be this, in that whatever is producing the XML may have been written by someone who doesn't understand how XML escapes are meant to work. After all, once something is buggy, there's no point assuming it does anything correctly until proven otherwise.]
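To illustrate why the escapes are independent of the file's byte encoding, here is a minimal Python sketch. The element name `v` and the helper `parse_text` are made up for the example; the point is only that a numeric character reference names a Unicode code point, so a correctly escaped right single quotation mark would be written `&#x2019;` and would parse to the same character whichever encoding the document declares.

```python
import xml.etree.ElementTree as ET

def parse_text(encoding: str) -> str:
    """Parse a tiny document (hypothetical element <v>) stored in `encoding`."""
    xml = f'<?xml version="1.0" encoding="{encoding}"?><v>&#x2019;</v>'
    # The document is pure ASCII, so it can be serialized in either encoding.
    return ET.fromstring(xml.encode(encoding)).text

# The character reference resolves to U+2019 regardless of the declared encoding.
for enc in ("utf-8", "iso-8859-1"):
    print(enc, hex(ord(parse_text(enc))))
```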

It does look like something along the way has treated some perfectly good UTF-8 as if it were a different encoding, then decided to escape the results. Some of the characters you are getting as a result ('U+0080' and 'U+0099') are characters that are allowed in XML but strongly discouraged. Others ('â' and '¢') are perfectly sensible characters (though produced in non-sensible ways), which makes the decision to escape them nearly as strange as whatever mistake led to their being there.
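As a rough sketch of how that kind of damage could arise, the Python below encodes a character as UTF-8, misreads the bytes as ISO Latin 1, and then escapes every non-ASCII result. The function name `mojibake_escape` and the choice of Latin 1 as the wrong encoding are assumptions made for illustration; the upstream system may have done something different.

```python
def mojibake_escape(text: str, wrong_encoding: str = "latin-1") -> str:
    """Simulate the suspected bug: UTF-8 bytes misread, then escaped."""
    raw = text.encode("utf-8")            # correct UTF-8 bytes
    misread = raw.decode(wrong_encoding)  # each byte becomes one character
    # Escape everything outside ASCII as a hexadecimal character reference.
    return "".join(c if ord(c) < 0x80 else f"&#x{ord(c):X};" for c in misread)

print(mojibake_escape("\u2019"))  # right single quotation mark
print(mojibake_escape("\u2022"))  # bullet
```

Under these assumptions the right single quotation mark comes out as `&#xE2;&#x80;&#x99;`, i.e. 'â' followed by U+0080 and U+0099, and the bullet ends in `&#xA2;`, i.e. '¢', which matches the characters mentioned above.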

Whatever the source of the mojibake, you're getting mojibake, so if you can complain or report a bug upstream, do so and have it fixed at source rather than trying to fix something that is broken.

Otherwise you're going to have to try to unescape the characters, encode them as if they were in whatever format the producer thought they were (I'd guess ISO Latin 1, but there are other possibilities), and then decode the resulting bytes as UTF-8. There's no promise that this won't do just as much damage to a correct bit of the document as it undoes in the buggy bit, though.
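The unescape, re-encode, decode round trip could be sketched in Python roughly as follows. The helper names `unescape_charrefs` and `repair` are hypothetical, ISO Latin 1 is only the guess made above, and falling back to the unrepaired text when the round trip fails is one possible policy rather than a guaranteed-safe one.

```python
import re

# Matches XML numeric character references: &#xHH; (hex) or &#DD; (decimal).
_CHARREF = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);")

def unescape_charrefs(s: str) -> str:
    """Replace numeric character references with the code points they name."""
    def sub(m: re.Match) -> str:
        if m.group(1) is not None:
            return chr(int(m.group(1), 16))
        return chr(int(m.group(2)))
    return _CHARREF.sub(sub, s)

def repair(escaped: str, wrong_encoding: str = "latin-1") -> str:
    """Undo the suspected mojibake; leave text alone if the round trip fails."""
    text = unescape_charrefs(escaped)
    try:
        # Turn the misread characters back into bytes, then read them as UTF-8.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Probably not mojibake (or a different wrong encoding): keep as is.
        return text

print(repair("&#xE2;&#x80;&#x99;"))
print(repair("caf&#xE9;"))
```

If an XML parser has already resolved the character references, only the `encode`/`decode` step is needed. Note that a string which is valid under both readings (for example a genuine 'Ã' followed by '©') would still be rewritten, which is exactly the risk described above.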
