How do I convert special UTF-8 chars to their iso-8859-1 equivalent using javascript?

Question

I'm making a javascript app which retrieves .json files with jquery and injects the data into the webpage it is embedded in.

The .json files are encoded with UTF-8 and contain accented chars like é, ö and å.

The problem is that I don't control the charset on the pages that are going to use the app.

Some will be using UTF-8, but others will be using the iso-8859-1 charset. This will of course garble the special chars from the .json files.

How do I convert special UTF-8 chars to their iso-8859-1 equivalent using javascript?

Recommended answer

Everything is typically stored internally as some form of Unicode, but let's not go into that. I'm assuming you're getting the iconic "Ã¥Ã¤Ã¶"-type strings (a garbled "åäö") because the page uses ISO-8859-1 as its character encoding. There's a trick you can use to convert those characters. The escape and unescape functions, used for encoding and decoding query strings, are defined for ISO-8859-1 characters, whereas the newer encodeURIComponent and decodeURIComponent, which do the same job, are defined for UTF-8.

escape encodes extended ISO-8859-1 characters (Unicode code points U+0080-U+00FF) as %xx (two hex digits), and code points U+0100 and above as %uxxxx (%u followed by four hex digits). For example, escape("å") == "%E5" and escape("あ") == "%u3042".

encodeURIComponent percent-encodes extended characters as a UTF-8 byte sequence. For example, encodeURIComponent("å") == "%C3%A5" and encodeURIComponent("あ") == "%E3%81%82".
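The difference between the two encoders can be checked directly. The snippet below is a small sketch that reproduces the examples above; the variable names are illustrative only, and the characters are written as \u escapes so the script itself does not depend on the charset of the hosting page.

```javascript
// Illustrative names; \u escapes keep the source independent of the page charset.
const aRing = "\u00e5";     // å, inside the ISO-8859-1 range
const hiraganaA = "\u3042"; // あ, outside the ISO-8859-1 range

// escape(): %xx for values up to U+00FF, %uxxxx above that.
const escapedLatin = escape(aRing);     // "%E5"
const escapedWide = escape(hiraganaA);  // "%u3042"

// encodeURIComponent(): one %xx group per UTF-8 byte.
const utf8Latin = encodeURIComponent(aRing);     // "%C3%A5"
const utf8Wide = encodeURIComponent(hiraganaA);  // "%E3%81%82"

console.log(escapedLatin, escapedWide, utf8Latin, utf8Wide);
```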

So you can do this:

fixedstring = decodeURIComponent(escape(utfstring));

For example, an incorrectly decoded character "å" shows up as "Ã¥". The command first computes escape("Ã¥") == "%C3%A5", which is the two incorrect ISO characters, each encoded as a single byte. Then decodeURIComponent("%C3%A5") == "å", where the two percent-encoded bytes are interpreted as one UTF-8 sequence.
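To make the two stages visible, here is a minimal sketch that runs them separately on the "Ã¥" example. The helper name fixMojibake is hypothetical (it is not part of the original answer); its body is just the one-liner shown earlier.

```javascript
// Hypothetical helper name; the body is the one-liner from the answer.
function fixMojibake(garbledText) {
  return decodeURIComponent(escape(garbledText));
}

// "Ã¥": the UTF-8 bytes C3 A5 of "å", each misread as one ISO-8859-1 character.
const garbled = "\u00c3\u00a5";

// Stage 1: each misread character becomes one percent-encoded byte.
const percentEncoded = escape(garbled); // "%C3%A5"

// Stage 2: the two bytes are decoded together as a single UTF-8 sequence.
const repaired = decodeURIComponent(percentEncoded); // "\u00e5", i.e. å

console.log(percentEncoded, repaired === fixMojibake(garbled));
```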

If you need to do the reverse for some reason, that works too:

utfstring = unescape(encodeURIComponent(originalstring));
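As a rough illustration of what the reverse direction produces, the sketch below (with illustrative variable names) shows that each character of the result holds one UTF-8 byte value, and that applying the earlier fix returns the original string.

```javascript
const originalText = "\u00e5"; // å

// Reverse direction: one character per UTF-8 byte of the input.
const byteString = unescape(encodeURIComponent(originalText)); // "\u00c3\u00a5"

// Inspect the byte values carried by each character.
const byteValues = Array.from(byteString, (ch) => ch.charCodeAt(0)); // [0xC3, 0xA5]

// Applying the forward fix undoes the reverse conversion.
const roundTripped = decodeURIComponent(escape(byteString));

console.log(byteValues, roundTripped === originalText);
```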

Is there a way to tell mis-decoded UTF-8 strings apart from genuine ISO strings? It turns out there is. The decodeURIComponent function used above throws an error if given a malformed encoded sequence. We can use this to detect, with high probability, whether our string is UTF-8 or ISO.

var fixedstring;

try{
    // If the string is UTF-8, this will work and not throw an error.
    fixedstring=decodeURIComponent(escape(badstring));
}catch(e){
    // If it isn't, an error will be thrown, and we can assume that we have an ISO string.
    fixedstring=badstring;
}
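Since escape and unescape are marked as deprecated legacy functions, the same detect-and-repair idea can also be sketched with TextDecoder. This is an alternative technique, not the method from the answer: the function name repairIfUtf8 is hypothetical, and the sketch assumes an environment that provides TextDecoder (modern browsers and recent Node.js).

```javascript
// Hypothetical helper (not from the original answer): repairs a string whose
// UTF-8 bytes were read as ISO-8859-1, otherwise returns the input unchanged.
function repairIfUtf8(text) {
  // A code unit above 0xFF cannot be a single misread byte, so leave the string alone.
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0xff) return text;
  }
  // Treat each character as one raw byte.
  const bytes = Uint8Array.from(text, (ch) => ch.charCodeAt(0));
  try {
    // fatal: true makes decode() throw on malformed UTF-8;
    // ignoreBOM: true keeps a leading byte order mark instead of dropping it.
    const decoder = new TextDecoder("utf-8", { fatal: true, ignoreBOM: true });
    return decoder.decode(bytes);
  } catch (e) {
    // Not valid UTF-8, so assume it was a genuine ISO-8859-1 string.
    return text;
  }
}

console.log(repairIfUtf8("\u00c3\u00a5") === "\u00e5"); // garbled input gets repaired
console.log(repairIfUtf8("\u00e5") === "\u00e5");       // genuine ISO input is kept
```

As with the try/catch version, the detection is probabilistic: a genuine ISO-8859-1 string that happens to form a valid UTF-8 byte sequence would still be converted.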
