Convert ISO/Windows charsets to UTF-8 in Javascript


Question


I'm developing a Firefox plugin that fetches web pages to do some analysis for the user. The problem is that when I fetch (via XMLHttpRequest) pages that are not UTF-8 encoded, the string I see is messed up. For example, Hebrew pages with windows-1255 or Chinese pages with gb2312.

I already tried the following:

var uDecoder=Components.classes["@mozilla.org/intl/scriptableunicodeconverter"].getService(Components.interfaces.nsIScriptableUnicodeConverter);
uDecoder.charset="windows-1255";
alert( xhr.responseText );

var decoder=Components.classes["@mozilla.org/intl/utf8converterservice;1"].getService(Components.interfaces.nsIUTF8ConverterService);

alert(decoder.convertStringToUTF8(xhr.responseText,"WINDOWS-1255",true)); 

I also tried escape/unescape/encodeURIComponent.

Any ideas?

Solution

Once XMLHttpRequest has tried to decode a non-UTF-8 string using UTF-8, you've already lost. The byte sequences in the page that weren't valid UTF-8 sequences will have been mangled (typically converted to �, the U+FFFD replacement character). No amount of re-encoding/decoding will get them back.
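
To make the loss concrete, here is a small sketch. It assumes a runtime with the standard TextDecoder API (modern browsers, Node.js), which is not something the question itself uses, and the sample bytes are only illustrative.

```javascript
// Two different Hebrew words encoded as windows-1255 bytes (illustrative data).
const shalom = new Uint8Array([0xF9, 0xEC, 0xE5, 0xED]); // שלום
const other = new Uint8Array([0xF9, 0xEC, 0xE5, 0xEE]);  // last letter differs

// Decode them the wrong way, as UTF-8. In the default non-fatal mode every
// invalid sequence is replaced with U+FFFD instead of throwing.
const utf8 = new TextDecoder('utf-8');
const mangledShalom = utf8.decode(shalom);
const mangledOther = utf8.decode(other);

// Both inputs collapse to the same run of replacement characters, so the
// original bytes cannot be recovered from the decoded string.
console.log(mangledShalom === mangledOther);                   // true
console.log([...mangledShalom].every(ch => ch === '\uFFFD')); // true
```

Since two different byte sequences end up as the same string, nothing applied afterwards (escape, encodeURIComponent, a converter service) can tell them apart again.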

Pages that specify a Content-Type: text/html;charset=something HTTP header should be OK. Pages that don't have a real HTTP header but do have a <meta> version of it won't be, because XMLHttpRequest doesn't know about parsing HTML so it won't see the meta. If you know in advance the charset you want, you can tell XMLHttpRequest and it'll use it:

xhr.open(...);
xhr.overrideMimeType('text/html;charset=gb2312');
xhr.send();

(When this was written, overrideMimeType was a non-standardised Mozilla extension; it has since been added to the XMLHttpRequest Standard.)

If you don't know the charset in advance, you can request the page once, scan the start of the response for a <meta> charset declaration, parse that out and request again with the new charset.
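
The first pass of that two-request approach might be sketched as below. The function name `sniffMetaCharset`, the 4096-character window and the regular expression are all assumptions made for illustration; this is a rough heuristic, not the HTML specification's encoding-sniffing algorithm.

```javascript
// Hypothetical helper: find a charset declared in a <meta> tag.
// This can work on mis-decoded text because the tag itself is plain ASCII
// in ASCII-compatible encodings such as windows-1255 or gb2312.
function sniffMetaCharset(html) {
  // Only look at the start of the document, where <meta> normally lives.
  const head = html.slice(0, 4096);
  // Matches both <meta charset="x"> and
  // <meta http-equiv="Content-Type" content="text/html; charset=x">.
  const match = /<meta\b[^>]*?\bcharset\s*=\s*["']?\s*([\w.:-]+)/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

const page = '<html><head><meta http-equiv="Content-Type" ' +
  'content="text/html; charset=windows-1255"></head></html>';
console.log(sniffMetaCharset(page));                 // windows-1255
console.log(sniffMetaCharset('<p>no meta here</p>')); // null
```

The returned label could then be passed to xhr.overrideMimeType('text/html;charset=' + label) on the second request.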

In theory you could get a binary response in a single request:

xhr.overrideMimeType('text/html;charset=iso-8859-1');

and then convert that from bytes-as-chars to UTF-8. However, iso-8859-1 wouldn't work for this because the browser interprets that charset as really being Windows code page 1252.

You could maybe use another codepage that maps every byte to a character, and do a load of tedious character replacements to map every character in that codepage to the character it would have been in real-ISO-8859-1, then do the conversion. Most encodings don't map every byte, but Arabic (cp1256) might be a candidate for this?
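
One charset that does map every byte to a character is x-user-defined, which the Encoding Standard defines so that bytes 0x00-0x7F stay as they are and bytes 0x80-0xFF become U+F780-U+F7FF. This is a different technique from the cp1256 idea above, offered only as a sketch: it assumes the response was fetched with xhr.overrideMimeType('text/plain; charset=x-user-defined') and that a TextDecoder supporting the target charset is available. The helper names are hypothetical.

```javascript
// Hypothetical helper: recover raw bytes from a responseText that was
// fetched with the x-user-defined charset. The low 8 bits of each code
// unit are the original byte (0xF780 & 0xFF === 0x80, and so on).
function bytesFromUserDefined(text) {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    bytes[i] = text.charCodeAt(i) & 0xFF;
  }
  return bytes;
}

// Hypothetical helper: decode the recovered bytes with the real charset.
// Assumes the runtime's TextDecoder supports the given label.
function decodeAs(text, charset) {
  return new TextDecoder(charset).decode(bytesFromUserDefined(text));
}

// Simulated responseText for the windows-1255 bytes F9 EC E5 ED.
const fakeResponse = '\uF7F9\uF7EC\uF7E5\uF7ED';
console.log(Array.from(bytesFromUserDefined(fakeResponse)).join(' ')); // 249 236 229 237
```

With the bytes back in hand, the charset can be chosen after the fact (for example from a <meta> tag), so a single request is enough.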
