How to parse non-UTF8 XML in browsers with JavaScript?


Question

I have an XML string encoded in Big5, available as base64:

atob('PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iYmlnNSIgPz48dGl0bGU+pKSk5TwvdGl0bGU+')

(Written out in UTF-8, the document is <?xml version="1.0" encoding="big5" ?><title>中文</title>.)

I'd like to extract the content of <title>. How can I do that with pure JavaScript in browsers? A lightweight solution without jQuery or Emscripten would be best.

I tried DOMParser:

(new DOMParser()).parseFromString(atob('PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iYmlnNSIgPz48dGl0bGU+pKSk5TwvdGl0bGU+'), 'text/xml')

But neither Chromium nor Firefox respects the encoding attribute. Does the standard say that DOMParser supports UTF-8 only?

Recommended answer

I suspect the issue isn't DOMParser, but atob, which can't properly decode what was originally a non-ASCII string.*

You will need another way to get at the original bytes, such as https://github.com/danguer/blog-examples/blob/master/js/base64-binary.js:

var encoded = 'PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iYmlnNSIgPz48dGl0bGU+pKSk5TwvdGl0bGU+';
var bytes = Base64Binary.decode(encoded);

You then need some way to convert the bytes (i.e. decode the Big5 data) into a JavaScript string. For Firefox / Chrome, you can use TextDecoder:

var decoder = new TextDecoder('big5'); 
var decoded = decoder.decode(bytes);
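As a small illustration of the TextDecoder step on its own, here is a minimal sketch that decodes just the four Big5 bytes found between <title> and </title> in the sample. The variable names are made up for this example, and it assumes the runtime accepts the 'big5' label (current Firefox and Chromium do; a stripped-down build might not). The fatal option is optional, but it helps surface malformed input early.

```javascript
// Assumption: the runtime supports the 'big5' label in TextDecoder.
// The four bytes below are the Big5 encoding of the two characters in the sample title.
const big5Bytes = new Uint8Array([0xA4, 0xA4, 0xA4, 0xE5]);

// fatal: true asks decode() to throw a TypeError on malformed input
// instead of silently inserting U+FFFD replacement characters.
const strictDecoder = new TextDecoder('big5', { fatal: true });
const titleText = strictDecoder.decode(big5Bytes);

console.log(titleText); // 中文
```

Without fatal: true, a wrong label or truncated data would still produce a string, just one containing replacement characters, which can be harder to notice.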

Then pass the decoded string to DOMParser:

var dom = (new DOMParser()).parseFromString(decoded, 'text/xml');
var title = dom.children[0].textContent;

You can see this in action at https://plnkr.co/edit/TBspXlF2vNbNaKK8UxhW?p=preview
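For reference, the steps in this answer can be combined into one sketch that needs no external library. This is not the code from the linked demo: sniffXmlEncoding, decodeXmlBase64 and extractTitle are hypothetical helper names, and reading the label from the XML declaration is an addition that only works for ASCII-compatible encodings such as Big5 (not UTF-16). It also assumes atob, TextDecoder and, for the last step, DOMParser are available.

```javascript
// Hypothetical helper: read the encoding label from the XML declaration.
// The declaration is plain ASCII in ASCII-compatible encodings, so a
// byte-per-character view of the first bytes is enough to find it.
function sniffXmlEncoding(bytes, fallback = 'utf-8') {
  const head = String.fromCharCode(...bytes.subarray(0, 200));
  const match =
    /^<\?xml[^>]*\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/.exec(head);
  return match ? match[1] : fallback;
}

// Hypothetical helper: base64 -> raw bytes -> JavaScript string.
function decodeXmlBase64(b64) {
  // atob yields one character per byte, so charCodeAt recovers the byte values.
  const bytes = Uint8Array.from(atob(b64), (ch) => ch.charCodeAt(0));
  const label = sniffXmlEncoding(bytes);
  return new TextDecoder(label, { fatal: true }).decode(bytes);
}

// Hypothetical helper: parse the decoded text and return the root element's text.
// Browsers report XML errors by returning a document containing <parsererror>.
function extractTitle(xmlText) {
  const doc = new DOMParser().parseFromString(xmlText, 'text/xml');
  if (doc.getElementsByTagName('parsererror').length > 0) {
    throw new Error('XML parse error');
  }
  return doc.documentElement.textContent;
}

const sample =
  'PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iYmlnNSIgPz48dGl0bGU+pKSk5TwvdGl0bGU+';
const sampleXml = decodeXmlBase64(sample);
console.log(sampleXml);

// DOMParser only exists in browsers, so guard the final step.
if (typeof DOMParser !== 'undefined') {
  console.log(extractTitle(sampleXml)); // 中文
}
```

Using documentElement instead of children[0] is a design choice here: it names the root element directly, which in this sample is <title>.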

*One way of understanding why: atob doesn't take the encoding of the original string as a parameter. It must internally decode the base64 data to bytes, but to hand back a JavaScript string (which I believe is internally encoded as UTF-16) it has to settle on some mapping from bytes to characters. In practice it maps each byte to the character with the same code (0 to 255), so the two-byte Big5 sequences are never combined into the intended characters.
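To make the footnote concrete, this short sketch inspects what atob returns for the sample. The variable names are illustrative only. It shows that the characters between <title> and </title> carry the raw Big5 byte values rather than 中文, and that the bytes can be recovered from that string without an extra library.

```javascript
const b64 =
  'PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iYmlnNSIgPz48dGl0bGU+pKSk5TwvdGl0bGU+';

// atob returns a "binary string": one character per decoded byte.
const binaryString = atob(b64);

// Positions 45..48 sit between <title> and </title> in this sample.
const rawTitle = binaryString.slice(45, 49);
const hexCodes = Array.from(rawTitle, (ch) => ch.charCodeAt(0).toString(16));
console.log(hexCodes.join(' ')); // a4 a4 a4 e5

// Because each character is one byte, the original bytes can be rebuilt
// and then handed to TextDecoder('big5').
const recovered = Uint8Array.from(binaryString, (ch) => ch.charCodeAt(0));
console.log(recovered.length); // 57
```

Passing binaryString straight to DOMParser therefore yields a title made of the characters U+00A4 and U+00E5, which is the garbled result described in the question.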
