Java XMLReader not clearing multi-byte UTF-8 encoded attributes


Problem description

I've got a really strange situation where my SAX ContentHandler is being handed bad Attributes by XMLReader. The document being parsed is UTF-8 with multi-byte characters inside XML attributes. What appears to happen is that these attributes are being accumulated each time my handler is called. So rather than being passed in succession, they get concatenated onto the previous node's value.
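One way to narrow this down is to run the same kind of handler against a small in-memory UTF-8 document and collect the attribute values it receives. The sketch below is only a diagnostic aid: the class name `AttributeCheck`, the method `titles`, and the sample titles are hypothetical and not part of the original code. A document this short may well parse correctly; if it does, that suggests the behaviour depends on the document's size or source rather than on multi-byte characters alone.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not from the original post.
public class AttributeCheck {

    /** Returns the "title" attribute of every <p> element, in document order. */
    public static List<String> titles(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                if ("p".equals(qName)) {
                    // Copy the value immediately: the Attributes object is only
                    // guaranteed to be valid for the duration of this callback.
                    seen.add(attributes.getValue("title"));
                }
            }
        });
        // Feed raw UTF-8 bytes so the parser performs the decoding itself.
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        reader.parse(new InputSource(new ByteArrayInputStream(utf8)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // The \u escapes are CJK characters, each three bytes long in UTF-8.
        String xml = "<pages>"
                + "<p title=\"\u4e00\"/>"
                + "<p title=\"\u4e8c\u4e09\"/>"
                + "<p title=\"\u56db\"/>"
                + "</pages>";
        for (String title : titles(xml)) {
            System.out.println(title);
        }
    }
}
```

If each printed title matches exactly one attribute from the input, the handler logic itself is not concatenating anything, and the next step would be to feed progressively larger documents through `titles`.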

Here is an example which demonstrates this using public data (Wikipedia).

public class MyContentHandler extends org.xml.sax.helpers.DefaultHandler {

    public static void main(String[] args) {
        try {
            org.xml.sax.XMLReader reader = org.xml.sax.helpers.XMLReaderFactory.createXMLReader();
            reader.setContentHandler(new MyContentHandler());
            reader.parse("http://en.wikipedia.org/w/api.php?format=xml&action=query&list=allpages&apfilterredir=redirects&apdir=descending");

        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    public void startElement(String uri, String localName, String qName, org.xml.sax.Attributes attributes) {
        if ("p".equals(qName)) {
            String title = attributes.getValue("title");
            System.out.println(title);
        }
    }
}

Update: This complete example produces (apologies to any Cantonese speakers for the vulgar output):
