SAX parsing and encoding

Question

I have a contact who is having trouble with SAX when parsing RSS and Atom files. According to him, text coming from the Item elements appears to be truncated at an apostrophe or, sometimes, at an accented character. There also seems to be a problem with encoding.

I've given SAX a try and I see some truncation too, but I haven't been able to dig further. I'd appreciate suggestions from anyone who has tackled this before.

This is the code that's being used in the ContentHandler:

public void characters(char[] ch, int start, int length) throws SAXException {
    // The third argument is a length, not an end index.
    link = new String(ch, start, length);
}

Edit: The encoding problem might be due to storing information in a byte array, as I know Java works in Unicode.

Recommended answer

The characters() method is not guaranteed to give you the complete character content of a text element in a single call: the full text may span buffer boundaries, so the parser may invoke it several times for one element. You need to buffer the characters yourself between the start and end element events.

For example:

StringBuilder builder;

public void startElement(String uri, String localName, String qName, Attributes atts) {
    // Start a fresh buffer for each element.
    builder = new StringBuilder();
}

public void characters(char[] ch, int start, int length) {
    // May be called several times per element, so append each chunk.
    builder.append(ch, start, length);
}

public void endElement(String uri, String localName, String qName) {
    // Only here is the text collected for the element complete.
    String theFullText = builder.toString();
}
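
The fragment above can be expanded into a complete handler. What follows is only a minimal sketch, not the asker's actual code: the class name TitleCollector, the parseTitles helper, the choice of the title element and the sample feed are all hypothetical. It buffers text only while inside a title element, so nested elements do not reset the buffer, and it hands the parser raw bytes rather than a String or Reader, so the parser can take the encoding from the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of every <title> element.
class TitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private StringBuilder builder; // non-null only while inside <title>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            builder = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one element's text in several chunks.
        if (builder != null) {
            builder.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName) && builder != null) {
            titles.add(builder.toString());
            builder = null;
        }
    }

    List<String> getTitles() {
        return titles;
    }

    // Assumption: the feed is available as raw bytes, so the parser
    // decodes it using the encoding named in the XML declaration.
    static List<String> parseTitles(byte[] xml) throws Exception {
        TitleCollector handler = new TitleCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new ByteArrayInputStream(xml)), handler);
        return handler.getTitles();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample feed with apostrophes, accents and an entity.
        String feed = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<rss><channel><item><title>L'\u00e9t\u00e9 &amp; l'hiver</title></item></channel></rss>";
        for (String title : parseTitles(feed.getBytes(StandardCharsets.UTF_8))) {
            System.out.println(title);
        }
    }
}
```

Entity references such as &amp; are a common place for a parser to split the text into separate characters() calls, which would be consistent with the truncation described in the question. If the feed is decoded to a String before parsing, the charset used for that step has to match the feed's actual encoding; passing the bytes through unchanged avoids that step.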
