How do I get the correct starting/ending locations of an XML tag with SAX?

Problem description

SAX provides a Locator that keeps track of the current location. However, when I call it in my startElement(), it always returns the ending location of the XML tag.

How can I get the starting location of the tag? Is there a graceful way to solve this problem?

Recommended answer

Unfortunately, the Locator interface in the org.xml.sax package of the Java standard library does not, by definition, provide more detailed information about the position in the document. To quote the documentation of its getColumnNumber method:

The return value from the method is intended only as an approximation for the sake of diagnostics; it is not intended to provide sufficient information to edit the character content of the original XML document. For example, when lines contain combining character sequences, wide characters, surrogate pairs, or bi-directional text, the value may not correspond to the column in a text editor's display.

According to that specification, the SAX driver reports, on a best-effort basis, the position "of the first character after the text associated with the document event". So the short answer to the first part of your question is: no, the Locator does not provide the start location of a tag. Also, if your documents contain multi-byte characters, e.g. Chinese or Japanese text, the position you get from the SAX driver is probably not what you want.
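The behaviour described above can be observed with a small handler that records what the Locator reports inside startElement(). This is only a minimal sketch: the class name LocatorDemo, the helper locate(), and the sample document are made up for illustration, and the exact column values depend on the SAX driver in use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

public class LocatorDemo {
    /** Records "qName@line:column" for every start tag, exactly as the Locator reports it. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> positions = new ArrayList<>();
        private Locator locator; // may stay null: drivers are not required to supply one

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (locator == null) {
                positions.add(qName + "@unknown");
                return;
            }
            // Per the Locator contract this is the position *after* the start tag.
            positions.add(qName + "@" + locator.getLineNumber() + ":" + locator.getColumnNumber());
        }
    }

    /** Hypothetical helper: parses the given XML string and returns the recorded positions. */
    static List<String> locate(String xml) throws Exception {
        RecordingHandler handler = new RecordingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.positions;
    }

    public static void main(String[] args) throws Exception {
        // The reported columns point behind the closing '>' of each tag, not at its '<'.
        System.out.println(locate("<root>\n  <child id=\"1\"/>\n</root>"));
    }
}
```

Running this against a document of your own and comparing the printed columns with the tag positions in a text editor shows how far the reported values are from the start of each tag.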

If you are after exact positions for tags, or want even more fine-grained information about attributes, attribute content and so on, you would have to implement your own location provider.

With all the potential encoding issues, Unicode characters and so on involved, I guess this is too big a project to post here. The implementation would also depend on your specific requirements.

Just a quick warning from personal experience: writing a wrapper around the InputStream you pass to the SAX parser is dangerous, because you cannot tell how far the parser has already read from the stream at the moment it reports an event.

You could start by doing some counting of your own in the characters(char[], int, int) method of your ContentHandler, checking for line breaks, tabs and so on, in addition to using the Locator information. Together these should give you a better picture of where in the document you actually are. By remembering the position of the previous event you can calculate the start position of the current one. Keep in mind, though, that you will not see all line breaks this way: some can appear inside tags, which never reach characters, but you can deduce those from the Locator information.
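The counting idea above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not a complete location provider: the class name TagStartTracker and the helper track() are invented here, and the estimate ignores everything the ContentHandler never sees or sees only in expanded form (XML declaration, comments, processing instructions, entity references, CDATA markers), so it drifts whenever such constructs sit between two tags.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

public class TagStartTracker extends DefaultHandler {
    /** One entry per start tag: "qName start=line:col end=line:col". */
    public final List<String> tags = new ArrayList<>();

    private Locator locator;
    // Our own estimate of where the text of the next event begins.
    private int line = 1;
    private int column = 1;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Advance the estimate over character data; parsers deliver line ends as '\n'.
        for (int i = start; i < start + length; i++) {
            if (ch[i] == '\n') {
                line++;
                column = 1;
            } else {
                column++;
            }
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String start = line + ":" + column; // estimated position of the tag's '<'
        resync();                           // the Locator now points behind the tag
        tags.add(qName + " start=" + start + " end=" + line + ":" + column);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        resync(); // skip over the end tag, including line breaks inside it
    }

    /** Re-aligns the estimate with the Locator, which reports the end of the current event. */
    private void resync() {
        if (locator != null) {
            line = locator.getLineNumber();
            column = locator.getColumnNumber();
        }
    }

    /** Hypothetical helper: parses an XML string and returns the recorded tag positions. */
    public static List<String> track(String xml) throws Exception {
        TagStartTracker tracker = new TagStartTracker();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), tracker);
        return tracker.tags;
    }

    public static void main(String[] args) throws Exception {
        track("<root>\n  <child id=\"1\"/>\n</root>").forEach(System.out::println);
    }
}
```

The design choice here is to trust the Locator only for the end of tag events and to bridge the gap up to the next tag by counting characters. Anything more exact, such as attribute positions, would still need a custom location provider as described above.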
