Strategy for parsing LOTS and LOTS of not-so-well formed SGML / XML documents


Question

I have thousands of SGML documents, some well-formed, some not so well-formed. I need to get at certain ELEMENTS in the documents, but every time I try to load them into an XDocument or XmlDocument, or even just read them with a StreamReader, I get various XmlException errors.

Things like "'[' is an unexpected token." Why? Because I have a document with a DOCTYPE like

<!DOCTYPE RChapter PUBLIC "-//LSC//DTD R Chapter for Authoring//EN" [] >

and I have learned that the "[]" needs to have something valid inside. Again, I don't control the creation of the documents, but I DO have to "crack" them and get at the data I want. Another example is an "unclosed" ELEMENT, for example:

<Caption>Plants, and facilities<hardhyphen><hyphen>Inspection.</Caption>

The XmlException here is "The 'hyphen' start tag on line 27 does not match the end tag of 'Caption'. Line 27, position 58." Obvious, right?

But then the question is how to actually get at certain ELEMENTS in these documents without running into XmlExceptions. Is a SAX parser the right way? I basically want to open the document, go right to the element I want (without worrying about what might or might not be well-formed nearby), pull the data, and move on. Should I just forget parsing with XmlDocument or XDocument and do simple string replacements like

str.Replace("<hardhyphen><hyphen>", "-")

and then try to load the result into one of the XML parsers? Any tips on strategies?

Recommended answer

The issue is that you're trying to parse SGML with an XML tool. They're not the same thing. If you want to use an XML tool or language to access the data, you will probably need to convert the SGML to XML before trying to parse it.

Ideally you'd either use a language/tool that supports SGML (like OmniMark) or something that can handle "XML-like" data (like nokogiri from the first answer?).
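As a rough illustration of the "something that can handle XML-like data" option, here is a minimal sketch using Python's standard-library html.parser. This is a swapped-in technique, not an SGML parser and not one of the tools named above: it is a tolerant tag-soup parser that knows nothing about your DTD, it lowercases tag names, and the helper names (ElementTextExtractor, extract_text) are made up for this example. It simply collects the text found between the start and end tags of one element and ignores everything else, including the DOCTYPE and the unclosed elements.

```python
from html.parser import HTMLParser


class ElementTextExtractor(HTMLParser):
    """Collects the text inside every occurrence of one element (hypothetical helper)."""

    def __init__(self, wanted):
        super().__init__(convert_charrefs=True)
        self.wanted = wanted.lower()  # HTMLParser reports tag names in lowercase
        self.depth = 0                # >0 while we are inside the wanted element
        self.parts = []               # text fragments of the current occurrence
        self.results = []             # one string per completed occurrence

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.wanted and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_text(document, element_name):
    parser = ElementTextExtractor(element_name)
    parser.feed(document)
    parser.close()  # flush any buffered trailing text
    return parser.results


sample = (
    '<!DOCTYPE RChapter PUBLIC "-//LSC//DTD R Chapter for Authoring//EN" [] >\n'
    "<RChapter>\n"
    "<Caption>Plants, and facilities<hardhyphen><hyphen>Inspection.</Caption>\n"
    "</RChapter>\n"
)
print(extract_text(sample, "Caption"))
```

Note that the unclosed hardhyphen and hyphen elements contribute no text here; if they stand for a character (as the "-" replacement in the question suggests), that mapping would have to be added in handle_starttag.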

The conversion can be pretty straightforward, but it can get tricky at some points, especially if you're dealing with multiple doctypes (DTDs). (Also, there's no such thing as "well-formed" SGML. Yes, the elements etc. have to be nested correctly, but SGML has to have a DTD.)

Here are some differences between SGML and XML that you'd need to handle. (You may not want to go this route, but it may still be useful background.)

  1. DOCTYPE declaration

The DOCTYPE declaration in your example is a perfectly valid SGML doctype. The [] (internal subset) doesn't have to have anything in it. If you do have declarations in the internal subset (usually entity declarations), you will more than likely have to keep a doctype declaration in the XML.

The issue the XML parser is having is that you don't have a system identifier in the declaration. In an XML doctype declaration, the system identifier is required if there is a public identifier. In an SGML doctype declaration, it's not required.

Bottom line: unless you need the XML to validate against a DTD/schema or you have declarations in the internal subset, strip the doctype declaration. If the XML does have to be valid, you'll at least need to add a system identifier. Don't forget to add the <?xml ...?> line at the top (strictly speaking an XML declaration rather than a processing instruction).
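The "strip the doctype" case can be sketched as a small text preprocessing step. This is only an illustration under stated assumptions: the function name sgml_prolog_to_xml is made up, the regular expression assumes that any internal subset contains no "]" character, and the input is assumed not to carry an XML declaration already. Python's standard xml.etree.ElementTree is used only to confirm that the result parses.

```python
import re
import xml.etree.ElementTree as ET

# Matches a DOCTYPE declaration, with an optional internal subset in [...].
# Assumption: the internal subset itself contains no "]" character.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*(?:\[[^\]]*\])?\s*>", re.IGNORECASE)


def sgml_prolog_to_xml(text):
    """Drop the (first) DOCTYPE declaration and prepend an XML declaration."""
    body = DOCTYPE_RE.sub("", text, count=1).lstrip()
    return '<?xml version="1.0"?>\n' + body


sample = (
    '<!DOCTYPE RChapter PUBLIC "-//LSC//DTD R Chapter for Authoring//EN" [] >\n'
    "<RChapter><Caption>Inspection.</Caption></RChapter>"
)
converted = sgml_prolog_to_xml(sample)
print(converted)
print(ET.fromstring(converted).tag)
```

If the internal subset does contain entity declarations you still need, stripping is not an option; in that case the declaration would have to be kept and given a system identifier instead.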

  2. Elements that don't have end tags

The <hardhyphen> and <hyphen> elements are valid SGML. SGML DTDs allow you to specify tag minimization, which means you can specify whether or not an end tag is required. (You can also make the start tag optional, but that's crazy talk.) In XML you have to close these elements (like <hardhyphen/> or <hardhyphen></hardhyphen>).

The best thing to do is to look at your SGML DTD and see which elements have optional end tags. The tag minimization is specified right after the element name in the element declaration. A '-' means the tag is required. An 'o' (letter 'oh') means that the tag is optional. For example, if you see <!ELEMENT hyphen - o (#PCDATA)>, this means that the start tag is required (-) and the end tag is optional (o). If you see <!ELEMENT hyphen - - (#PCDATA)>, both the start and the end tags are required.
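Scanning a DTD for those minimization flags can be sketched roughly as follows. This is a simplified illustration, not a DTD parser: the function name is made up, the regular expression assumes the two minimization characters are always present, that name groups such as (a|b) are written without spaces, and that no parameter entities are involved. The hardhyphen/softhyphen declaration in the sample is invented for the example.

```python
import re

# <!ELEMENT name start-min end-min ...>, where each flag is "-" or "o"/"O".
ELEMENT_DECL_RE = re.compile(
    r"<!ELEMENT\s+(\S+)\s+([-oO])\s+([-oO])\s", re.IGNORECASE
)


def elements_with_optional_end_tag(dtd_text):
    """Return element names whose end tag is declared optional ("o")."""
    names = []
    for match in ELEMENT_DECL_RE.finditer(dtd_text):
        name_field, _start_min, end_min = match.groups()
        if end_min.lower() == "o":
            # One declaration may name several elements: (a|b|c)
            names.extend(n.strip() for n in name_field.strip("()").split("|"))
    return names


sample_dtd = """
<!ELEMENT chapter - - (section)+ +(revst|revend)>
<!ELEMENT hyphen - o (#PCDATA)>
<!ELEMENT (hardhyphen|softhyphen) - O EMPTY>
"""
print(elements_with_optional_end_tag(sample_dtd))
```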

Bottom line: properly close all of the elements that don't have end tags.
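For elements that never have content, the closing step can be sketched as a regex rewrite to the self-closing form. This is a minimal sketch under assumptions: close_empty_elements is a made-up name, the caller supplies the list of element names known from the DTD to be empty, and attribute values are assumed not to contain ">" or "<". Elements with an optional end tag that do carry content need real knowledge of the content model and are not handled here.

```python
import re
import xml.etree.ElementTree as ET


def close_empty_elements(text, empty_names):
    """Rewrite <name ...> as <name .../> for the given element names."""
    alternation = "|".join(re.escape(name) for name in empty_names)
    # (?<!/) skips tags that are already self-closed, so the rewrite is repeatable.
    pattern = re.compile(r"<(" + alternation + r")(\s[^<>]*?)?(?<!/)>")
    return pattern.sub(
        lambda m: "<" + m.group(1) + (m.group(2) or "") + "/>", text
    )


sample = "<Caption>Plants, and facilities<hardhyphen><hyphen>Inspection.</Caption>"
fixed = close_empty_elements(sample, ["hardhyphen", "hyphen"])
print(fixed)

# The result is now well-formed XML, so an XML parser accepts it.
caption = ET.fromstring(fixed)
print([child.tag for child in caption])
```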

  3. Processing instructions

Processing instructions (PIs) in SGML are closed with just >, without the second ? that XML requires. You'll need to add the second ?.

Example SGML PI: <?asdf jkl>

Example XML PI: <?asdf jkl?>
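Adding the missing ? can be sketched with a single substitution. This is an illustration only: the function name is made up, and the pattern assumes the PI content itself contains no ">" character. PIs that already end in ?> are left alone.

```python
import re

# "<?" followed by anything up to the first ">" that is not preceded by "?".
SGML_PI_RE = re.compile(r"<\?([^>]*?)(?<!\?)>")


def close_processing_instructions(text):
    """Turn SGML-style <?target data> into XML-style <?target data?>."""
    return SGML_PI_RE.sub(r"<?\1?>", text)


print(close_processing_instructions("<?asdf jkl>"))
```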

  4. Inclusions/exclusions

You probably won't have to worry about this, but in an SGML DTD you can specify in an element declaration that another element is allowed anywhere inside of that element (or not allowed). This can be a pain if your target XML needs to validate against a DTD; XML DTDs do not allow inclusions/exclusions.

This is what an inclusion might look like:

<!ELEMENT chapter - - (section)+ +(revst|revend)>

This says that revst or revend is allowed anywhere inside of chapter. If the element declaration had -(revst|revend) instead, it would mean that revst or revend is not allowed anywhere inside of chapter.
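If you do end up converting the DTD, it may help to first list which declarations use inclusions or exclusions, since those are the ones that need manual rework. The sketch below only reports them; it does not rewrite content models. The function name is made up, the patterns assume both minimization flags are present and that no parameter entities are used, and the footnote declaration in the sample is invented for the example.

```python
import re

# Captures the element name and everything after the two minimization flags.
ELEMENT_DECL_RE = re.compile(
    r"<!ELEMENT\s+(\S+)\s+[-oO]\s+[-oO]\s+([^>]*)>", re.IGNORECASE
)
# An inclusion is +(...) and an exclusion is -(...), preceded by whitespace.
EXCEPTION_RE = re.compile(r"\s([+-])\(([^()]*)\)")


def find_inclusions_and_exclusions(dtd_text):
    """Return (element, 'inclusion' | 'exclusion', [names]) tuples."""
    found = []
    for decl in ELEMENT_DECL_RE.finditer(dtd_text):
        name, body = decl.groups()
        for sign, group in EXCEPTION_RE.findall(body):
            kind = "inclusion" if sign == "+" else "exclusion"
            members = [member.strip() for member in group.split("|")]
            found.append((name, kind, members))
    return found


sample_dtd = """
<!ELEMENT chapter - - (section)+ +(revst|revend)>
<!ELEMENT footnote - - (para)+ -(footnote)>
<!ELEMENT hyphen - o (#PCDATA)>
"""
for entry in find_inclusions_and_exclusions(sample_dtd):
    print(entry)
```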

Hope this helps.
