Parse SGML with Open Arbitrary Tags in Python 3

Question

I am trying to parse a file such as: http://www.sec.gov/Archives/edgar/data/1409896/000118143112051484/0001181431-12-051484.hdr.sgml

I am using Python 3 and have been unable to find an existing library that can parse an SGML file with unclosed tags. SGML allows tags to be closed implicitly. When I parse the example file with lxml, XML, or Beautiful Soup, the implicitly closed tags end up being closed at the end of the file instead of at the end of the line.

For example:

<COMPANY>Awesome Corp
<FORM> 24-7
<ADDRESS>
<STREET>101 PARSNIP LN
<ZIP>31337
</ADDRESS>

This ends up being interpreted as:

<COMPANY>Awesome Corp
<FORM> 24-7
<ADDRESS>
<STREET>101 PARSNIP LN
<ZIP>31337
</ADDRESS>
</ZIP>
</STREET>
</FORM>
</COMPANY>

However, I need it to be interpreted as:

<COMPANY>Awesome Corp</COMPANY>  
<FORM> 24-7</FORM>
<ADDRESS>
<STREET>101 PARSNIP LN</STREET>
<ZIP>31337</ZIP>
</ADDRESS>

If there is a non-default parser that I can pass to lxml or BS4 to handle this, I am missing it.

Recommended answer

If you can find an SGML DTD for the documents that you work with, a solution could be to use the osx SGML to XML converter from the OpenSP SGML toolkit to turn the documents into XML.

Here is a simple example. Let's say that we have the following SGML document (company.sgml; with a root element):

<!DOCTYPE ROOT SYSTEM "company.dtd">
<ROOT>
<COMPANY>Awesome Corp
<FORM> 24-7
<ADDRESS>
<STREET>101 PARSNIP LN
<ZIP>31337
</ADDRESS>

The DTD (company.dtd) looks like this:

<!ELEMENT ROOT       -  o (COMPANY, FORM, ADDRESS) >
<!ELEMENT COMPANY    -  o (#PCDATA) >
<!ELEMENT FORM       -  o (#PCDATA) >
<!ELEMENT ADDRESS    -  - (STREET, ZIP) >
<!ELEMENT STREET     -  o (#PCDATA) >
<!ELEMENT ZIP        -  o (#PCDATA) >

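In each declaration, the first flag applies to the start tag and the second to the end tag: - means the tag is required and o means it may be omitted. So the - o bit means that the end tag can be omitted.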
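
To make the flags concrete, here is a minimal Python sketch that reads the ELEMENT declarations from company.dtd and reports which elements may omit their end tag. The regular expression and the function name end_tag_omissible are illustrative assumptions; the pattern only handles simple declarations like the ones shown here and is not a general DTD parser.

```python
import re

# The declarations from company.dtd, copied from the example above.
DTD = """\
<!ELEMENT ROOT       -  o (COMPANY, FORM, ADDRESS) >
<!ELEMENT COMPANY    -  o (#PCDATA) >
<!ELEMENT FORM       -  o (#PCDATA) >
<!ELEMENT ADDRESS    -  - (STREET, ZIP) >
<!ELEMENT STREET     -  o (#PCDATA) >
<!ELEMENT ZIP        -  o (#PCDATA) >
"""

# Captures: element name, start-tag flag, end-tag flag.
# Assumption: one element name per declaration and both flags present.
DECLARATION = re.compile(r"<!ELEMENT\s+(\S+)\s+([-oO])\s+([-oO])\s")


def end_tag_omissible(dtd_text):
    """Map each declared element name to True if its end tag may be omitted."""
    return {
        name: end_flag.lower() == "o"
        for name, _start_flag, end_flag in DECLARATION.findall(dtd_text)
    }


flags = end_tag_omissible(DTD)
```

With the DTD above, only ADDRESS maps to False, which matches the example document, where </ADDRESS> is the only end tag written out.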

The SGML document can be parsed with osx, and the output can be formatted with xmllint, as follows:

osx company.sgml | xmllint --format -
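
Since the question is about Python 3, the conversion step can also be driven from a script. The following is a rough sketch, assuming that osx is installed and on the PATH and that it writes the XML to standard output, as in the command above. The function name sgml_to_xml and the osx_cmd parameter are hypothetical names chosen for illustration.

```python
import subprocess


def sgml_to_xml(sgml_path, osx_cmd="osx"):
    """Run the converter on sgml_path and return what it wrote to stdout (bytes).

    Assumption: the converter prints the XML to standard output. Diagnostics
    on standard error are only surfaced if no output was produced at all.
    """
    result = subprocess.run(
        [osx_cmd, sgml_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if not result.stdout:
        message = result.stderr.decode(errors="replace")
        raise RuntimeError(f"{osx_cmd} produced no output: {message}")
    return result.stdout
```

A call such as sgml_to_xml("company.sgml") would then return the XML as bytes, ready to hand to an XML parser. The sketch deliberately does not treat a non-zero exit status as fatal, on the assumption that the converter may report problems in a document and still emit usable output; tighten that check if you need strict validation.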

Output of the above command:

<?xml version="1.0"?>
<ROOT>
  <COMPANY>Awesome Corp</COMPANY>
  <FORM> 24-7</FORM>
  <ADDRESS>
    <STREET>101 PARSNIP LN</STREET>
    <ZIP>31337</ZIP>
  </ADDRESS>
</ROOT>

Now we have well-formed XML that can be processed with lxml or other XML tools.
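
As a small illustration of that last step, the sketch below reads the converted output with the standard library's xml.etree.ElementTree; lxml.etree offers the same fromstring and findtext calls. The XML is pasted in as a string so the example runs on its own, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# The converted document from above, embedded so the example is self-contained.
XML_OUTPUT = """\
<?xml version="1.0"?>
<ROOT>
  <COMPANY>Awesome Corp</COMPANY>
  <FORM> 24-7</FORM>
  <ADDRESS>
    <STREET>101 PARSNIP LN</STREET>
    <ZIP>31337</ZIP>
  </ADDRESS>
</ROOT>
"""

root = ET.fromstring(XML_OUTPUT)

# Each element now has explicit boundaries, so simple path lookups work.
company = root.findtext("COMPANY")
form = root.findtext("FORM").strip()  # the source text has a leading space
street = root.findtext("ADDRESS/STREET")
zip_code = root.findtext("ADDRESS/ZIP")

print(company, form, street, zip_code)
```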

I don't know if there is a complete DTD for the document that you link to. The following PDF file contains related information about EDGAR, including a DTD that might be useful: http://www.sec.gov/info/edgar/pdsdissemspec910.pdf (I found it via this answer). But the linked SGML document contains elements (SEC-HEADER, for example) that are not mentioned in the PDF file.
