Lenient XML Python parser: resolving overlapping XML tags

Question

I am looking for an error-tolerant ("lenient", in BeautifulSoup's terminology) Python parser for "bad" XML input. The problem is overlapping tags. An example input is:

<trn>choya - <i><b>a cholla cactus </i> lat. <i>Cylindropuntia</b></trn></i>

What I would like to get is an XML-compliant result such as this (the good result I am hoping for):

<trn>choya - <b><i>a cholla cactus </i> lat. <i>Cylindropuntia</i></b></trn>

BeautifulSoup with html.parser or html5lib gives me something else (a bad result that I do not want):

<trn>choya - <i><b>a cholla cactus </b></i> lat. <i>Cylindropuntia</i></trn>

Pay attention to the sequence of the <i> and <b> tags. If <i> is rendered as italic and <b> as bold, the good answer is

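choya - a cholla cactus lat. Cylindropuntia (everything after "choya - " in bold, with "a cholla cactus" and "Cylindropuntia" also in italics)

and the bad answer is

choya - a cholla cactus lat. Cylindropuntia ("a cholla cactus" in bold italics, "lat." unstyled, and "Cylindropuntia" in italics only)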

I also tried the old tidyhtml but could not get the required result, and I could not find a Python interface for the newer tidy-html5. Can you please help me to either

  • find a parser that can do this job, or
  • if there is none, suggest an algorithm or any source of knowledge related to such algorithms?

Thanks!

Recommended answer

html.parser.HTMLParser is good at parsing tag soup, and the SAX XMLGenerator class has a convenient API for generating XML from events.
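To make the two building blocks concrete, here is a minimal sketch of each on its own. The EventLogger class and the tiny input string are made up for illustration; only HTMLParser and XMLGenerator come from the standard library.

```python
import io
from html.parser import HTMLParser
from xml.sax.saxutils import XMLGenerator


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser callbacks instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))


# HTMLParser reports overlapping tags in document order without complaining.
logger = EventLogger()
logger.feed('<i><b>x</i></b>')
logger.close()
print(logger.events)

# XMLGenerator writes exactly the events it receives and escapes text for us;
# keeping start/end calls balanced is the caller's responsibility.
out = io.StringIO()
gen = XMLGenerator(out)
gen.startElement('i', {})
gen.characters('a < b')
gen.endElement('i')
gen.endDocument()
print(out.getvalue())
```

The parser side never rejects the input, and the generator side never repairs it, so any fixing of the nesting has to happen between the two.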

Not all of the pieces are implemented here, in particular the "rigidness"/"weight" constraints for the tags: right now a mismatched closing tag is simply replaced by the closing tag that is expected at that point, which keeps the nesting correct. Still, the basic idea seems to work.

For the sample input used in the code below, the output is

<trn>choya - <i><com>a cholla cactus </com> lat. <i>Cylindropuntia</i></i> native to US</trn>

which is valid XML, nesting-wise.
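One way to check the nesting claim is to hand both strings to a strict XML parser. The sketch below uses xml.etree.ElementTree for that; the is_well_formed helper is a name invented here, and the two strings are the output above and the overlapping input used by the answer's code.

```python
import xml.etree.ElementTree as ET

fixed = ('<trn>choya - <i><com>a cholla cactus </com> lat. '
         '<i>Cylindropuntia</i></i> native to US</trn>')
overlapping = ('<trn>choya - <i><com>a cholla cactus </i> lat. '
               '<i>Cylindropuntia</com></trn> native to US</i>')


def is_well_formed(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(fixed))        # True
print(is_well_formed(overlapping))  # False
```

This only confirms well-formedness. It says nothing about whether the repaired nesting preserves the formatting the author intended.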

Good luck!

import html.parser
import io
from xml.sax.saxutils import XMLGenerator


class Reconstructor(html.parser.HTMLParser):

    def __init__(self):
        super().__init__()
        self.op_stream = []
        self.tag_stack = []

    def handle_startendtag(self, tag, attrs):
        self.op_stream.append(('startendtag', (tag, attrs)))

    def handle_starttag(self, tag, attrs):
        self.op_stream.append(('starttag', (tag, attrs)))
        self.tag_stack.append(tag)

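    def handle_endtag(self, tag):
        if not self.tag_stack:
            # A closing tag with nothing open would raise IndexError below; drop it.
            print('ignoring unmatched closing <{}>'.format(tag))
            return
        expected_tag = self.tag_stack[-1]
        if tag != expected_tag:
            print('mismatch closing <{}>, expected <{}>'.format(tag, expected_tag))
            # TODO: implement logic to figure out the correct order for the tags here
            #       and reorder tag_stack accordingly.
        # Always close the innermost open tag so the generated XML nests correctly.
        stack_tag = self.tag_stack.pop()
        self.op_stream.append(('endtag', (stack_tag, tag)))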

    def handle_charref(self, name):
        self.op_stream.append(('charref', (name,)))

    def handle_entityref(self, name):
        self.op_stream.append(('entityref', (name,)))

    def handle_data(self, data):
        self.op_stream.append(('data', (data,)))

    def handle_comment(self, data):
        self.op_stream.append(('comment', (data,)))

    def handle_decl(self, decl):
        self.op_stream.append(('decl', (decl,)))

    def handle_pi(self, data):
        self.op_stream.append(('pi', (data,)))

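    def generate_xml(self):
        # Replay the recorded events through XMLGenerator into an in-memory buffer.
        stream = io.StringIO()
        xg = XMLGenerator(stream, encoding='utf-8')
        for op, args in self.op_stream:
            if op in ('startendtag', 'starttag'):
                tag, attrib = args
                xg.startElement(tag, dict(attrib))
                if op == 'startendtag':
                    xg.endElement(tag)
            elif op == 'endtag':
                # args is (tag actually closed, tag seen in the input); emit the former.
                tag = args[0]
                xg.endElement(tag)
            elif op == 'data':
                xg.characters(args[0])
            else:
                raise NotImplementedError('Operator not implemented: %s' % op)
        # Close any tags the input left open so the result stays well-formed.
        for tag in reversed(self.tag_stack):
            xg.endElement(tag)
        xg.endDocument()
        return stream.getvalue()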


xr = Reconstructor()
xr.feed('<trn>choya - <i><com>a cholla cactus </i> lat. <i>Cylindropuntia</com></trn> native to US</i>')
y = xr.generate_xml()
print(y)
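The TODO in handle_endtag can be filled in several ways. The sketch below shows one possibility, not the answer author's method: when a closing tag arrives out of order, close every element opened after it and then reopen those elements, loosely in the spirit of how HTML5 parsers treat misnested formatting elements. OverlapFixer and its method names are invented for this example. It assumes stray closing tags can be dropped, and it ignores comments, declarations, and any text outside the root element.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class OverlapFixer(HTMLParser):
    """Hypothetical sketch: repair overlapping tags by closing and reopening."""

    def __init__(self):
        super().__init__()
        self.out = []    # pieces of the repaired document
        self.stack = []  # currently open elements as (tag, attrs) pairs

    def _open(self, tag, attrs):
        # Valueless attributes arrive as None; write them as empty strings.
        attr_text = ''.join(' {}={}'.format(k, quoteattr(v or '')) for k, v in attrs)
        self.out.append('<{}{}>'.format(tag, attr_text))
        self.stack.append((tag, attrs))

    def _close(self):
        tag, attrs = self.stack.pop()
        self.out.append('</{}>'.format(tag))
        return tag, attrs

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs)
        self._close()

    def handle_data(self, data):
        self.out.append(escape(data))

    def handle_endtag(self, tag):
        if tag not in [open_tag for open_tag, _ in self.stack]:
            return  # stray closing tag: nothing to close, so drop it
        reopen = []
        while True:
            closed = self._close()
            if closed[0] == tag:
                break
            reopen.append(closed)
        # Reopen the interrupted elements, but never outside the root element.
        if self.stack:
            for open_tag, attrs in reversed(reopen):
                self._open(open_tag, attrs)

    def result(self):
        self.close()       # flush any text the parser is still buffering
        while self.stack:  # close anything the input left open
            self._close()
        return ''.join(self.out)


fixer = OverlapFixer()
fixer.feed('<trn>choya - <i><b>a cholla cactus </i> lat. <i>Cylindropuntia</b></trn></i>')
repaired = fixer.result()
print(repaired)
```

For the question's input this prints <trn>choya - <i><b>a cholla cactus </b></i><b> lat. <i>Cylindropuntia</i></b><i></i></trn>. That is not the exact string the question asks for, and it leaves an empty <i></i> behind, but it is well-formed and keeps " lat. " inside <b>, so it renders the same way as the desired result.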
