Lenient XML Python parser: resolve overlapping XML tags
Question
I am looking for a mistake-friendly (lenient, in BeautifulSoup's terminology) Python parser for "bad" XML input. The problem is overlapping tags. An example input is:
<trn>choya - <i><b>a cholla cactus </i> lat. <i>Cylindropuntia</b></trn></i>
What I would like to get is an XML-compliant result such as this (the good result I want):
<trn>choya - <b><i>a cholla cactus </i> lat. <i>Cylindropuntia</i></b></trn>
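The difference between the two strings above can be checked with a strict parser. The following is a minimal sketch that assumes only the standard library's xml.etree.ElementTree; the helper name is_well_formed is made up for illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET

# The overlapping input and the desired result, copied from the question.
bad = '<trn>choya - <i><b>a cholla cactus </i> lat. <i>Cylindropuntia</b></trn></i>'
good = '<trn>choya - <b><i>a cholla cactus </i> lat. <i>Cylindropuntia</i></b></trn>'


def is_well_formed(text):
    """Hypothetical helper: True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(bad))   # False: </i> arrives while <b> is still open
print(is_well_formed(good))  # True
```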
BeautifulSoup with html.parser or html5lib gives me something else (a bad result I don't want):
<trn>choya - <i><b>a cholla cactus </b></i> lat. <i>Cylindropuntia</i></trn>
Pay attention to the order of the <i> and <b> tags. If I render <i> as italic and <b> as bold, the good answer is
choya - a cholla cactus lat. Cylindropuntia
(with "a cholla cactus" and "Cylindropuntia" in bold italic and "lat." in bold only), and the bad answer is
choya - a cholla cactus lat. Cylindropuntia
(with "a cholla cactus" in bold italic, "lat." unformatted, and "Cylindropuntia" in italic only).
I also tried the old tidyhtml but couldn't get the required result, and I could not find a Python interface for the new tidy-html5.
Can you help me, please, by either
- finding a parser that can do this job, or
- if there is none, suggesting an algorithm, or any source of knowledge related to such algorithms?
Thanks!
Recommended answer
html.parser.HTMLParser is good at parsing tag soup, and the SAX XMLGenerator class has a convenient API for generating XML from events.
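To make the two building blocks concrete, here is a small sketch of each used on its own. The EventLogger class and the short sample string are made up for illustration; only HTMLParser and XMLGenerator come from the standard library.

```python
import html.parser
import io
from xml.sax.saxutils import XMLGenerator


class EventLogger(html.parser.HTMLParser):
    """Hypothetical helper that records parser events instead of acting on them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))


# HTMLParser reports overlapping tags in document order without complaining.
logger = EventLogger()
logger.feed('<i><b>x</i>y</b>')
logger.close()
print(logger.events)

# XMLGenerator turns start/characters/end calls into escaped XML text.
out = io.StringIO()
xg = XMLGenerator(out, encoding='utf-8')
xg.startElement('b', {})
xg.characters('a < b')
xg.endElement('b')
print(out.getvalue())  # <b>a &lt; b</b>
```

The parser does no nesting checks of its own, so any repair logic has to sit between the two: it consumes the event list and decides which startElement and endElement calls to make.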
Not all of the bits are implemented in the code below, in particular the "rigidness"/"weight" constraints for the tags (right now, whenever a closing tag does not match, we simply close the tag we expected instead so that the nesting comes out correct), but the basic idea seems to work.
The output of the code below is
<trn>choya - <i><com>a cholla cactus </com> lat. <i>Cylindropuntia</i></i> native to US</trn>
which is valid XML, nesting-wise.
Good luck!
import html.parser
import io
from xml.sax.saxutils import XMLGenerator


class Reconstructor(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.op_stream = []  # recorded parser events, in document order
        self.tag_stack = []  # names of the tags that are currently open

    def handle_startendtag(self, tag, attrs):
        self.op_stream.append(('startendtag', (tag, attrs)))

    def handle_starttag(self, tag, attrs):
        self.op_stream.append(('starttag', (tag, attrs)))
        self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        if not self.tag_stack:
            # Stray closing tag with nothing left open: skip it.
            print('ignoring stray closing <{}>'.format(tag))
            return
        expected_tag = self.tag_stack[-1]
        if tag != expected_tag:
            print('mismatch closing <{}>, expected <{}>'.format(tag, expected_tag))
            # TODO: implement logic to figure out the correct order for the tags here
            # and reorder tag_stack accordingly.
        # Always close the innermost open tag, whatever the input asked for.
        stack_tag = self.tag_stack.pop()
        self.op_stream.append(('endtag', (stack_tag, tag)))

    def handle_charref(self, name):
        self.op_stream.append(('charref', (name,)))

    def handle_entityref(self, name):
        self.op_stream.append(('entityref', (name,)))

    def handle_data(self, data):
        self.op_stream.append(('data', (data,)))

    def handle_comment(self, data):
        self.op_stream.append(('comment', (data,)))

    def handle_decl(self, decl):
        self.op_stream.append(('decl', (decl,)))

    def handle_pi(self, data):
        self.op_stream.append(('pi', (data,)))

    def generate_xml(self):
        # Replay the recorded events through the SAX XMLGenerator.
        stream = io.StringIO()
        xg = XMLGenerator(stream, encoding='utf-8')
        for op, args in self.op_stream:
            if op in ('startendtag', 'starttag'):
                tag, attrib = args
                xg.startElement(tag, dict(attrib))
                if op == 'startendtag':
                    xg.endElement(tag)
            elif op == 'endtag':
                tag = args[0]  # the tag that was expected, not the one in the input
                xg.endElement(tag)
            elif op == 'data':
                xg.characters(args[0])
            else:
                raise NotImplementedError('Operator not implemented: %s' % op)
        xg.endDocument()
        return stream.getvalue()


xr = Reconstructor()
xr.feed('<trn>choya - <i><com>a cholla cactus </i> lat. <i>Cylindropuntia</com></trn> native to US</i>')
y = xr.generate_xml()
print(y)
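One possible way to fill in the TODO above is a close-and-reopen strategy: when a closing tag does not match the innermost open tag, close everything down to the matching element and reopen the inner elements afterwards, so that each piece of text keeps the formatting it had in the input. The following is only a sketch under stated assumptions, not the answer's own method: the names OverlapFixer and fix_overlap are made up, a single root element is assumed, text after the root is closed is dropped, and HTMLParser lowercases tag names.

```python
import html.parser
import io
from xml.sax.saxutils import XMLGenerator


class OverlapFixer(html.parser.HTMLParser):
    """Hypothetical sketch: repair overlapping tags by closing and reopening."""

    def __init__(self):
        super().__init__()
        self.out = io.StringIO()
        self.xg = XMLGenerator(self.out, encoding='utf-8')
        self.open_tags = []  # (tag, attrs) pairs currently open in the output
        self.pending = []    # elements to reopen before the next content

    def _flush_pending(self):
        # Reopen lazily so that no empty elements are emitted.
        for tag, attrs in self.pending:
            self.xg.startElement(tag, attrs)
            self.open_tags.append((tag, attrs))
        self.pending = []

    def handle_starttag(self, tag, attrs):
        self._flush_pending()
        attrs = dict(attrs)
        self.xg.startElement(tag, attrs)
        self.open_tags.append((tag, attrs))

    def handle_data(self, data):
        if not self.open_tags:
            return  # assumption: text outside the root element is dropped
        self._flush_pending()
        self.xg.characters(data)

    def handle_endtag(self, tag):
        # A closing tag for an element still waiting to be reopened cancels it.
        for i in range(len(self.pending) - 1, -1, -1):
            if self.pending[i][0] == tag:
                del self.pending[i]
                return
        names = [name for name, _ in self.open_tags]
        if tag not in names:
            return  # stray closing tag: ignore it
        idx = len(names) - 1 - names[::-1].index(tag)
        inner = self.open_tags[idx + 1:]
        # Close the inner elements first, then the requested one.
        for name, _ in reversed(self.open_tags[idx:]):
            self.xg.endElement(name)
        del self.open_tags[idx:]
        # Reopen the inner elements later, but never outside the root.
        self.pending = (inner + self.pending) if self.open_tags else []

    def result(self):
        self.close()  # let the parser flush any buffered text
        while self.open_tags:
            self.xg.endElement(self.open_tags.pop()[0])
        return self.out.getvalue()


def fix_overlap(text):
    fixer = OverlapFixer()
    fixer.feed(text)
    return fixer.result()


print(fix_overlap('<trn>choya - <i><b>a cholla cactus </i> lat. <i>Cylindropuntia</b></trn></i>'))
```

For the input from the question this prints <trn>choya - <i><b>a cholla cactus </b></i><b> lat. <i>Cylindropuntia</i></b></trn>. The tree differs from the one the question asks for, but the rendering is the same: "a cholla cactus" and "Cylindropuntia" are bold italic and "lat." is bold only.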