Python lxml changes tag hierarchy?
Problem description
I'm having a small issue with lxml. I'm converting an XML doc into an HTML doc. The original XML looks like this (it looks like HTML, but it's in the XML doc):
<p>Localization - Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>
When I do this (item is the string above)
lxml.html.tostring(lxml.html.fromstring(item))
I get this:
<div><p>Localization - Eiffel tower? Paris or Vegas </p><p>Bayes theorem p(A|B)</p></div>
I don't have any problem with the <div>, but the fact that the 'Bayes theorem' paragraph is no longer nested within the outer paragraph is a problem.
Anyone know why lxml is doing this and how to stop it? Thanks.
lxml is doing this because lxml.html parses the string as HTML and doesn't store invalid HTML, and <p> elements can't be nested in HTML:
The P element represents a paragraph. It cannot contain block-level elements (including P itself).
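If the nesting has to survive, one option is to skip the HTML parser and treat the fragment as XML, which it is in the source document. The following is a minimal sketch, not a definitive fix: it assumes the fragment is well-formed XML with a single root element, and the variable names (html_root, xml_root, nested) are only illustrative.

```python
from lxml import etree
import lxml.html

item = "<p>Localization - Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>"

# HTML parser: repairs the markup, so the inner <p> is moved out of the outer one.
html_root = lxml.html.fromstring(item)
print(lxml.html.tostring(html_root, encoding="unicode"))

# XML parser: applies no HTML content rules, so well-formed nesting is kept as written.
xml_root = etree.fromstring(item)
nested = etree.tostring(xml_root, encoding="unicode")
print(nested)
```

Note that this only preserves the structure inside lxml. If the serialized result is later read by an HTML parser (a browser, for example), the nested <p> will be split apart again, because an opening <p> implicitly closes any <p> that is still open. Renaming the outer element to something that may contain paragraphs, such as <div>, is probably the more robust choice for real HTML output.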