Python lxml changes tag hierarchy?


Problem description

I'm having a small issue with lxml. I'm converting an XML doc into an HTML doc. The original XML looks like this (it looks like HTML, but it's in the XML doc):

<p>Localization - Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>

When I do this (item is the string above):

lxml.html.tostring(lxml.html.fromstring(item))

I get this:

<div><p>Localization - Eiffel tower? Paris or Vegas </p><p>Bayes theorem p(A|B)</p></div>

I don't have any problem with the <div>, but the fact that the 'Bayes theorem' paragraph is no longer nested within the outer paragraph is a problem.

Anyone know why lxml is doing this and how to stop it? Thanks.

Solution

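lxml is doing this because its HTML parser repairs invalid HTML instead of storing it as written, and <p> elements can't be nested in HTML. When the parser meets a second <p> start tag, it closes the open paragraph first. As the HTML 4 specification puts it: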

The P element represents a paragraph. It cannot contain block-level elements (including P itself).
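If the nesting has to survive, one option is to avoid the HTML parser for this fragment. The following is a minimal sketch, assuming the fragment is well-formed XML (as it is in the question); the variable names are illustrative. It parses with lxml.etree, which applies no HTML rules, and the commented line shows the same tree serialized with method="html".

```python
from lxml import etree, html

item = "<p>Localization - Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>"

# HTML parser: applies HTML rules, so the inner <p> closes the outer one.
repaired_root = html.fromstring(item)

# XML parser: no HTML rules, so the inner <p> stays a child of the outer <p>.
# Assumption: the fragment is well-formed XML; otherwise this raises XMLSyntaxError.
xml_root = etree.fromstring(item)
nested_xml = etree.tostring(xml_root, encoding="unicode")

# The same tree can also be written with the HTML serializer, which emits
# the tree as it is without re-checking HTML nesting rules.
nested_html = etree.tostring(xml_root, method="html", encoding="unicode")

print(nested_xml)
```

Note that a nested <p> is still invalid HTML, so a browser or any other HTML parser that reads the output will split the paragraphs again. If the inner block needs to stay inside its parent in real HTML, renaming the outer element to something that may contain paragraphs, such as <div>, is probably the safer route.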
