lxml html5parser ignores "namespaceHTMLElements=False" option
The lxml html5parser seems to ignore any namespaceHTMLElements=False option I pass to it. It puts all elements I give it into the HTML namespace instead of the (expected) void namespace.
Here’s a simple case that reproduces the problem:
echo "<p>" | python -c "from sys import stdin; \
from lxml.html import html5parser as h5, tostring; \
print tostring(h5.parse(stdin, h5.HTMLParser(namespaceHTMLElements=False)))"
The output from that is this:
<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head></html:head><html:body><html:p>
</html:p></html:body></html:html>
As can be seen, the html element and all other elements in the output are in the HTML namespace.
The expected output is instead this:
<html><head></head><body><p>
</p></body></html>
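To make the practical difference between the two outputs concrete, here is a minimal sketch that uses only the standard-library xml.etree.ElementTree module (not lxml or html5lib) on the two output strings shown above. The variable names are made up for illustration. It shows that an unqualified path such as ".//p" matches nothing once the elements are in the HTML namespace, which is why the option matters for XPath-style lookups.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# The two serializations from the question, copied as string literals.
namespaced = ('<html:html xmlns:html="http://www.w3.org/1999/xhtml">'
              '<html:head></html:head><html:body><html:p>\n'
              '</html:p></html:body></html:html>')
plain = "<html><head></head><body><p>\n</p></body></html>"

ns_root = ET.fromstring(namespaced)
plain_root = ET.fromstring(plain)

# Namespaced elements carry the namespace URI inside their tag name.
print(ns_root.tag)
print(plain_root.tag)

# An unqualified path finds nothing in the namespaced tree ...
unqualified_in_ns = ns_root.find(".//p")
# ... unless the namespace is spelled out through a prefix mapping.
qualified_in_ns = ns_root.find(".//h:p", {"h": XHTML_NS})
# In the void-namespace tree the unqualified path works directly.
unqualified_in_plain = plain_root.find(".//p")

print(unqualified_in_ns is None)
print(qualified_in_ns is not None)
print(unqualified_in_plain is not None)
```

The same consideration applies to lxml's own xpath() method: with namespaced elements, every step of an expression needs a prefix bound to the XHTML namespace.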
I recognize that namespaceHTMLElements is an html5lib option, not a native lxml option that lxml itself acts on directly. lxml is supposed to just call html5lib and pass that option on in such a way that html5lib uses it as expected.
Update 2016-02-17
I still haven’t found a way to get the lxml html5parser to honor the namespaceHTMLElements option. But to be clear, the alternative is to just call html5lib directly, like this:
echo "<p>" | python -c "from sys import stdin; \
import html5lib; from lxml import html; \
doc = html5lib.parse(stdin, treebuilder='lxml', namespaceHTMLElements=False); \
print html.tostring(doc)"
More details
Some things I already know:
- html5lib fully conforms to the requirements of the HTML spec, including the requirement that the html element must be placed into the HTML namespace, which html5lib does.
- However, html5lib provides namespaceHTMLElements=False as an option to override that default "put the html element into the HTML namespace" behavior.
- When I use html5lib directly (not through lxml), and I pass namespaceHTMLElements=False to it, everything works as expected: the html element goes into the void namespace.

Hacking some printf into the html5lib sources, I observe that:
- lxml is actually calling html5lib with namespaceHTMLElements=False as expected
- but lxml seems to be calling into html5lib twice: first without namespaceHTMLElements, then a second time with namespaceHTMLElements=False
Conclusion about where the cause is to be found
Given the above, it’s clear that the problem is in the interface between lxml and html5lib. I’m not sure why lxml is calling into html5lib twice, but I think it may be because it first creates a new instance of its own XHTMLParser before doing what I’m actually asking it to do, which is just to create an instance of its own HTMLParser.
So maybe the fact that it makes two calls to html5lib causes html5lib to sort of "lock in" the default namespaceHTMLElements=True behavior that results from the first call, and then ignore the namespaceHTMLElements=False directive when it sees it in the second call.
Maybe in making two calls the way it does, lxml is either breaking some assumption in html5lib, or is actually misusing the html5lib API in a way it was not designed to be used.
Or maybe the cause isn’t at all the result of lxml making two separate calls to html5lib, but instead some other problem in the way it’s using the html5lib interface.
Anyway, I’m interested in hearing from others about whether anybody else has run into this problem and has a workaround, or at least has some insight into why it’s happening.
I followed, in the source code, how lxml hands parameters to html5lib. Most of the functions take a trailing **kws argument, which is then handed on to the next function. In one of the last steps, when the actual html5 parser is called, this is dropped and the parser is called with 2 fixed params.
(I had the same problem yesterday, only just found this question, and have already forgotten the finer details, so allow me to forgo code snippets and references.)
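For illustration only, the kind of parameter dropping described above might look like the following sketch. Every function and parameter name here is hypothetical. This is not lxml's or html5lib's actual code, just a small stand-in showing how an option accepted by the outer layers can fail to reach the innermost call.

```python
# Hypothetical names throughout; this is not lxml's or html5lib's actual code.

def core_parser(source, tree_name, namespace_elements=True):
    """Stand-in for the innermost parser; namespaces elements by default."""
    prefix = "html:" if namespace_elements else ""
    return "<{0}p>{1}</{0}p>".format(prefix, source)


def middle_layer(source, **kws):
    # The pattern described above: **kws is accepted here but not forwarded,
    # so the core parser only ever sees two fixed arguments.
    return core_parser(source, "lxml")


def middle_layer_fixed(source, **kws):
    # Forwarding **kws lets the caller's option reach the core parser.
    return core_parser(source, "lxml", **kws)


def public_parse(source, layer=middle_layer, **kws):
    # Outer wrapper: hands everything on to the next function in the chain.
    return layer(source, **kws)


# The option is silently ignored when the middle layer drops **kws.
print(public_parse("x", namespace_elements=False))
# It takes effect once the keyword arguments are forwarded.
print(public_parse("x", layer=middle_layer_fixed, namespace_elements=False))
```

In a chain like this no error is raised anywhere, which would match the symptom in the question: the option is accepted but has no visible effect.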
Anyway, this confirms that in 2018, calling html5lib directly is still the preferred way, if calling lxml's own parser is not an option for some reason.
(My use-case was: parse crappy html and have xpath.)