lxml html5parser ignores "namespaceHTMLElements=False" option


Problem description

The lxml html5parser seems to ignore any namespaceHTMLElements=False option I pass to it. It puts all elements I give it into the HTML namespace instead of the (expected) void namespace.

Here’s a simple case that reproduces the problem:

echo "<p>" | python -c "from sys import stdin; \
  from lxml.html import html5parser as h5, tostring; \
  print(tostring(h5.parse(stdin, h5.HTMLParser(namespaceHTMLElements=False)), encoding='unicode'))"

The output from that is this:

<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head></html:head><html:body><html:p>
</html:p></html:body></html:html>

As can be seen, the html element and all the other elements are in the HTML namespace.

The expected output is instead this:

<html><head></head><body><p>
</p></body></html>

I recognize that namespaceHTMLElements is an html5lib option, not a native lxml option that lxml acts on itself. lxml is supposed to simply pass that option through to html5lib so that html5lib applies it as expected.


Update 2016-02-17

I still haven’t found a way to get the lxml html5parser to honor namespaceHTMLElements. But to be clear, the alternative is to call html5lib directly, like this:

echo "<p>" | python -c "from sys import stdin; \
import html5lib; from lxml import html; \
doc = html5lib.parse(stdin, treebuilder='lxml', namespaceHTMLElements=False); \
print(html.tostring(doc, encoding='unicode'))"
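
If switching parsers is not practical, another option (not from the original post) is to strip the namespace from the tags after parsing. The following is a minimal sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for the parsed tree, and strip_namespace is a hypothetical helper name. lxml elements expose the same .iter() and .tag interface, so the same loop would likely apply there, although serialization details may differ.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def strip_namespace(root, namespace=XHTML_NS):
    """Rewrite '{namespace}tag' to 'tag' in place for every element under root."""
    prefix = "{" + namespace + "}"
    for element in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(element.tag, str) and element.tag.startswith(prefix):
            element.tag = element.tag[len(prefix):]
    return root


# Stand-in for a parsed document whose elements ended up in the HTML namespace.
namespaced = ET.fromstring(
    '<html:html xmlns:html="http://www.w3.org/1999/xhtml">'
    "<html:head></html:head><html:body><html:p></html:p></html:body>"
    "</html:html>"
)

stripped = strip_namespace(namespaced)
result = ET.tostring(stripped, encoding="unicode", short_empty_elements=False)
print(result)
```

This only renames tags; it does not touch attributes, which would need the same treatment if they were namespaced.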


More details

Some things I already know:

  • html5lib fully conforms to the requirements of the HTML spec, including the requirement that the html element must be placed into the HTML namespace, which html5lib does by default.
  • However, html5lib provides namespaceHTMLElements=False as an option to override that default "put the html element into the HTML namespace" behavior.
  • When I use html5lib directly (not through lxml) and pass namespaceHTMLElements=False to it, everything works as expected: the html element goes into the void namespace.
  • After adding some print statements to the html5lib sources, I observed that:

    • lxml is actually calling html5lib with namespaceHTMLElements=False, as expected
    • but lxml seems to be calling into html5lib twice: first without namespaceHTMLElements, then a second time with namespaceHTMLElements=False

Conclusion about where the cause is to be found

Given the above, it’s clear that the problem is in the interface between lxml and html5lib. I’m not sure why lxml is calling into html5lib twice, but I think it may be because it first creates a new instance of its own XHTMLParser before doing what I’m actually asking it to do, which is just to create an instance of its own HTMLParser.

So maybe the fact that it does make two calls to html5lib causes html5lib to sort of "lock in" the default namespaceHTMLElements=True behavior that results from the first call, and then ignore the namespaceHTMLElements=False directive when it sees it in the second call.

Maybe by making two calls the way it does, lxml is either breaking some assumption in html5lib or misusing the html5lib API in a way it was not designed to be used.

Or maybe the cause isn’t at all the result of lxml making two separate calls to html5lib, but instead some other problem in the way it’s using the html5lib interface.

Anyway, I’m interested in hearing whether anybody else has run into this problem and has a workaround, or at least has some insight into why it’s happening.

Solution

I traced through the source code how lxml hands parameters to html5lib. Most of the functions take a trailing **kws, which is then handed to the next function. In one of the last steps, when the actual html5 parser is called, this is dropped and the parser is called with 2 fixed params.
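
To illustrate the kind of bug described here, below is a minimal sketch of a call chain in which keyword options are accepted but not forwarded at the last step. Every name in it (FakeHtml5Parser, make_parser_dropping, make_parser_forwarding, parse_with) is hypothetical and invented for illustration; this is not lxml's or html5lib's actual code.

```python
class FakeHtml5Parser:
    """Hypothetical stand-in for a backend parser; not html5lib's real class."""

    def __init__(self, namespaceHTMLElements=True):
        self.namespaceHTMLElements = namespaceHTMLElements


def make_parser_dropping(**kws):
    # Bug pattern: the keyword options are accepted but never forwarded,
    # so the backend parser silently falls back to its defaults.
    return FakeHtml5Parser()


def make_parser_forwarding(**kws):
    # Intended pattern: pass the options through to the backend parser.
    return FakeHtml5Parser(**kws)


def parse_with(factory, **kws):
    # Intermediate layer that forwards everything it receives.
    return factory(**kws)


dropped = parse_with(make_parser_dropping, namespaceHTMLElements=False)
forwarded = parse_with(make_parser_forwarding, namespaceHTMLElements=False)

print(dropped.namespaceHTMLElements)    # True: the option was lost
print(forwarded.namespaceHTMLElements)  # False: the option arrived
```

A dropped option of this kind raises no error, which would be consistent with the silent behavior described in the question.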

(I ran into the same problem yesterday and only just found this question. I have forgotten the finer details, so allow me to forgo code snippets and references.)

Anyway, this confirms that in 2018, calling html5lib directly is still the preferred way if calling lxml's own parser is not an option for some reason.

(My use-case was: parse crappy html and have xpath.)
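
For the XPath use case, the namespace matters because an unprefixed path does not match elements that sit in the HTML namespace. A minimal sketch, assuming the standard library's xml.etree.ElementTree as a stand-in (it supports only a subset of XPath); the sample document and the "h" prefix are made up for illustration:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Stand-in for a parsed document whose elements are in the HTML namespace.
doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head></head><body><p>hello</p></body></html>"
)

# An unprefixed path looks for elements in no namespace and finds nothing.
plain_matches = doc.findall(".//p")

# Mapping a prefix to the HTML namespace makes the same query match.
namespaced_matches = doc.findall(".//h:p", {"h": XHTML_NS})

print(len(plain_matches))
print([p.text for p in namespaced_matches])
```

With lxml, the equivalent is to pass a prefix map through the namespaces keyword of xpath(), for example namespaces={"h": XHTML_NS}. Parsing with namespaceHTMLElements=False avoids the prefix map altogether.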
