How to preserve namespace information when parsing HTML with lxml?


Question

>>> from lxml.etree import HTML, tostring
>>> tostring(HTML('<fb:like>'))
'<html><body><like/></body></html>'

Note how the tag turns from <fb:like> to simply <like>.

This makes processing pages that incorporate XFBML with lxml much harder. (The same thing happens to <g:plusone></g:plusone>.)

Thanks for your help.

Recommended answer

One way to fix this issue is to patch libxml2.

Referring to the source code of libxml2 2.9.2 (https://git.gnome.org/browse/libxml2/tree/?id=v2.9.2): in SAX2.c (https://git.gnome.org/browse/libxml2/tree/SAX2.c?id=v2.9.2), the internal SAX parser used to create the DOM tree, attributes with xmlns are not given special treatment in HTML mode at line 1699, and they are parsed like any other attribute at line 1740. Consequently, it makes sense to adjust line 1622, which splits the name into prefix and local part. Change:

name = xmlSplitQName(ctxt, fullname, &prefix);

to:

if (!ctxt->html) {
    name = xmlSplitQName(ctxt, fullname, &prefix);
} else {
    name = xmlStrdup(fullname);
    prefix = NULL;
}

With this change, libxml2 treats a tag such as <o:p> as an element named o:p; that is, the colon is part of the element name and has no special meaning. This is the correct interpretation in HTML. For example, the HTML5 specification says:

In the HTML syntax, namespace prefixes and namespace declarations do not have the same effect as in XML. For instance, the colon has no special meaning in HTML element names.
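To see that interpretation in practice without patching anything, here is a minimal sketch using Python's standard-library html.parser rather than lxml or libxml2. The TagCollector class and collect_tags helper are names made up for this illustration. It only shows that a parser following the HTML reading reports the colon as an ordinary part of the tag name; it does not build an lxml tree.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag exactly as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # The tag name arrives lower-cased, with any colon left in place.
        self.tags.append(tag)


def collect_tags(markup):
    """Return the start-tag names found in an HTML string, in document order."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags


if __name__ == "__main__":
    # Prints ['fb:like', 'g:plusone']
    print(collect_tags("<fb:like></fb:like><g:plusone></g:plusone>"))
```

This is useful as a quick check of which prefixed tags a page contains, even when the lxml tree has already dropped the prefixes.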

Hopefully this change will be approved for a future version of libxml2. There is an open bug report (https://bugzilla.gnome.org/show_bug.cgi?id=654146).
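If rebuilding libxml2 is not an option, a different and cruder approach is to hide the colon from the parser and put it back after serialization. The sketch below is not part of the patch above. The PLACEHOLDER string and both helper names are assumptions made for illustration, and the regular expressions only look at tag names, so a string such as "<a:b" inside text or an attribute value would be rewritten as well.

```python
import re

# Hypothetical placeholder; assumed never to occur in the page's real tag names.
PLACEHOLDER = "__ns__"

# Start of an opening or closing tag with a prefixed name,
# e.g. "<fb:like" or "</g:plusone".
_PREFIXED_TAG = re.compile(r"(</?)([A-Za-z][\w.-]*):([A-Za-z][\w.-]*)")

# Start of a tag whose name contains the placeholder (lazy match on the prefix).
_MASKED_TAG = re.compile(r"(</?[A-Za-z][\w.-]*?)" + re.escape(PLACEHOLDER))


def mask_prefixes(markup):
    """Replace the colon in prefixed tag names so the parser sees one plain name."""
    return _PREFIXED_TAG.sub(
        lambda m: m.group(1) + m.group(2) + PLACEHOLDER + m.group(3), markup
    )


def unmask_prefixes(markup):
    """Restore the colon in tag names that were masked by mask_prefixes."""
    return _MASKED_TAG.sub(lambda m: m.group(1) + ":", markup)


if __name__ == "__main__":
    masked = mask_prefixes("<fb:like></fb:like>")
    print(masked)                   # <fb__ns__like></fb__ns__like>
    print(unmask_prefixes(masked))  # <fb:like></fb:like>
```

The intended use is to pass the output of mask_prefixes to lxml's HTML parser, work with elements named like fb__ns__like, and run unmask_prefixes on the serialized result. That lxml keeps such unknown tag names intact is an assumption worth verifying against the installed version.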
