为什么用lxml处理XHTML文档(在python中)时xpath无法正常工作? [英] Why doesn't xpath work when processing an XHTML document with lxml (in python)?

查看:77
本文介绍了为什么用lxml处理XHTML文档(在python中)时xpath无法正常工作?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在针对以下测试文档进行测试:

I am testing against the following test document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
        <title>hi there</title>
    </head>
    <body>
        <img class="foo" src="bar.png"/>
    </body>
</html>

如果我使用lxml.html解析文档,则可以使用xpath获得IMG:

If I parse the document using lxml.html, I can get the IMG with an xpath just fine:

>>> root = lxml.html.fromstring(doc)
>>> root.xpath("//img")
[<Element img at 1879e30>]

但是,如果我将文档解析为XML并尝试获取IMG标签,则会得到空结果:

However, if I parse the document as XML and try to get the IMG tag, I get an empty result:

>>> tree = etree.parse(StringIO(doc))
>>> tree.getroot().xpath("//img")
[]

我可以直接导航到该元素:

I can navigate to the element directly:

>>> tree.getroot().getchildren()[1].getchildren()[0]
<Element {http://www.w3.org/1999/xhtml}img at f56810>

但是,那当然对我处理任意文档没有帮助.我还希望能够查询etree以获得可以直接标识此元素的xpath表达式,从技术上讲,我可以做到这一点:

But of course that doesn't help me process arbitrary documents. I would also expect to be able to query etree to get an xpath expression that will directly identify this element, which, technically I can do:

>>> tree.getpath(tree.getroot().getchildren()[1].getchildren()[0])
'/*/*[2]/*'
>>> tree.getroot().xpath('/*/*[2]/*')
[<Element {http://www.w3.org/1999/xhtml}img at fa1750>]

但是,同样,xpath显然对解析任意文档没有用.

But that xpath is, again, obviously not useful for parsing arbitrary documents.

很显然,我在这里缺少一些关键问题,但我不知道它是什么.我最好的猜测是,它与名称空间有关,但是唯一定义的名称空间是默认名称,我不知道关于名称空间我还需要考虑什么.

Obviously I am missing some key issue here, but I don't know what it is. My best guess is that it has something to do with namespaces but the only namespace defined is the default and I don't know what else I might need to consider in regards to namespaces.

那么,我想念什么?

推荐答案

问题是名称空间.当解析为XML时,img标签位于 http://www.w3.org/1999/xhtml 名称空间,因为它是元素的默认名称空间.您要在没有名称空间的情况下请求img标签.

The problem is the namespaces. When parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace since that is the default namespace for the element. You are asking for the img tag in no namespace.

尝试一下:

>>> tree.getroot().xpath(
...     "//xhtml:img", 
...     namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
...     )
[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]

这篇关于为什么用lxml处理XHTML文档(在python中)时xpath无法正常工作?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆