How can I parse HTML with html5lib, and query the parsed HTML with XPath?

Question

I am trying to use html5lib to parse an HTML page into something I can query with XPath. html5lib has close to zero documentation, and I've spent too much time trying to figure this out. The ultimate goal is to pull out the second row of a table:

<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>

So let's try it:

>>> import html5lib
>>> import lxml.etree
>>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml')
>>> doc
<lxml.etree._ElementTree object at 0x1a1c290>

That looks good; let's see what else we have:

>>> root = doc.getroot()
>>> print(lxml.etree.tostring(root))
<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body><html:table><html:tbody><html:tr><html:td>Header</html:td></html:tr><html:tr><html:td>Want This</html:td></html:tr></html:tbody></html:table></html:body></html:html>

LOL WUT?

Seriously. I was planning on using some XPath to get at the data I want, but that doesn't seem to work. So what can I do? I am willing to try different libraries and approaches.

Recommended answer

Lack of documentation is a good reason to avoid a library IMO, no matter how cool it is. Are you wedded to using html5lib? Have you looked at lxml.html?
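If you do want to keep the html5lib tree, it may help to know what the `html:` prefixes in your output mean: every element was placed in the XHTML namespace, so an un-prefixed path such as `//tr` matches nothing. A minimal sketch of a namespace-aware query follows. It uses only the standard library's `xml.etree.ElementTree` (with its limited XPath subset) on the serialized output pasted in as a string; the prefix `h` and the variable names are my own choices, not anything html5lib defines.

```python
import xml.etree.ElementTree as ET

# The serialized tree from the question, pasted in as a string.
XHTML = (
    '<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/>'
    '<html:body><html:table><html:tbody>'
    '<html:tr><html:td>Header</html:td></html:tr>'
    '<html:tr><html:td>Want This</html:td></html:tr>'
    '</html:tbody></html:table></html:body></html:html>'
)
root = ET.fromstring(XHTML)

# An un-prefixed path looks for elements in no namespace, so it finds nothing.
plain_rows = root.findall('.//tr')

# Bind a prefix (the name "h" is arbitrary) to the XHTML namespace URI.
NS = {'h': 'http://www.w3.org/1999/xhtml'}
rows = root.findall('.//h:tr', NS)

# rows[1] is the second row; collect the text of its cells.
second_row = [td.text for td in rows[1].findall('h:td', NS)]
print(plain_rows, second_row)
```

With lxml itself the same idea should carry over by passing a prefix map through the `namespaces` keyword of `xpath()`. As far as I know, `html5lib.parse` also accepts `namespaceHTMLElements=False`, which builds the tree without the namespace in the first place, but check that against the version you have installed.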

Here is a way to do this with lxml:

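from lxml import html

# The sample document from the question
text = '<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>'
tree = html.fromstring(text)
[td.text for td in tree.xpath("//td")]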

Result:

['Header', 'Want This']
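Since the stated goal was only the second row, the query can be narrowed with a positional predicate. This is a small sketch along the same lines, assuming the sample markup from the question is held in a variable named `text` (my name, not part of any API):

```python
from lxml import html

# Sample document from the question, split across two lines for readability.
text = ('<html><table><tr><td>Header</td></tr>'
        '<tr><td>Want This</td></tr></table></html>')
tree = html.fromstring(text)

# tr[2] selects the second <tr> child of its parent (XPath positions start at 1);
# text() returns the text nodes of the cells in that row.
second_row = tree.xpath('//tr[2]/td/text()')
print(second_row)
```

Because `tr[2]` counts position among siblings, the same expression should also work on parsers that wrap the rows in a `tbody` element.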
