How to extract links from a webpage using lxml, XPath and Python?
Question
I've got this xpath query:
/html/body//tbody/tr[*]/td[*]/a[@title]/@href
It extracts all the links that have a title attribute and returns their href values in Firefox's XPath Checker add-on.
However, I cannot seem to get it to work with lxml.
from lxml import etree
# `page` is assumed to hold the HTML source of the page as a string.
parsedPage = etree.HTML(page)  # Create parse tree from valid page.
# XPath query
hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
for x in hyperlinks:
    print(x)  # Print the href of each <a> tag that has a title attribute
This produces no result from lxml (an empty list).
How would one grab the href text (the link) of a hyperlink that has a title attribute, using lxml under Python?
Recommended answer
I was able to make it work with the following code:
from io import StringIO
from lxml import etree
html_string = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head/>
<body>
<table border="1">
  <tbody>
    <tr>
      <td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td>
    </tr>
    <tr>
      <td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td>
    </tr>
  </tbody>
</table>
</body>
</html>'''
# Parse the markup from an in-memory file object.
tree = etree.parse(StringIO(html_string))
# Select the href of every <a> that has a title attribute.
print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))
# Output: ['http://stackoverflow.com/foobar', 'http://stackoverflow.com/baz']
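The answer does not say why the original query came back empty. One likely cause, offered here as an assumption rather than something the source confirms, is that Firefox inserts tbody elements into the DOM it builds even when the page source has none, while lxml's HTML parser keeps the table as written. A query that requires tbody can then match in the browser but not in lxml. The sketch below uses hypothetical sample markup (SAMPLE_PAGE) and a hypothetical helper (extract_titled_links) to compare the strict query with one that does not depend on the table structure.

```python
from lxml import html

# Hypothetical sample markup: note that the source contains no <tbody>.
SAMPLE_PAGE = """
<html>
  <body>
    <table>
      <tr>
        <td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td>
        <td><a href="http://stackoverflow.com/untitled">No title</a></td>
      </tr>
    </table>
  </body>
</html>
"""

# The query from the question, which requires a <tbody> ancestor.
STRICT_QUERY = "/html/body//tbody/tr[*]/td[*]/a[@title]/@href"


def extract_titled_links(page_source):
    """Return the href of every <a> element that has a title attribute."""
    tree = html.fromstring(page_source)
    # Match <a title=...> anywhere, regardless of the table structure.
    return tree.xpath("//a[@title]/@href")


strict = html.fromstring(SAMPLE_PAGE).xpath(STRICT_QUERY)
relaxed = extract_titled_links(SAMPLE_PAGE)

print(strict)   # no <tbody> in the parsed tree, so nothing matches
print(relaxed)  # only the link that carries a title attribute
```

If the links must come from a table specifically, a query such as //table//a[@title]/@href keeps that constraint without depending on tbody.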