How to extract links from a webpage using lxml, XPath and Python?
Question
I've got this XPath query:
/html/body//tbody/tr[*]/td[*]/a[@title]/@href
In Firefox's XPath Checker add-on, it extracts all the links that have a title attribute and returns their href values.
However, I cannot seem to use it with lxml.
from lxml import etree
# page is assumed to hold the HTML source of the webpage as a string.
parsedPage = etree.HTML(page)  # Create parse tree from valid page.
# XPath query
hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
for x in hyperlinks:
    print(x)  # Print the href of each <a> tag that has a title attribute
This produces no result from lxml (an empty list).
How would one grab the href text (the link) of a hyperlink that has a title attribute, using lxml under Python?
Recommended answer
I was able to make it work with the following code:
from io import StringIO
from lxml import etree

html_string = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head/>
<body>
<table border="1">
<tbody>
<tr>
<td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td>
</tr>
<tr>
<td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td>
</tr>
</tbody>
</table>
</body>
</html>'''

# etree.parse uses the XML parser here, which works because this sample is well-formed.
tree = etree.parse(StringIO(html_string))
print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))
# ['http://stackoverflow.com/foobar', 'http://stackoverflow.com/baz']
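The sample above contains an explicit tbody element. One possible explanation for the empty list in the question (an assumption, since the original page is not shown) is that the page source has no tbody: browsers such as Firefox add one to the DOM they display, while lxml's HTML parser keeps the tree as written, so a path that requires tbody matches nothing. The sketch below uses a made-up page string and a hypothetical helper, extract_titled_links, to show the difference and a query that does not depend on the table structure.

```python
from lxml import html

# Hypothetical page source: the table has no explicit <tbody>,
# which is common in hand-written HTML.
page = '''<html><body>
<table>
  <tr><td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td></tr>
  <tr><td><a href="http://stackoverflow.com/baz">No title here</a></td></tr>
</table>
</body></html>'''


def extract_titled_links(source):
    """Return the href of every <a> element that has a title attribute."""
    tree = html.fromstring(source)
    # Match on the anchor itself rather than on the surrounding table layout.
    return tree.xpath('//a[@title]/@href')


# The query from the question requires a tbody element in the parsed tree.
original_query = html.fromstring(page).xpath(
    '/html/body//tbody/tr[*]/td[*]/a[@title]/@href')

print(original_query)              # [] because the parsed tree has no tbody
print(extract_titled_links(page))  # ['http://stackoverflow.com/foobar']
```

If the table position matters, dropping only the tbody step (for example //table//tr/td/a[@title]/@href) keeps the query closer to the original while still matching pages with or without tbody.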