How to extract links from a webpage using lxml, XPath and Python?

Views: 1018 | Posted: 2018/7/11 17:11:11 | Tags: python, screen-scraping, hyperlink, lxml, extraction

Question

I've got this XPath query:

/html/body//tbody/tr[*]/td[*]/a[@title]/@href

It extracts all links that have a title attribute and returns their href values when run in Firefox's XPath Checker add-on.

However, I cannot get it to work with lxml.

from lxml import etree
parsedPage = etree.HTML(page) # Create parse tree from valid page.

# Xpath query
hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href") 
for x in hyperlinks:
    print(x)  # Print the href of each <a> tag that has a title attribute

With lxml, this produces no result (an empty list).

How can I get the href value (the link) of hyperlinks that have a title attribute, using lxml in Python?

Recommended answer

I was able to make it work with the following code:

from io import StringIO  # Python 3 location of StringIO (Python 2 used the StringIO module)

from lxml import etree

html_string = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
<head/>
<body>
    <table border="1">
      <tbody>
        <tr>
          <td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td>
        </tr>
        <tr>
          <td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td>
        </tr>
      </tbody>
    </table>
</body>
</html>'''

# The sample markup is well-formed, so etree.parse can read it with its default XML parser.
tree = etree.parse(StringIO(html_string))

# Select the href of every <a> that has a title attribute, under tbody/tr/td.
print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))

# Output: ['http://stackoverflow.com/foobar', 'http://stackoverflow.com/baz']
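One likely reason the original query came back empty, though the question does not show the page, so this is an assumption: browsers such as Firefox add a tbody element to the DOM of every table, while lxml's HTML parser keeps the markup as written. If the raw page has no literal tbody tag, a query copied from a browser tool will not match. The sketch below uses a made-up page string (not the asker's page) to compare the browser-style query with a variant that skips the tbody step.

```python
from lxml import etree

# Hypothetical raw markup with no <tbody>, as is common in hand-written HTML.
page = """<html><body>
<table>
  <tr><td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td></tr>
  <tr><td><a href="http://stackoverflow.com/baz">No title here</a></td></tr>
</table>
</body></html>"""

parsed_page = etree.HTML(page)

# Browser-derived query: requires a <tbody> element that the raw markup lacks.
strict = parsed_page.xpath("/html/body//tbody/tr/td/a[@title]/@href")

# Relaxed query: '//tr' finds rows whether or not a <tbody> sits in between.
relaxed = parsed_page.xpath("/html/body//table//tr/td/a[@title]/@href")

print(strict)
print(relaxed)
```

On this sample the first query should find nothing and the second should return only the link that carries a title attribute. If the real page does contain tbody tags, the relaxed form still matches, which makes it the safer choice when the source markup is unknown.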
