Extract absolute links from a page using HTMLParser

Views: 90 | Published: 2020/11/24 21:05:04 | Tags: python, html, html-parsing

Question

I'm using the following snippet to extract all the links on a page with HTMLParser. Many of the results are relative URLs. How can I convert them to absolute URLs for a given domain, e.g. www.example.com?

import urllib, htmllib, formatter

class LinksExtractor(htmllib.HTMLParser):

   def __init__(self, formatter):
      htmllib.HTMLParser.__init__(self, formatter)
      self.links = []

   def start_a(self, attrs):
      if len(attrs) > 0 :
         for attr in attrs :
            if attr[0] == "href":
                self.links.append(attr[1])

   def get_links(self):
      return self.links


format = formatter.NullFormatter()
htmlparser = LinksExtractor(format)

data = urllib.urlopen("http://cis.poly.edu/index.htm")
htmlparser.feed(data.read())
htmlparser.close()

links = htmlparser.get_links()
print links

Thanks

Recommended answer

You want

urlparse.urljoin(base, url[, allow_fragments])

http://docs.python.org/library/urlparse.html#urlparse.urljoin

This allows you to give an absolute or base url, and join it with a relative url. Even if they have overlapping pieces, it should work.
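
As a rough illustration, here is a minimal sketch of how the join could be folded into the extractor. It assumes Python 3, where urlparse.urljoin lives at urllib.parse.urljoin and htmllib has been replaced by html.parser. The class name AbsoluteLinksExtractor, the helper extract_absolute_links, the sample HTML, and the base URL are made up for the example and are not from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AbsoluteLinksExtractor(HTMLParser):
    """Collects href values from <a> tags, resolved against a base URL.

    Hypothetical class for illustration; not part of the original question.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin resolves relative URLs against the base and
                # leaves URLs that are already absolute unchanged.
                self.links.append(urljoin(self.base_url, value))


def extract_absolute_links(html, base_url):
    """Parse an HTML string and return its links as absolute URLs."""
    parser = AbsoluteLinksExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Assumed sample input: the base URL should be the URL the page was fetched from.
sample_html = (
    '<a href="/about.html">About</a> '
    '<a href="news/today.html">News</a> '
    '<a href="http://other.example.org/x">External</a>'
)
links = extract_absolute_links(sample_html, "http://www.example.com/docs/index.htm")
print(links)
```

The base passed to urljoin should be the full URL of the fetched page, not just the domain, so that path-relative links such as news/today.html resolve against the page's directory.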
