从网页中自动提取提要链接(atom、rss 等) [英] Automatically Extracting feed links (atom, rss,etc) from webpages

查看:57
本文介绍了从网页中自动提取提要链接(atom、rss 等)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个巨大的 URL 列表,我的任务是将它们提供给一个 python 脚本,如果有的话,它应该吐出提要 URL.是否有可以提供帮助的 API 库或代码?

I have a huge list of URLs and my task is to feed them to a python script which should spit out the feed urls if there are any. Is there an API library or code out there that can help?

推荐答案

我在推荐 Beautiful Soup 用于解析 HTML 然后获取 标签,其中引用了提要.我常用的代码:

I second waffle paradox in recommending Beautiful Soup for parsing the HTML and then getting the <link rel="alternate"> tags, where the feeds are referenced. The code I usually use:

from BeautifulSoup import BeautifulSoup as parser

def detect_feeds_in_HTML(input_stream):
    """ examines an open text stream with HTML for referenced feeds.

    This is achieved by detecting all ``link`` tags that reference a feed in HTML.

    :param input_stream: an arbitrary opened input stream that has a :func:`read` method.
    :type input_stream: an input stream (e.g. open file or URL)
    :return: a list of tuples ``(url, feed_type)``
    :rtype: ``list(tuple(str, str))``
    """
    # check if really an input stream
    if not hasattr(input_stream, "read"):
        raise TypeError("An opened input *stream* should be given, was %s instead!" % type(input_stream))
    result = []
    # get the textual data (the HTML) from the input stream
    html = parser(input_stream.read())
    # find all links that have an "alternate" attribute
    feed_urls = html.findAll("link", rel="alternate")
    # extract URL and type
    for feed_link in feed_urls:
        url = feed_link.get("href", None)
        # if a valid URL is there
        if url:
            result.append(url)
    return result

这篇关于从网页中自动提取提要链接(atom、rss 等)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆