Automatically extracting feed links (Atom, RSS, etc.) from webpages

Problem description

I have a huge list of URLs, and my task is to feed them to a Python script that should output the feed URLs for each page, if there are any. Is there a library, API, or existing code that can help?

Recommended answer

I second waffle paradox in recommending Beautiful Soup for parsing the HTML and then collecting the <link rel="alternate"> tags, which is where feeds are referenced. The code I usually use:

# Beautiful Soup 4 (package "beautifulsoup4"); the older "from BeautifulSoup import ..." form is the Python 2-era version 3
from bs4 import BeautifulSoup

def detect_feeds_in_HTML(input_stream):
    """ examines an open text stream with HTML for referenced feeds.

    This is achieved by detecting all ``link`` tags that reference a feed in HTML.

    :param input_stream: an arbitrary opened input stream that has a :func:`read` method.
    :type input_stream: an input stream (e.g. open file or URL)
    :return: a list of tuples ``(url, feed_type)``; ``feed_type`` is ``None`` if the tag has no ``type`` attribute
    :rtype: ``list(tuple(str, str))``
    """
    # check that we were really given an input stream
    if not hasattr(input_stream, "read"):
        raise TypeError("An opened input *stream* should be given, was %s instead!" % type(input_stream))
    result = []
    # parse the textual data (the HTML) read from the input stream
    html = BeautifulSoup(input_stream.read(), "html.parser")
    # find all <link> tags whose rel attribute contains "alternate"
    feed_urls = html.find_all("link", rel="alternate")
    # extract URL and type
    for feed_link in feed_urls:
        url = feed_link.get("href", None)
        # skip tags that carry no href
        if url:
            result.append((url, feed_link.get("type")))
    return result
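
Two practical wrinkles are worth noting: `rel="alternate"` is also used for non-feed alternates (translated pages, for example), and `href` values are often relative to the page. The sketch below is one possible way to handle both, not part of the original answer. It swaps Beautiful Soup for the standard library's `html.parser` so that it runs without extra dependencies; the names `FEED_TYPES`, `FeedLinkParser`, and `find_feed_links`, the sample markup, and the `example.com` base URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise feeds. This set is an assumption;
# extend it if the sites you crawl use other types.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects (absolute_url, mime_type) pairs from <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens
        rel_tokens = (attributes.get("rel") or "").lower().split()
        mime_type = (attributes.get("type") or "").strip().lower()
        href = attributes.get("href")
        if "alternate" in rel_tokens and mime_type in FEED_TYPES and href:
            # resolve relative hrefs against the URL the page was fetched from
            self.feeds.append((urljoin(self.base_url, href), mime_type))


def find_feed_links(html_text, base_url):
    """Return the feed links advertised in html_text as absolute URLs."""
    parser = FeedLinkParser(base_url)
    parser.feed(html_text)
    return parser.feeds


# Hypothetical page: two feeds, one translated alternate, one stylesheet.
sample_html = """
<html><head>
<link rel="alternate" type="application/rss+xml" href="/feed.xml">
<link rel="alternate" type="application/atom+xml" href="https://example.com/atom.xml" />
<link rel="alternate" hreflang="de" href="/de/">
<link rel="stylesheet" href="/style.css">
</head></html>
"""
feeds = find_feed_links(sample_html, "https://example.com/blog/")
print(feeds)
```

For a large list of URLs, the same function could be called once per fetched page, passing the page's own URL as `base_url`.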
