Feedparser - retrieve old messages from Google Reader


Question

I'm using the feedparser library in Python to retrieve news from a local newspaper (my intent is to do natural language processing over this corpus), and I would like to be able to retrieve many past entries from the RSS feed.

I'm not very familiar with the technical details of RSS, but I think this should be possible (I can see that, e.g., Google Reader and Feedly do this "on demand" as I move the scrollbar).

When I do the following:

import feedparser

url = 'http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml'
feed = feedparser.parse(url)
# Loop over the entries that the feed document currently contains
for post in feed.entries:
    title = post.title

I get only a dozen entries or so. I was hoping for hundreds, maybe all entries from the last month, if possible. Is it possible to do this with feedparser alone?

I intend to take only the link to each news item from the RSS feed and parse the full page with BeautifulSoup to obtain the text I want. An alternative would be a crawler that follows all local links on the page to collect many news items, but I want to avoid that for now.
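A minimal sketch of that plan, under some assumptions: the helper names `entry_links`, `page_text`, and `fetch_corpus` are hypothetical, pages are fetched with the standard library's `urlopen`, and `page_text` returns all text on the page (navigation included), since picking out only the article body would need a site-specific selector that is not shown here.

```python
from urllib.request import urlopen


def entry_links(entries):
    """Return the unique, non-empty 'link' values of feed entries, in order."""
    seen = set()
    links = []
    for entry in entries:
        link = entry.get("link")
        if link and link not in seen:
            seen.add(link)
            links.append(link)
    return links


def page_text(html):
    """Strip the markup from an HTML document and return its text."""
    # Imported here so entry_links() stays usable without BeautifulSoup.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements, whose contents are not article text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


def fetch_corpus(feed_url):
    """Map each article link found in the feed to the text of its page."""
    # Third-party package, imported here for the same reason as above.
    import feedparser

    feed = feedparser.parse(feed_url)
    corpus = {}
    for link in entry_links(feed.entries):
        with urlopen(link, timeout=30) as response:
            corpus[link] = page_text(response.read())
    return corpus
```

Calling `fetch_corpus(url)` with the feed URL above would return a dictionary with one text per linked article, but only for the entries the feed currently lists.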

-

One solution that came up is to use the Google Reader RSS cache:

http://www.google.com/reader/atom/feed/http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml?n=1000

But to access this I must be logged in to Google Reader. Does anyone know how to do that from Python? (I really don't know a thing about the web; I usually only work with numerical calculus.)

Recommended answer

You're only getting a dozen entries or so because that's all the feed contains. If you want historic data, you will have to find a feed or database that holds it.

Check out this ReadWriteWeb article for some resources on finding open data on the web.

Note that, despite what your title suggests, feedparser has nothing to do with this. Feedparser parses what you give it; it can't return historic data unless you find that data yourself and pass it in. It is simply a parser. Hope that clears things up! :)
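Since the feed only ever holds the latest entries, one possible workaround (not part of the original answer, and no help for entries that have already dropped out of the feed) is to build your own history going forward: fetch the feed on a schedule and store any entries not yet seen. The sketch below assumes entries can be identified by their `link`; the names `merge_entries` and `update_archive` and the JSON file layout are hypothetical.

```python
import json
from pathlib import Path


def merge_entries(archive, entries):
    """Add entries not yet in the archive (keyed by link); return the number added."""
    added = 0
    for entry in entries:
        link = entry.get("link")
        if not link or link in archive:
            continue  # skip entries without a link or already stored
        archive[link] = {
            "title": entry.get("title", ""),
            "published": entry.get("published", ""),
        }
        added += 1
    return added


def update_archive(feed_url, archive_path):
    """Fetch the feed once and store any unseen entries in a JSON file."""
    # Third-party package, imported here so merge_entries() works without it.
    import feedparser

    path = Path(archive_path)
    archive = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    added = merge_entries(archive, feedparser.parse(feed_url).entries)
    path.write_text(json.dumps(archive, ensure_ascii=False, indent=2), encoding="utf-8")
    return added
```

Running `update_archive(url, "archive.json")` from a scheduled job (for example, once an hour) would gradually accumulate a larger corpus than a single fetch returns.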
