Reddit API returning useless JSON
Problem description
I'm trying to scrape new stories from Reddit using their API and Python's urllib2, but I keep getting JSON documents like this one:
{u'kind': u'Listing', u'data': {u'modhash': u'', u'children': [], u'after': None, u'before': None}}
Here is my code:
import json
import time
import urllib2

def get_submissions(after=None):
    url = 'http://reddit.com/r/all/new.json?limit=100'
    if after:
        url += '&after=%s' % after

    _user_agent = 'Reddit Link Analysis Bot by PirateLogic @ github.com/jamesbrewer'
    _request = urllib2.Request(url, headers={'User-agent': _user_agent})
    _json = json.loads(urllib2.urlopen(_request).read())

    return [story for story in _json['data']['children']], _json['data']['after']

if __name__ == '__main__':
    after = None
    stories = []
    limit = 1

    while len(stories) < limit:
        new_stories, after = get_submissions(after)
        stories.extend(new_stories)
        time.sleep(2)  # The Reddit API allows one request every two seconds.
        print '%d stories collected so far .. sleeping for two seconds.' % len(stories)
What I've written is fairly short and straightforward, but I'm obviously overlooking something, or I don't have a complete understanding of the API or of how urllib2 works.
Here's an example page from the API.
What's the deal?
EDIT: After trying to load the example page in another browser, I'm also seeing the JSON I posted at the top of the page. It seems to apply only to //new.json, though. If I try //hot.json or just /.json, I get what I want.
Edit: As of 2013/02/22, the desired new sort no longer requires sort=new to be added as a URL parameter. This is because the rising sort is no longer provided under the /new route, but is provided by /rising [source].
The problem with the URL http://reddit.com/r/all/new.json?limit=100 is that the new pages by default use the rising sort. If you are logged in and have changed the default sort to new, then what you really see is the result for the page http://reddit.com/r/all/new.json?limit=100&sort=new. Notice the addition of the parameter sort=new.
Thus the result is correct; it is just that the rising view has not been updated for /r/all.
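The fix described above can be sketched as a small URL builder that always passes sort=new explicitly. The base URL and the limit, after, and sort=new parameters come from the question and answer; build_url itself is a hypothetical helper introduced here for illustration:

```python
# Hypothetical helper illustrating the answer's fix: request the "new"
# sort explicitly so the listing is not the (possibly stale) rising view.
def build_url(subreddit='all', limit=100, after=None):
    url = 'http://reddit.com/r/%s/new.json?limit=%d&sort=new' % (subreddit, limit)
    if after:
        # 'after' is the pagination cursor returned in listing['data']['after'].
        url += '&after=%s' % after
    return url

print(build_url())
# http://reddit.com/r/all/new.json?limit=100&sort=new
```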
On a related note, if you plan to use more than a single part of the API, I strongly suggest using PRAW (the Python Reddit API Wrapper) rather than writing your own code. Here's the relevant code that you want:
import praw

r = praw.Reddit('YOUR DESCRIPTIVE USER AGENT NAME')
listing = list(r.get_subreddit('all').get_new_by_date())
print listing
If you simply want to iterate over the submissions, you can omit the list() part.
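That lazy-iteration point generalizes: following an after cursor page by page fits Python's generator protocol, which avoids materializing every story before the first one is processed. A minimal sketch mirroring the question's while loop, with paginate and fake_fetch invented here as stand-ins for a real API call:

```python
def paginate(fetch, after=None):
    """Yield stories one at a time, following the 'after' cursor until it runs out."""
    while True:
        stories, after = fetch(after)
        for story in stories:
            yield story
        if after is None:
            break

# Stub standing in for a real API call: two pages, then no cursor.
PAGES = {None: (['a', 'b'], 't3_x'), 't3_x': (['c'], None)}

def fake_fetch(after):
    return PAGES[after]

print(list(paginate(fake_fetch)))  # ['a', 'b', 'c']
```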