Reddit API returning useless JSON


Question


I'm trying to scrape new stories from Reddit using their API and Python's urllib2, but I keep getting JSON documents like this one:

{ u'kind': u'Listing', u'data': { u'modhash': u'', u'children': [], u'after': None, u'before': None }}
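For reference, a response like the one above parses to an empty children list and a null after cursor, which is why the collection loop below never accumulates anything. A quick check (written here in Python 3 with the document literal spelled out; the original question used Python 2's urllib2):

```python
import json

# The "useless" listing, as raw JSON rather than Python repr notation.
raw = '{"kind": "Listing", "data": {"modhash": "", "children": [], "after": null, "before": null}}'

doc = json.loads(raw)
stories = doc['data']['children']  # empty list -> no stories to collect
cursor = doc['data']['after']      # None -> no next page to request

print(len(stories), cursor)
```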

Here is my code:

import json
import time
import urllib2

def get_submissions(after=None):
    url = 'http://reddit.com/r/all/new.json?limit=100'
    if after:
        url += '&after=%s' % after

    _user_agent = 'Reddit Link Analysis Bot by PirateLogic @ github.com/jamesbrewer'
    _request = urllib2.Request(url, headers={'User-agent': _user_agent})
    _json = json.loads(urllib2.urlopen(_request).read())   

    return [story for story in _json['data']['children']], _json['data']['after']

if __name__ == '__main__':
    after = None
    stories = []
    limit = 1
    while len(stories) < limit:
        new_stories, after = get_submissions(after)
        stories.extend(new_stories)
        time.sleep(2) # The Reddit API allows one request every two seconds.
        print '%d stories collected so far .. sleeping for two seconds.' % len(stories)

What I've written is fairly short and straight-forward, but I'm obviously overlooking something or I don't have a complete understanding of the API or how urllib2 works.

Here's an example page from the API.

What's the deal?

EDIT After trying to load the example page in another browser, I'm also seeing the JSON I posted at the top of the page. It seems to be only for //new.json though. If I try //hot.json or just /.json, I get what I want.

Solution

Edit: As of 2013/02/22, the desired new sort no longer requires sort=new to be added as a URL parameter. This is because the rising sort is no longer provided under the /new route, but is provided by /rising [source].


The problem with the URL http://reddit.com/r/all/new.json?limit=100 is that the new pages by default use the rising sort. If you are logged in, and you have changed the default sort to new then what you really see is the result for the page http://reddit.com/r/all/new.json?limit=100&sort=new. Notice the addition of the parameter sort=new.
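A minimal sketch of building the URL with the sort=new parameter included (Python 3's urllib.parse here rather than the question's urllib2; build_new_url is a hypothetical helper name, not part of any library):

```python
from urllib.parse import urlencode

def build_new_url(subreddit='all', limit=100, after=None):
    """Build a /new.json URL that explicitly requests the 'new' sort."""
    # Without sort=new, the /new endpoint served the "rising" sort
    # by default (pre-2013 behaviour described above).
    params = {'limit': limit, 'sort': 'new'}
    if after:
        params['after'] = after
    return 'http://reddit.com/r/%s/new.json?%s' % (subreddit, urlencode(params))

print(build_new_url(after='t3_abc123'))
```

Dropping the sort parameter into the query string this way is all the fix amounts to; the rest of the original request code can stay the same.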

Thus the result is correct, it is just that the rising view has not been updated for /r/all.

On a related note, I strongly suggest using PRAW (the python reddit API wrapper) rather than writing your own code if you plan to use more than just a single part of the API. Here's the relevant code that you want:

import praw
r = praw.Reddit('YOUR DESCRIPTIVE USER AGENT NAME')
listing = list(r.get_subreddit('all').get_new_by_date())
print listing

If you simply want to iterate over the submissions you can omit the list() part.
