Print more than 20 posts from Tumblr API

Problem Description

Good afternoon,

I'm very new to Python, but I'm trying to write a script that will allow me to download all of the posts (including the "notes") from a specified Tumblr account to my computer.

Given my inexperience with coding, I tried to find a pre-made script that would let me do this. I found several brilliant scripts on GitHub, but none of them actually return the notes from Tumblr posts (as far as I can see, although please do correct me if anyone knows of one that does!).

Therefore, I tried to write my own script, and I've had some success with the code below. It prints the most recent 20 posts from the given Tumblr (albeit in a rather ugly format, with hundreds of lines of text all printed onto a single line of a notepad file):

# This script prints all the posts (including tags and comments) and the
# first 20 notes from the given Tumblr blog.

import pytumblr

# Authenticate via API key
client = pytumblr.TumblrRestClient('myapikey')

# Make the request and print the response into a .txt file
with open('out.txt', 'w') as f:
    print >> f, client.posts('staff', limit=2000, offset=0,
                             reblog_info=True, notes_info=True,
                             filter='html')
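
(Side note: the "ugly" output happens because the whole response dictionary is printed on a single line. A minimal sketch of a more readable dump, assuming the parsed response contains only JSON-serializable types, which pytumblr's parsed responses normally do:)

import json
import pytumblr

client = pytumblr.TumblrRestClient('myapikey')

# Fetch one page of posts; the API returns at most 20 per request
response = client.posts('staff', limit=20, offset=0,
                        reblog_info=True, notes_info=True, filter='html')

# json.dumps with indent=2 pretty-prints the nested dict across many lines
with open('out.txt', 'w') as f:
    f.write(json.dumps(response, indent=2))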

However, I want the script to continuously print posts until it reaches the end of the specified blog.

I searched this site and found a very similar question (Getting only 20 posts returned through PyTumblr), which was answered by Stack Overflow user poke. However, I can't seem to implement poke's solution so that it works for my data. In fact, when I run the following script, no output at all is produced.

import pytumblr

# Authenticate via API key
client = pytumblr.TumblrRestClient('myapikey')
blog = 'staff'

def getAllPosts(client, blog):
    offset = 0
    while True:
        posts = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)
        if not posts:
            return

        for post in posts:
            yield post

        offset += 20

I should note that there are several posts on this site (e.g. Getting more than 50 notes with Tumblr API) about Tumblr notes, most of them asking how to download more than 50 notes per post. I'm perfectly happy with just 50 notes per post; it's the number of posts that I would like to increase.

Also, I've tagged this post as Python; however, if there is a better way to get the data I require using another programming language, that would be more than okay.

Thank you very much in advance for your time!

Answer

tl;dr: If you'd like to just see the answer, it's at the bottom, after the heading "A Corrected Version".

The second code snippet is a generator that yields posts one by one, so you have to use it as part of something like a loop and then do something with the output. Here's your code, with some additional code that iterates over the generator and prints out the data it gets back:

import pytumblr

def getAllPosts(client, blog):
    offset = 0
    while True:
        posts = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)
        if not posts:
            return

        for post in posts:
            yield post

        offset += 20

# Authenticate via API Key
client = pytumblr.TumblrRestClient('myapikey')
blog = 'staff'

# use the generator getAllPosts
for post in getAllPosts(client, blog):
    print(post)

However, that code has a couple of bugs in it. getAllPosts won't yield just each post; it will also return other things, because it iterates over the whole API response, as you can see from this example I ran in my IPython shell:

In [7]: yielder = getAllPosts(client, 'staff')

In [8]: next(yielder)
Out[8]: 'blog'

In [9]: next(yielder)
Out[9]: 'posts'

In [10]: next(yielder)
Out[10]: 'total_posts'

In [11]: next(yielder)
Out[11]: 'supply_logging_positions'

In [12]: next(yielder)
Out[12]: 'blog'

In [13]: next(yielder)
Out[13]: 'posts'

In [14]: next(yielder)
Out[14]: 'total_posts'

This happens because the posts object in getAllPosts is a dictionary that contains much more than just each post from the staff blog: it also holds items like how many posts the blog contains, the blog's description, when it was last updated, and so on. Iterating over a dictionary yields its keys, which is why the session above prints field names rather than posts. The code as-is could also result in an infinite loop, because the following conditional:

if not posts:
    return

will never be true given the response structure: an empty Tumblr API response from pytumblr looks like this:

{'blog': {'ask': False,
  'ask_anon': False,
  'ask_page_title': 'Ask me anything',
  'can_send_fan_mail': False,
  'can_subscribe': False,
  'description': '',
  'followed': False,
  'is_adult': False,
  'is_blocked_from_primary': False,
  'is_nsfw': False,
  'is_optout_ads': False,
  'name': 'asdfasdf',
  'posts': 0,
  'reply_conditions': '3',
  'share_likes': False,
  'subscribed': False,
  'title': 'Untitled',
  'total_posts': 0,
  'updated': 0,
  'url': 'https://asdfasdf.tumblr.com/'},
 'posts': [],
 'supply_logging_positions': [],
 'total_posts': 0}

if not posts is checked against that whole structure, rather than against the posts field (which is the empty list here), so the condition never becomes true, because the response dictionary itself is never empty (see: Truth Value Testing in Python).
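
To make this concrete, here's a minimal sketch, using a hypothetical trimmed-down response dict rather than a real API call:

# Hypothetical, cut-down version of the empty response shown above
empty_response = {'blog': {'name': 'asdfasdf'}, 'posts': [], 'total_posts': 0}

# Iterating over a dict yields its keys, not the posts themselves,
# which is exactly what the IPython session above shows
for item in empty_response:
    print(item)  # 'blog', 'posts', 'total_posts'

# The dict itself is non-empty, so "not empty_response" is False
# and the generator's "if not posts: return" never fires
print(not empty_response)           # False

# The 'posts' list inside it is empty; that is the field to test
print(not empty_response['posts'])  # True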

A Corrected Version

Here's code (mostly tested/verified) that fixes the loop from your getAllPosts implementation, and then uses the function to retrieve posts and dump them to a file named (BLOG_NAME)-posts.txt:

import pytumblr


def get_all_posts(client, blog):
    offset = 0
    while True:
        response = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)

        # Get the 'posts' field of the response        
        posts = response['posts']

        if not posts: return

        for post in posts:
            yield post

        # move to the next offset
        offset += 20


client = pytumblr.TumblrRestClient('secrety-secret')
blog = 'staff'

# use our function
with open('{}-posts.txt'.format(blog), 'w') as out_file:
    for post in get_all_posts(client, blog):
        print >>out_file, post
        # if you're in python 3.x, use the following
        # print(post, file=out_file)

This will just be a straight text dump of the API's post responses, so if you need to make it look nicer or anything, that's up to you.
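
If you'd prefer structured output to a raw text dump, one option is to write one JSON object per line ("JSON Lines"). This is a sketch, not part of the original answer: it reuses the client, blog, and get_all_posts names from the script above, and assumes each post dict is JSON-serializable, which pytumblr's parsed responses normally are:

import json

# Reuses client, blog, and get_all_posts from the script above.
# One JSON object per line keeps each post easy to parse later
# with json.loads.
with open('{}-posts.jsonl'.format(blog), 'w') as out_file:
    for post in get_all_posts(client, blog):
        out_file.write(json.dumps(post) + '\n')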
