How do I save streaming tweets in json via tweepy?


Problem description

I've been learning Python for a couple of months through online courses and would like to further my learning through a real-world mini project.

For this project, I would like to collect tweets from the Twitter streaming API and store them in json format (though you can choose to just save key information like status.text and status.id, I've been advised that the best way to do this is to save all the data and do the processing afterwards). However, with the addition of on_data() the code ceases to work. Would someone be able to assist, please? I'm also open to suggestions on the best way to store/process tweets! My end goal is to be able to track tweets based on demographic variables (e.g., country, user profile age, etc.) and the sentiment towards particular brands (e.g., Apple, HTC, Samsung).

In addition, I would also like to try filtering tweets by location AND keywords. I've adapted the code from How to add a location filter to tweepy module separately. However, while it works when there are a few keywords, it stops when the number of keywords grows. I presume my code is inefficient. Is there a better way of doing it?

### code to save tweets in json ###
import sys
import tweepy
import json

consumer_key=" "
consumer_secret=" "
access_key = " "
access_secret = " "

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
file = open('today.txt', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print status.text

    def on_data(self, data):
        json_data = json.loads(data)
        file.write(str(json_data))

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True # Don't kill the stream

sapi = tweepy.streaming.Stream(auth, CustomStreamListener())
sapi.filter(track=['twitter'])

Recommended answer

In rereading your original question, I realize that you ask a lot of smaller questions. I'll try to answer most of them here but some may merit actually asking a separate question on SO.

  • Why does it break when on_data is added?

Without seeing the actual error, it's hard to say. It actually didn't work for me until I regenerated my consumer/access keys; I'd try that.

There are a few things I might do differently than your answer.

tweets is a global list. This means that if you have multiple StreamListeners (i.e. in multiple threads), every tweet collected by any stream listener will be added to this list. This is because lists in Python refer to locations in memory--if that's confusing, here's a basic example of what I mean:

>>> bar = []
>>> foo = bar
>>> foo.append(7)
>>> print bar
[7]

Notice that even though you only appended 7 to foo, foo and bar actually refer to the same object (and therefore changing one changes both).
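
If you want two independent lists, here's a minimal sketch (plain Python, nothing tweepy-specific): build a copy instead of binding a second name to the same object:

>>> bar = []
>>> foo = list(bar)  # list(...) constructs a new list object (a shallow copy)
>>> foo.append(7)
>>> print foo
[7]
>>> print bar
[]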

If you meant to do this, it's a pretty great solution. However, if your intention was to segregate tweets from different listeners, it could be a huge headache. I personally would construct my class like this:

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(CustomStreamListener, self).__init__()

        self.list_of_tweets = []

This changes the tweets list to be only in the scope of your class. Also, I think it's appropriate to change the attribute name from self.save_file to self.list_of_tweets, because you also named the file that you're appending the tweets to save_file. Although this will not strictly cause an error, it's confusing that self.save_file is a list while save_file is a file. It helps future you, and anyone else who reads your code, figure out what everything does. More on variable naming.
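
To make that concrete, here's a minimal sketch of a listener that keeps both a well-named list and a well-named file handle; the filename and the newline handling are assumptions for illustration, not part of the original answer:

import json
import tweepy

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api, save_path='tweets.json'):
        super(CustomStreamListener, self).__init__()
        self.api = api
        self.list_of_tweets = []                # tweets collected by this listener only
        self.save_file = open(save_path, 'a')   # the file this listener appends to

    def on_data(self, data):
        self.list_of_tweets.append(json.loads(data))  # parsed dict for later processing
        self.save_file.write(data.strip() + '\n')     # raw JSON, one tweet per line
        return True                                   # keep the stream alive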

In my comment, I mentioned that you shouldn't use file as a variable name. file is a Python builtin function that constructs a new object of type file. You can technically overwrite it, but it is a very bad idea to do so. For more builtins, see the Python documentation.
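
A minimal sketch of a non-shadowing alternative, assuming you still want a plain text file; a with block also closes the handle for you:

import json

# 'tweets_file' does not shadow the builtin 'file', and the with block
# flushes and closes the handle even if the stream raises an exception
with open('today.txt', 'a') as tweets_file:
    tweets_file.write(json.dumps({'text': 'example tweet'}) + '\n')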

  • How do I filter results on multiple keywords?

All keywords are OR'd together in this type of search (source):

sapi.filter(track=['twitter', 'python', 'tweepy'])

This means that this will get tweets containing 'twitter', 'python', or 'tweepy'. If you want the intersection (AND) of all of the terms, you have to post-process by checking each tweet against the list of all terms you want to search for.
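
A minimal post-processing sketch along those lines; the term list and the case-insensitive matching are illustrative assumptions:

terms = ['twitter', 'python', 'tweepy']

def contains_all(text, terms):
    # True only when every term appears in the tweet text (an AND over all terms)
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)

# e.g. inside on_status, keep only tweets that match every term:
# if contains_all(status.text, terms): ...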

  • How do I filter results by location and keyword?

I actually just realized that you asked this as its own question, as I was about to suggest. A regex post-processing solution is a good way to accomplish this. You could also try filtering by both location and keyword like so:

sapi.filter(locations=[103.60998,1.25752,104.03295,1.44973], track=['twitter'])
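
Note that the streaming API documented track and locations as OR'd together, so tweets matching either filter come through; a regex post-process like the sketch below (the brand pattern is an illustrative assumption) restores the AND:

import re

# matches any of the example brand keywords as whole words, case-insensitively
brand_pattern = re.compile(r'\b(apple|htc|samsung)\b', re.IGNORECASE)

def in_box_and_on_topic(tweet):
    # keep a tweet only if it carries coordinates AND mentions a brand
    return (tweet.get('coordinates') is not None and
            brand_pattern.search(tweet.get('text', '')) is not None)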

  • What is the best way to store/process tweets?

That depends on how many you'll be collecting. I'm a fan of databases, especially if you're planning to do sentiment analysis on a lot of tweets. When you collect data, you should only collect the things you will need. This means that when you save results to your database (or wherever) in your on_data method, you should extract the important parts from the JSON and not save anything else. If, for example, you want to look at brand, country, and time, take only those three things; don't save the entire JSON dump of the tweet, because it'll just take up unnecessary space.
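
As a minimal sketch of that idea, assuming sqlite3 and illustrative column choices (id is added as a key; brand you would derive yourself from the text):

import sqlite3

conn = sqlite3.connect('tweets.db')
conn.execute('CREATE TABLE IF NOT EXISTS tweets '
             '(id TEXT PRIMARY KEY, country TEXT, created_at TEXT, text TEXT)')

def save_tweet(tweet):
    # store only the fields you will analyze, not the whole JSON dump
    place = tweet.get('place') or {}
    conn.execute('INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)',
                 (tweet.get('id_str'), place.get('country'),
                  tweet.get('created_at'), tweet.get('text')))
    conn.commit()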

