Tracking keywords in a live stream of tweets

Question

I installed and tried out tweepy, and I am currently using the following function:

From the API reference:

API.public_timeline()

Returns the 20 most recent statuses from non-protected users who have set a custom user icon. The public timeline is cached for 60 seconds, so requesting it more often than that is a waste of resources.

However, I want to extract all tweets that match a certain regular expression from the complete live stream. I could put public_timeline() inside a while True loop, but that would probably run into rate limiting. Either way, I don't think it can cover all current tweets.

How could that be done? If not all tweets, then I want to extract as many tweets matching a certain keyword as possible.

Recommended answer

The streaming API is what you want. I use a library called tweetstream. Here's my basic listening function:

import time

import tweetstream

# username, password, and db_initpop() are assumed to be defined elsewhere in the module.


def retrieve_tweets(numtweets=10, *args):
    """
    This function optionally takes one or more arguments as keywords to filter tweets.
    It iterates through tweets from the stream that meet the given criteria and sends them
    to the database population function on a per-instance basis, so as to avoid disaster
    if the stream is disconnected.

    Both SampleStream and FilterStream methods access Twitter's stream of status elements.
    For status element documentation, (including proper arguments for tweet['arg'] as seen
    below) see https://dev.twitter.com/docs/api/1/get/statuses/show/%3Aid.
    """
    filters = [str(key) for key in args]
    if not filters:
        # no keywords given: read the unfiltered sample stream
        stream = tweetstream.SampleStream(username, password)
    else:
        stream = tweetstream.FilterStream(username, password, track=filters)
    try:
        count = 0
        while count < numtweets:
            # read from the stream until one tweet has been stored, then count it
            for tweet in stream:
                # a check is needed on text as some "tweets" are actually just API operations
                # the language selection doesn't really work but it's better than nothing(?)
                if tweet.get('text') and tweet['user']['lang'] == 'en':
                    if tweet['retweet_count'] == 0:
                        # bundle up the features I want and send them to the db population function
                        bundle = (tweet['id'], tweet['user']['screen_name'],
                                  tweet['retweet_count'], tweet['text'])
                        db_initpop(bundle)
                        break
                    else:
                        # a RT has a different structure.  This bundles the original tweet.  Getting the
                        # retweets comes later, after the stream is de-accessed.
                        bundle = (tweet['retweeted_status']['id'],
                                  tweet['retweeted_status']['user']['screen_name'],
                                  tweet['retweet_count'], tweet['retweeted_status']['text'])
                        db_initpop(bundle)
                        break
            count += 1
    except tweetstream.ConnectionError as e:
        print('Disconnected from Twitter at %s.  Reason: %s'
              % (time.strftime("%d %b %Y %H:%M:%S", time.localtime()), e.reason))
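
The bundling step inside the loop can be pulled out into a small pure function, which makes it testable without a live connection. This is only a sketch, assuming status dicts shaped like the ones used in the function above. The name make_bundle is hypothetical, and it decides whether a status is a retweet by looking for the retweeted_status key rather than by checking retweet_count, which is my assumption about a safer test and not something the original code does.

```python
def make_bundle(tweet):
    """Return (id, screen_name, retweet_count, text) for a status dict, or None.

    Hypothetical helper, not part of tweetstream.
    """
    if not tweet.get('text'):
        # Not a status at all (for example a delete notice), so nothing to store.
        return None
    # For a retweet, describe the original tweet instead of the retweet wrapper.
    source = tweet.get('retweeted_status', tweet)
    return (source['id'], source['user']['screen_name'],
            tweet.get('retweet_count', 0), source['text'])
```

Inside the loop, the two branches would then collapse to a single call such as db_initpop(make_bundle(tweet)), guarded by a check that the result is not None.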

I haven't looked in a while, but I'm pretty sure that this library is just accessing the sample stream (as opposed to the firehose). HTH.

Edit to add: you say you want the "complete live stream", aka the firehose. That's financially and technically expensive, and only very large companies are allowed to have it. Look at the docs and you'll see that the sample is basically representative.
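
On the regular expression part of the question: as far as I know, the track filter matches keywords rather than patterns, so one option is to track a broad keyword and apply the regex on the client side to whatever the stream yields. A minimal sketch, assuming each item is a dict with an optional 'text' key as in the function above; matching_tweets is a hypothetical name, and the list below is stand-in data rather than real stream output.

```python
import re


def matching_tweets(tweets, pattern, flags=re.IGNORECASE):
    """Yield only the status dicts whose text matches the regular expression."""
    regex = re.compile(pattern, flags)
    for tweet in tweets:
        text = tweet.get('text')
        # Skip non-status messages (no text) and statuses that do not match.
        if text and regex.search(text):
            yield tweet


# Stand-in data; in practice `tweets` would be the stream object itself.
sample = [
    {'text': 'Learning Python today'},
    {'delete': {'status': {'id': 1}}},
    {'text': 'nothing relevant here'},
    {'text': 'pythons are snakes'},
]
matched = [t['text'] for t in matching_tweets(sample, r'\bpython\b')]
print(matched)
```

Because it is a generator, it can wrap a stream that never ends and hand over matches one at a time as they arrive.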
