Filtering of tweets received from statuses/filter (streaming API)


Question

I am tracking N different keywords (for the sake of simplicity, let N = 3). So in GET statuses/filter, I pass 3 keywords in the "track" parameter.

The tweets I receive can match ANY of the 3 keywords. The problem is that I want to work out which tweet corresponds to which keyword, i.e. a mapping between each tweet and the keyword(s) given in the "track" parameter.

Apparently, there is no way to do this without some processing of the received tweets.

So I was wondering: what is the best way to do this processing? Search for the keywords in the text of the tweet? What about case-insensitivity? What about a keyword that contains several words, e.g. "Katrina Kaif"?

I am currently trying to formulate some regular expressions...

I was thinking the BEST way would be to use the same logic (regular expressions etc.) that the statuses/filter API itself uses. How can I find out what logic the Twitter statuses/filter API uses to match tweets to keywords?

Suggestions? Help?

P.S.: I am using Python, Tweepy, Regex, MongoDB/Apache S4 (for distributed computing)

Recommended answer

The first thing that comes to mind is to create a separate stream for every keyword and start each one in its own thread, like this:

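from __future__ import print_function  # lets the print() calls run on Python 2 and 3

from threading import Thread
import tweepy


class StreamListener(tweepy.StreamListener):
    """Listener that remembers which keyword its stream is tracking."""

    def __init__(self, keyword, api=None):
        super(StreamListener, self).__init__(api)
        self.keyword = keyword

    def on_status(self, tweet):
        # Not reached while on_data is overridden below without calling super.
        print('Ran on_status')

    def on_error(self, status_code):
        print('Error: ' + repr(status_code))
        return False  # returning False stops this stream

    def on_data(self, data):
        # Every raw message of this stream arrives here, so it can be
        # tagged with the single keyword the stream was opened for.
        print(self.keyword, data)
        print('Ok, this is actually running')


def start_stream(auth, track):
    # One stream per keyword: blocks until the stream is closed.
    tweepy.Stream(auth=auth, listener=StreamListener(track)).filter(track=[track])


# Replace the placeholder strings with your own credentials.
auth = tweepy.OAuthHandler('<consumer_key>', '<consumer_secret>')
auth.set_access_token('<key>', '<secret>')

track = ['obama', 'cats', 'python']
for item in track:
    thread = Thread(target=start_stream, args=(auth, item))
    thread.start()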

If you still want to distinguish tweets by keyword yourself within a single stream, here's some info on how Twitter uses the track request parameter. There are some edge cases that could cause problems.
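For the single-stream route, here is a minimal sketch of matching tweet text against the tracked phrases yourself. It assumes, based on my reading of the track documentation, that a phrase matches when all of its space-separated terms appear in the tweet, in any order and ignoring case. The helper names `tokenize` and `match_keywords` are made up for this example, and the `\w+` tokenization is only an approximation of whatever Twitter does internally (it ignores URLs, screen names and other fields that Twitter may also match against).

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens.

    Assumption: splitting on non-word characters is close enough to
    Twitter's own tokenization for plain keywords and hashtags.
    """
    return set(re.findall(r"\w+", text.lower()))


def match_keywords(text, phrases):
    """Return the tracked phrases whose terms all occur in the text."""
    tokens = tokenize(text)
    matched = []
    for phrase in phrases:
        terms = tokenize(phrase)
        # A phrase matches only if every one of its terms is a token
        # of the tweet; order and case do not matter.
        if terms and terms <= tokens:
            matched.append(phrase)
    return matched


if __name__ == "__main__":
    track = ["Katrina Kaif", "obama", "cats"]
    print(match_keywords("Katrina KAIF was spotted with #cats", track))
```

With Tweepy, something like `match_keywords(tweet.text, track)` could be called from `on_status`. A tweet can match several phrases at once, which is why the function returns a list rather than a single keyword.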

Hope that helps.
