Classify type of tweet (tweet/retweet/mention) based on tweet text in Python


Question

Pulling from a couple of different examples, I've been able to create a simple Python script that parses the JSON output from the Twitter Streaming API and prints out the screen_name and text for each tweet. I would like to modify my code to also classify each tweet as one of the following:

(1) Retweet --> There is an "RT @anyusername" somewhere in the tweet text

(2) Mention --> There is an "@anyusername" but no "RT @anyusername" in the tweet text

(3) Tweet --> There is neither an "RT @anyusername" nor any "@anyusername" in the tweet text

I can do this in Excel with the following formula, but I can't figure out how to do it in Python yet.

=IF(IFERROR(FIND("RT @",B2)>0,"False"),"Retweet",IF(IFERROR(FIND("@",B2)>0,"False"),"Mention","Tweet"))
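
For comparison, the nested IF in the formula maps onto plain substring tests in Python. The following is only a sketch of the same logic; the function name `classify_like_excel` is hypothetical and not part of the original script.

```python
def classify_like_excel(text):
    """Mirror the nested IF/FIND formula with case-sensitive substring tests."""
    # FIND("RT @", B2) succeeds -> "Retweet"
    if "RT @" in text:
        return "Retweet"
    # Otherwise FIND("@", B2) succeeds -> "Mention"
    if "@" in text:
        return "Mention"
    # Neither substring is present -> "Tweet"
    return "Tweet"


# Made-up sample texts, one per category
for sample in ["RT @alice: hello", "thanks @bob", "just a tweet"]:
    print(sample, "->", classify_like_excel(sample))
```

Like the formula, this treats any "@" character as a mention, so text containing an email address would also be labeled "Mention".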

Existing Code

import csv
import json
import sys

# Usage: python script.py <input.json> <output.csv>
with open(sys.argv[1], encoding='utf-8') as in_file, \
        open(sys.argv[2], 'w', newline='', encoding='utf-8') as out_file:
    out = csv.writer(out_file)
    # tweet_type is the column that still needs to be filled in
    out.writerow(['tweet_author', 'tweet_text', 'tweet_type'])

    for line in in_file:
        try:
            tweet = json.loads(line)
        except ValueError:
            # Skip blank or malformed lines instead of reusing the previous tweet
            continue

        tweet_text = tweet['text']

        row = (
            tweet['user']['screen_name'],
            tweet_text,
        )
        out.writerow(row)

Recommended Answer

I don't have a Python interpreter here, but it should be something similar to this:

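import re


def url_match(tweet):
    # re.search scans the whole text; re.match would only test the start,
    # missing an "RT @user" or "@user" that appears later in the tweet.
    if re.search(r'RT\s@\w+', tweet):
        return "RT"
    elif re.search(r'@\w+', tweet):
        return "mention"
    else:
        return "tweet"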
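
To connect this to the script in the question, the classifier's result can be written as a third CSV column. The sketch below assumes Python 3 and restates the classifier with `re.search` so that the markers are found anywhere in the text. The names `classify` and `write_classified` and the sample data are hypothetical, used only for illustration.

```python
import csv
import io
import json
import re


def classify(text):
    """Return 'RT', 'mention' or 'tweet' for a tweet's text."""
    if re.search(r'RT\s@\w+', text):
        return 'RT'
    if re.search(r'@\w+', text):
        return 'mention'
    return 'tweet'


def write_classified(in_file, out_file):
    """Write author, text and type of each JSON tweet line as a CSV row."""
    out = csv.writer(out_file)
    out.writerow(['tweet_author', 'tweet_text', 'tweet_type'])
    for line in in_file:
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        # Skip JSON messages that are not tweets (no text or user field)
        if not isinstance(tweet, dict) or 'text' not in tweet or 'user' not in tweet:
            continue
        text = tweet['text']
        out.writerow([tweet['user']['screen_name'], text, classify(text)])


# In-memory demonstration with made-up data instead of real files
sample = io.StringIO(
    '{"user": {"screen_name": "alice"}, "text": "RT @bob: hi"}\n'
    'not json\n'
    '{"user": {"screen_name": "carol"}, "text": "hello @dave"}\n'
)
result = io.StringIO()
write_classified(sample, result)
print(result.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by the opened input and output files; the output file should be opened with `newline=''`, as the `csv` module documentation recommends.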

Note: this works for this classification, but if you also want to retrieve the usernames (i.e. @USERNAME), you will have to tweak it a little more.
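
One possible tweak for retrieving usernames is a capturing group combined with `re.findall`. This is a minimal sketch that assumes a username is "@" followed by letters, digits, or underscores; the names `USERNAME` and `extract_usernames` are hypothetical.

```python
import re

# Assumption: a username is "@" followed by letters, digits or underscores.
# The parentheses capture the name without the leading "@".
USERNAME = re.compile(r'@(\w+)')


def extract_usernames(text):
    """Return every @username found in the text, without the '@'."""
    return USERNAME.findall(text)


print(extract_usernames("RT @alice: thanks @bob_42!"))
```

Because the pattern matches "@" anywhere, it would also pick up the domain part of an email address; a stricter pattern may be needed if the tweet text can contain such strings.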
