Extracting n-grams from tweets in Python


Problem description

Say I have 100 tweets.
From those tweets, I need to extract: 1) food names, and 2) beverage names.

Example tweet:

"Yesterday I had a coca cola, and a hot dog for lunch, and some bana split for desert. I liked the coke, but the banana in the banana split dessert was ripe"

I have two lexicons at my disposal: one with food names and one with beverage names.

Examples from the food names lexicon:
"hot dog"
"banana"
"banana split"

Examples from the beverage names lexicon:
"coke"
"cola"
"coca cola"

What I should be able to extract:

[[["coca cola", "beverage"], ["hot dog", "food"], ["banana split", "food"]],
[["coke", "beverage"], ["banana", "food"], ["banana split", "food"]]]

The names in the lexicons can be 1-5 words long. How do I go about extracting n-grams from the tweets using my lexicons?

Recommended answer

Not sure what you have tried so far; below is a solution using ngrams from nltk and a dict().

from nltk import ngrams

tweet = "Yesterday I had a coca cola, and a hot dog for lunch, and some bana split for desert. I liked the coke, but the banana in the banana split dessert was ripe"

# Your lexicons
lexicon_food = ["hot dog", "banana", "banana split"]
lexicon_beverage = ["coke", "cola", "coca cola"]
lexicon_dict = {x: [x, 'Food'] for x in lexicon_food}
lexicon_dict.update({x: [x, 'Beverage'] for x in lexicon_beverage})

# Function to extract lexicon items
def extract(g, lex):
    # Prefer the full bigram; otherwise fall back to its first word.
    # Returns None when neither is in the lexicon.
    bigram = ' '.join(g)
    if bigram in lex:
        return lex[bigram]
    return lex.get(g[0])

# Your task
out = [[extract(g, lexicon_dict) for g in ngrams(sentence.split(), 2) if extract(g, lexicon_dict)] 
        for sentence in tweet.replace(',', '').lower().split('.')]
print(out)

Output:

[[['coca cola', 'Beverage'], ['cola', 'Beverage'], ['hot dog', 'Food']], 
 [['coke', 'Beverage'], ['banana', 'Food'], ['banana split', 'Food']]]
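To make it easier to see what the list comprehension above iterates over, here is a small sketch that builds the same adjacent word pairs with plain Python instead of nltk. The helper name `bigrams` and the sample sentence are illustrative assumptions, not part of the original answer. It also exposes one caveat: because only the first word of each pair is checked as a single-word name, the last word of a sentence is never tested on its own.

```python
def bigrams(words):
    """Return adjacent word pairs as tuples, similar to ngrams(words, 2)."""
    return list(zip(words, words[1:]))

# Hypothetical sample sentence that ends in a lexicon word
words = "i liked the coke but the banana".split()
pairs = bigrams(words)
print(pairs)

# Words that appear as the first element of some pair
first_words = {p[0] for p in pairs}
print("banana" in first_words)
```

Here "banana" only ever appears as the second element of a pair, so a pair-based lookup that falls back to the first word would not report it.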


Approach 2 (avoid matching both "coca cola" and "cola")

def extract2(sentence, lex):
    extracted_words = []
    words = sentence.split()
    i = 0
    while i < len(words):
        bigram = ' '.join(words[i:i+2])
        if bigram in lex:
            # Two-word match: record it and skip both words,
            # so "cola" is not matched again inside "coca cola".
            extracted_words.append(lex[bigram])
            i += 2
        elif words[i] in lex:
            extracted_words.append(lex[words[i]])
            i += 1
        else:
            i += 1
    return extracted_words

out = [extract2(s, lexicon_dict) for s in tweet.replace(',', '').lower().split('.')]
print(out)

Output:

[[['coca cola', 'Beverage'], ['hot dog', 'Food']], 
 [['coke', 'Beverage'], ['banana', 'Food'], ['banana split', 'Food']]]

Note that nltk is not needed here.
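The question mentions that lexicon names can be 1-5 words long, while both approaches above only look at one or two words. One possible way to cover that is a greedy scan that tries the longest candidate first at each position. This is a minimal sketch under stated assumptions, not the original answer's method: the names `build_lexicon`, `extract_longest`, and `max_len`, the regex tokenizer, the lowercase labels, and the sample tweet are all hypothetical choices made for illustration.

```python
import re

def build_lexicon(food, beverage):
    """Map each lowercase name to its [name, label] pair."""
    lex = {name.lower(): [name.lower(), "food"] for name in food}
    lex.update({name.lower(): [name.lower(), "beverage"] for name in beverage})
    return lex

def extract_longest(sentence, lex, max_len=5):
    """Greedy left-to-right scan that prefers the longest lexicon match."""
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    found = []
    i = 0
    while i < len(words):
        # Try the longest window first, shrinking down to a single word
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in lex:
                found.append(lex[candidate])
                i += n  # skip the matched words so they are not reused
                break
        else:
            i += 1  # no name starts at this position
    return found

lexicon = build_lexicon(
    ["hot dog", "banana", "banana split"],
    ["coke", "cola", "coca cola"],
)
# Hypothetical sample tweet
tweet = "I had a coca cola and a hot dog. Then a banana split, and one more banana"
result = [extract_longest(s, lexicon) for s in tweet.split(".") if s.strip()]
print(result)
```

Because matched words are skipped, "cola" is not reported again inside "coca cola", and a single-word name at the end of a sentence is still found. To run this over many tweets, the same list comprehension can be applied to each tweet in turn.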
