n-grams from text in python


Question


An update to my previous post, with some changes:

Say that I have 100 tweets. In those tweets, I need to extract: 1) food names, and 2) beverage names. I also need to attach type (drink or food) and an id-number (each item has a unique id) for each extraction.

I already have a lexicon with names, type and id-number:

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}


Example tweets:

After various processing of "tweet_1" I have these sentences:

sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']

My requested output (it can be a type other than list):

["tweet_id_1",
 [[["dr pepper"], ["drink", "d_124"]],
  [["coca cola"], ["drink", "d_234"]],
  [["banana split"], ["food", "f_567"]],
  [["ice cream"], ["food", "f_789"]]],

 "tweet_id_1",,
 [[["coca cola"], ["drink", "d_234"]],
  [["banana"], ["food", "f_456"]]]]

It's important that the output should NOT extract unigrams that sit inside larger ngrams (n>1), i.e. it should NOT look like this:

["tweet_id_1",
 [[["dr pepper"], ["drink", "d_124"]],
  [["coca cola"], ["drink", "d_234"]],
  [["cola"], ["drink", "d_345"]],
  [["banana split"], ["food", "f_567"]],
  [["banana"], ["food", "f_456"]],
  [["ice cream"], ["food", "f_789"]],
  [["cream"], ["food", "f_678"]]],

 "tweet_id_1",
 [[["coca cola"], ["drink", "d_234"]],
  [["cola"], ["drink", "d_345"]],
  [["banana"], ["food", "f_456"]]]]



Ideally, I would like to be able to run my sentences through various nltk filters like lemmatize() and pos_tag() BEFORE the extraction, to get an output like the following. But with this regexp solution, if I do that, then all the words are split into unigrams, or they will generate 1 unigram and 1 bigram from the string "coca cola", which would produce output I did not want (as in the example above). The ideal output (again, the type of the output is not important):

["tweet_id_1",
 [[[("dr pepper", "NN")], ["drink", "d_124"]],
  [[("coca cola", "NN")], ["drink", "d_234"]],
  [[("banana split", "NN")], ["food", "f_567"]],
  [[("ice cream", "NN")], ["food", "f_789"]]],

 "tweet_id_1",
 [[[("coca cola", "NN")], ["drink", "d_234"]],
  [[("banana", "NN")], ["food", "f_456"]]]]

Answer

May not be the most efficient solution, but this will definitely get you started -

sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

chunks = []

for sentence in sentences:
    for lex in lexicon_list:
        if lex in sentence:
            # record the matched phrase together with its [type, id] values
            chunks.append({lex: list(lexicon[lex].values())})
            # blank out the match so the unigrams inside it cannot match later
            sentence = sentence.replace(lex, '')

print(chunks)

Output

[{'dr pepper': ['drink', 'd_123']}, {'coca cola': ['drink', 'd_234']}, {'banana split': ['food', 'f_567']}, {'ice cream': ['food', 'f_789']}, {'coca cola': ['drink', 'd_234']}, {'banana': ['food', 'f_456']}]

Explanation

lexicon_list = list(lexicon.keys()) takes the list of phrases that need to be searched and sorts them by word count (so that bigger chunks are found first).
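
With the lexicon above, and assuming Python 3.7+ (where dicts keep insertion order), the resulting search order is:

print(lexicon_list)
# ['dr pepper', 'coca cola', 'banana split', 'ice cream', 'cola', 'banana', 'cream']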

The output is a list of dicts, where each dict has a list value.
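
Two caveats worth noting, with a sketch of one way around them (not part of the answer above): the in check is a plain substring test, so 'cola' would also match inside an unrelated longer word, and the flat chunks list loses the per-tweet grouping the question asks for. A word-boundary regex with longest-first alternation handles both:

import re

# Longest-first alternation makes the regex try 'banana split' before
# 'banana', and \b stops 'cola' from matching inside a longer word.
pattern = re.compile(
    r'\b(' + '|'.join(re.escape(k)
                      for k in sorted(lexicon, key=len, reverse=True)) + r')\b')

results = []
for i, sentence in enumerate(sentences, start=1):
    # 'tweet_id_%d' is a hypothetical id; the sentences here only carry an index
    results.append(['tweet_id_%d' % i,
                    [[[m], [lexicon[m]['type'], lexicon[m]['id']]]
                     for m in pattern.findall(sentence)]])

print(results)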
