n-grams from text in python


Question


An update to my previous post, with some changes:

Say that I have 100 tweets. In those tweets, I need to extract: 1) food names, and 2) beverage names. I also need to attach type (drink or food) and an id-number (each item has a unique id) for each extraction.

I already have a lexicon with names, type and id-number:

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}


Example tweets:

After various processing of "tweet_1" I have these sentences:

sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']

My requested output (it can be a type other than list):

["tweet_id_1",
 [[["dr pepper"], ["drink", "d_124"]],
  [["coca cola"], ["drink", "d_234"]],
  [["banana split"], ["food", "f_567"]],
  [["ice cream"], ["food", "f_789"]]],

 "tweet_id_1",,
 [[["coca cola"], ["drink", "d_234"]],
  [["banana"], ["food", "f_456"]]]]

It's important that the output should NOT extract unigrams that sit inside larger ngrams (n>1), i.e. it should NOT look like this:

["tweet_id_1",
 [[["dr pepper"], ["drink", "d_124"]],
  [["coca cola"], ["drink", "d_234"]],
  [["cola"], ["drink", "d_345"]],
  [["banana split"], ["food", "f_567"]],
  [["banana"], ["food", "f_456"]],
  [["ice cream"], ["food", "f_789"]],
  [["cream"], ["food", "f_678"]]],

 "tweet_id_1",
 [[["coca cola"], ["drink", "d_234"]],
  [["cola"], ["drink", "d_345"]],
  [["banana"], ["food", "f_456"]]]]



Ideally, I would like to be able to run my sentences through various nltk filters like lemmatize() and pos_tag() BEFORE the extraction, to get an output like the following. But with this regexp solution, if I do that, then all the words are split into unigrams, or they will generate 1 unigram and 1 bigram from the string "coca cola", which would produce output I did not want (as in the example above). The ideal output (again, the type of the output is not important):

["tweet_id_1",
 [[[("dr pepper", "NN")], ["drink", "d_124"]],
  [[("coca cola", "NN")], ["drink", "d_234"]],
  [[("banana split", "NN")], ["food", "f_567"]],
  [[("ice cream", "NN")], ["food", "f_789"]]],

 "tweet_id_1",
 [[[("coca cola", "NN")], ["drink", "d_234"]],
  [[("banana", "NN")], ["food", "f_456"]]]]

Answer

May not be the most efficient solution, but this will definitely get you started -

sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

chunks = []

for sentence in sentences:
    for lex in lexicon_list:
        if lex in sentence:
            # record the matched phrase together with its [type, id] values
            chunks.append({lex: list(lexicon[lex].values())})
            # blank out the match so the unigrams inside it cannot match later
            sentence = sentence.replace(lex, '')

print(chunks)

Output

[{'dr pepper': ['drink', 'd_123']}, {'coca cola': ['drink', 'd_234']}, {'banana split': ['food', 'f_567']}, {'ice cream': ['food', 'f_789']}, {'coca cola': ['drink', 'd_234']}, {'banana': ['food', 'f_456']}]

Explanation

lexicon_list = list(lexicon.keys()) takes the list of phrases that need to be searched and sorts them by word count (so that bigger chunks are found first).
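
With the lexicon above, and assuming Python 3.7+ (where dicts keep insertion order), the resulting search order is:

print(lexicon_list)
# ['dr pepper', 'coca cola', 'banana split', 'ice cream', 'cola', 'banana', 'cream']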

The output is a list of dicts, where each dict has a list value.
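
Two caveats worth noting, with a sketch of one way around them (not part of the answer above): the in check is a plain substring test, so 'cola' would also match inside an unrelated longer word, and the flat chunks list loses the per-tweet grouping the question asks for. A word-boundary regex with longest-first alternation handles both:

import re

# Longest-first alternation makes the regex try 'banana split' before
# 'banana', and \b stops 'cola' from matching inside a longer word.
pattern = re.compile(
    r'\b(' + '|'.join(re.escape(k)
                      for k in sorted(lexicon, key=len, reverse=True)) + r')\b')

results = []
for i, sentence in enumerate(sentences, start=1):
    # 'tweet_id_%d' is a hypothetical id; the sentences here only carry an index
    results.append(['tweet_id_%d' % i,
                    [[[m], [lexicon[m]['type'], lexicon[m]['id']]]
                     for m in pattern.findall(sentence)]])

print(results)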
