n-grams from text in python
Question
An update to my previous post, with some changes:
Say that I have 100 tweets.
In those tweets, I need to extract: 1) food names, and 2) beverage names. I also need to attach a type (drink or food) and an id number (each item has a unique id) to each extraction.
I already have a lexicon with names, type and id-number:
lexicon = {
'dr pepper': {'type': 'drink', 'id': 'd_123'},
'coca cola': {'type': 'drink', 'id': 'd_234'},
'cola': {'type': 'drink', 'id': 'd_345'},
'banana': {'type': 'food', 'id': 'f_456'},
'banana split': {'type': 'food', 'id': 'f_567'},
'cream': {'type': 'food', 'id': 'f_678'},
'ice cream': {'type': 'food', 'id': 'f_789'}}
Example tweets:
在对"tweet_1"进行各种处理之后,我有以下句子:
After various processing of "tweet_1" I have these sentences:
sentences = [
'dr pepper is better than coca cola and suits banana split with ice cream',
'coca cola and banana is not a good combo']
My requested output (can be a type other than list):
["tweet_id_1",
[[["dr pepper"], ["drink", "d_123"]],
[["coca cola"], ["drink", "d_234"]],
[["banana split"], ["food", "f_567"]],
[["ice cream"], ["food", "f_789"]]],
"tweet_id_2",
[[["coca cola"], ["drink", "d_234"]],
[["banana"], ["food", "f_456"]]]]
It's important that the output should NOT extract unigrams within ngrams (n>1):
["tweet_id_1",
[[["dr pepper"], ["drink", "d_123"]],
[["coca cola"], ["drink", "d_234"]],
[["cola"], ["drink", "d_345"]],
[["banana split"], ["food", "f_567"]],
[["banana"], ["food", "f_456"]],
[["ice cream"], ["food", "f_789"]],
[["cream"], ["food", "f_678"]]],
"tweet_id_2",
[[["coca cola"], ["drink", "d_234"]],
[["cola"], ["drink", "d_345"]],
[["banana"], ["food", "f_456"]]]]
Ideally, I would like to be able to run my sentences through various nltk filters like lemmatize() and pos_tag() BEFORE the extraction, to get an output like the following. But with this regexp solution, if I do that, all the words are split into unigrams, or they generate 1 unigram and 1 bigram from the string "coca cola", which produces the output I did not want (as in the example above). The ideal output (again, the type of the output is not important):
["tweet_id_1",
[[[("dr pepper", "NN")], ["drink", "d_123"]],
[[("coca cola", "NN")], ["drink", "d_234"]],
[[("banana split", "NN")], ["food", "f_567"]],
[[("ice cream", "NN")], ["food", "f_789"]]],
"tweet_id_2",
[[[("coca cola", "NN")], ["drink", "d_234"]],
[[("banana", "NN")], ["food", "f_456"]]]]
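One way to let pos_tag() and lemmatize() run first without splitting "coca cola" into two unigrams is to merge multiword lexicon entries into single tokens beforehand. The helper below is a hypothetical sketch (merge_phrases is not part of nltk), using greedy longest-match-first merging over a token list:

```python
def merge_phrases(tokens, phrases):
    """Greedily merge token runs that match a known phrase, longest phrase first."""
    # sort phrase word-lists by length so bigrams win over their unigrams
    phrases = sorted((p.split() for p in phrases), key=len, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for p in phrases:
            if tokens[i:i + len(p)] == p:
                out.append(' '.join(p))  # keep the phrase as one token
                i += len(p)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = 'coca cola and banana split with ice cream'.split()
print(merge_phrases(tokens, ['coca cola', 'cola', 'banana split', 'ice cream']))
# → ['coca cola', 'and', 'banana split', 'with', 'ice cream']
```

The merged tokens could then be fed to a tagger, which would see "coca cola" as a single word rather than two.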
Answer
May not be the most efficient solution, but this will definitely get you started -
sentences = [
'dr pepper is better than coca cola and suits banana split with ice cream',
'coca cola and banana is not a good combo']
lexicon = {
'dr pepper': {'type': 'drink', 'id': 'd_123'},
'coca cola': {'type': 'drink', 'id': 'd_234'},
'cola': {'type': 'drink', 'id': 'd_345'},
'banana': {'type': 'food', 'id': 'f_456'},
'banana split': {'type': 'food', 'id': 'f_567'},
'cream': {'type': 'food', 'id': 'f_678'},
'ice cream': {'type': 'food', 'id': 'f_789'}}
lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

chunks = []
for sentence in sentences:
    for lex in lexicon_list:
        if lex in sentence:
            chunks.append({lex: list(lexicon[lex].values())})
            sentence = sentence.replace(lex, '')

print(chunks)
Output
[{'dr pepper': ['drink', 'd_123']}, {'coca cola': ['drink', 'd_234']}, {'banana split': ['food', 'f_567']}, {'ice cream': ['food', 'f_789']}, {'coca cola': ['drink', 'd_234']}, {'banana': ['food', 'f_456']}]
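One caveat: `if lex in sentence` is plain substring matching, so 'cola' would also match inside an unrelated word such as 'colada'. A word-boundary regex check avoids this; a small sketch:

```python
import re

def contains_phrase(sentence, phrase):
    # \b anchors the phrase to word boundaries, so 'cola' no longer
    # matches inside 'colada'
    return re.search(r'\b' + re.escape(phrase) + r'\b', sentence) is not None

print(contains_phrase('a pina colada please', 'cola'))  # → False
print(contains_phrase('coca cola please', 'cola'))      # → True
```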
Explanation
lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

takes the list of phrases that need to be searched and sorts them by length (so that bigger chunks are found first).
The output is a list of dicts, where each dict has list values.
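To get from the flat chunks list to the per-tweet shape the question asks for, the same loop can group matches by sentence and attach an id; a sketch (the tweet ids here are simply made up from the sentence index):

```python
sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

# longest phrases first, so bigrams are consumed before their unigrams
lexicon_list = sorted(lexicon, key=lambda s: len(s.split()), reverse=True)

result = []
for i, sentence in enumerate(sentences, start=1):
    matches = []
    for lex in lexicon_list:
        if lex in sentence:
            matches.append([[lex], [lexicon[lex]['type'], lexicon[lex]['id']]])
            sentence = sentence.replace(lex, '')  # blocks unigrams inside this match
    result.append(['tweet_id_%d' % i, matches])

print(result)
```

Note that matches appear in lexicon order rather than sentence order; if the original order matters, the match positions would have to be recorded as well.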