Frequency of ngrams (strings) in tokenized text
Question
I have a set of unique ngrams (a list called ngramlist) and ngram-tokenized text (a list called ngrams). I want to construct a new vector, freqlist, where each element of freqlist is the fraction of ngrams that is equal to that element of ngramlist. I wrote the following code, which gives the correct output, but I wonder if there is a way to optimize it:
freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
I imagine there is a function in nltk or elsewhere that does this faster, but I am not sure which one.

Thanks!
For what it's worth, the ngrams are produced as the joined output of nltk.util.ngrams, and ngramlist is just a list made from the set of all found ngrams.
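As a quick illustration of that setup (the tokens here are made up, and nltk.util.ngrams is replaced by an equivalent zip-based helper to keep the sketch dependency-free):

```python
def ngrams(sequence, n):
    # Equivalent to nltk.util.ngrams for a list input:
    # yields tuples of n consecutive items.
    return zip(*(sequence[i:] for i in range(n)))

tokens = ["the", "cat", "sat", "on", "the", "mat"]
# Join each trigram tuple into a single string, as described above.
trigrams = [" ".join(t) for t in ngrams(tokens, 3)]
# trigrams == ['the cat sat', 'cat sat on', 'sat on the', 'on the mat']

# ngramlist: the set of all found ngrams (sorted here for determinism).
ngramlist = sorted(set(trigrams))
```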
Edit2:
Here is reproducible code to test the freqlist line (the rest of the code is not really what I care about):
from nltk.util import ngrams
import wikipedia
import nltk
import pandas as pd

articles = ['New York City', 'Moscow', 'Beijing']
tokenizer = nltk.tokenize.TreebankWordTokenizer()
data = {'article': [], 'treebank_tokenizer': []}
for article in articles:
    data['article'].append(wikipedia.page(article).content)
    data['treebank_tokenizer'].append(tokenizer.tokenize(data['article'][-1]))
df = pd.DataFrame(data)
df['ngrams-3'] = df['treebank_tokenizer'].map(
    lambda x: [' '.join(t) for t in ngrams(x, 3)])
ngramlist = list(set([trigram for sublist in df['ngrams-3'].tolist()
                      for trigram in sublist]))
df['freqlist'] = df['ngrams-3'].map(
    lambda ngrams_: [sum(int(ngram == ngram_candidate)
                         for ngram_candidate in ngrams_) / len(ngrams_)
                     for ngram in ngramlist])
Answer
You can probably optimize this a bit by pre-computing some quantities and using a Counter. This will be especially useful if most of the elements in ngramlist are contained in ngrams.
freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
You certainly don't need to iterate over ngrams every single time you check an ngram. One pass over ngrams will make this algorithm O(n) instead of the O(n²) one you have now. Remember, shorter code is not necessarily better or more efficient code:
from collections import Counter
...
counter = Counter(ngrams)
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
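To see that the single-pass Counter version agrees with the quadratic list comprehension, here is a quick check on toy data (the ngram strings are invented for the sketch):

```python
from collections import Counter

ngrams_ = ["a b", "b c", "a b", "c d"]
ngramlist = ["a b", "b c", "c d", "x y"]

# Quadratic version from the question: one full scan per candidate ngram.
slow = [sum(int(g == cand) for cand in ngrams_) / len(ngrams_)
        for g in ngramlist]

# Single-pass version: count everything once, then look up each ngram.
counter = Counter(ngrams_)
size = len(ngrams_)
fast = [counter.get(g, 0) / size for g in ngramlist]

assert slow == fast  # both give [0.5, 0.25, 0.25, 0.0]
```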
To use this properly, you would have to write a def function instead of a lambda:
def count_ngrams(ngrams):
    counter = Counter(ngrams)
    size = len(ngrams)
    freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
    return freqlist

df['freqlist'] = df['ngrams-3'].map(count_ngrams)
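Putting it all together on toy data (skipping the wikipedia download and tokenization; the token lists below are invented, and a zip-based helper stands in for nltk.util.ngrams):

```python
from collections import Counter

import pandas as pd


def make_trigrams(tokens):
    # Stand-in for [' '.join(t) for t in nltk.util.ngrams(tokens, 3)].
    return [" ".join(t) for t in zip(tokens, tokens[1:], tokens[2:])]


df = pd.DataFrame({
    "treebank_tokenizer": [
        ["new", "york", "is", "big"],
        ["moscow", "is", "big", "too"],
    ]
})
df["ngrams-3"] = df["treebank_tokenizer"].map(make_trigrams)

# Vocabulary of all found trigrams (sorted for a deterministic order).
ngramlist = sorted({g for row in df["ngrams-3"] for g in row})


def count_ngrams(ngrams_):
    counter = Counter(ngrams_)
    size = len(ngrams_)
    return [counter.get(g, 0) / size for g in ngramlist]


df["freqlist"] = df["ngrams-3"].map(count_ngrams)
```

Each row's freqlist lines up with ngramlist, so rows are directly comparable as frequency vectors.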