Frequency of ngrams (strings) in tokenized text

Question

I have a set of unique ngrams (a list called ngramlist) and ngram-tokenized text (a list called ngrams). I want to construct a new vector, freqlist, where each element of freqlist is the fraction of ngrams equal to the corresponding element of ngramlist. I wrote the following code, which gives the correct output, but I wonder whether there is a way to optimize it:

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
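
As a toy illustration (with made-up lists), each element of freqlist is the relative frequency in ngrams of the corresponding element of ngramlist:

ngrams = ['a b c', 'b c d', 'a b c']     # joined trigrams of the text (made up)
ngramlist = ['a b c', 'b c d', 'c d e']  # unique ngrams of interest (made up)
# the comprehension above yields [2/3, 1/3, 0.0]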

I imagine there is a function in nltk or elsewhere that does this faster, but I am not sure which one.

Thanks!

Edit: for what it's worth, the ngrams are produced as the joined output of nltk.util.ngrams, and ngramlist is just a list made from the set of all found ngrams.
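
For concreteness, a minimal sketch (with a made-up token list) of how those two lists could be built:

from nltk.util import ngrams

tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']   # made-up tokens
trigrams = [' '.join(t) for t in ngrams(tokens, 3)]  # joined nltk.util.ngrams output
# ['the cat sat', 'cat sat on', 'sat on the', 'on the mat']
ngramlist = list(set(trigrams))                      # unique ngrams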

Edit 2:

Here is reproducible code to test the freqlist line (the rest of the code is not really what I care about):

from nltk.util import ngrams
import wikipedia
import nltk
import pandas as pd

articles = ['New York City', 'Moscow', 'Beijing']
tokenizer = nltk.tokenize.TreebankWordTokenizer()

# fetch each article and tokenize its content
data = {'article': [], 'treebank_tokenizer': []}
for article in articles:
    data['article'].append(wikipedia.page(article).content)
    data['treebank_tokenizer'].append(tokenizer.tokenize(data['article'][-1]))

df = pd.DataFrame(data)

# join each token trigram into a single string
df['ngrams-3'] = df['treebank_tokenizer'].map(
    lambda x: [' '.join(t) for t in ngrams(x, 3)])

# unique trigrams across all articles
ngramlist = list(set(trigram for sublist in df['ngrams-3'] for trigram in sublist))

# the line in question: relative frequency of every unique trigram in each article
df['freqlist'] = df['ngrams-3'].map(
    lambda ngrams_: [sum(int(ngram == ngram_candidate)
                         for ngram_candidate in ngrams_) / len(ngrams_)
                     for ngram in ngramlist])

Answer

You can probably optimize this a bit by pre-computing some quantities and using a Counter. This will be especially useful if most of the elements in ngramlist are contained in ngrams. Your current code is:

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]

You certainly don't need to iterate over ngrams every single time you check an ngram. One pass over ngrams will make this algorithm O(n) instead of the O(n²) one you have now. Remember, shorter code is not necessarily better or more efficient code:

from collections import Counter
...

counter = Counter(ngrams)
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
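
Note that a Counter returns 0 for keys it has never seen, so counter[ngram] / size would behave the same here; .get(ngram, 0) just makes the zero fallback explicit.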

To use this properly with map, you would have to write a def function instead of a lambda:

def count_ngrams(ngrams_):
    # one counting pass per article, then a lookup per unique ngram
    counter = Counter(ngrams_)
    size = len(ngrams_)
    return [counter.get(ngram, 0) / size for ngram in ngramlist]

df['freqlist'] = df['ngrams-3'].map(count_ngrams)
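
As a quick sanity check on made-up data, the Counter version agrees with the original quadratic comprehension:

from collections import Counter

ngramlist = ['a b', 'b c', 'c d']        # made-up unique ngrams

def count_ngrams(ngrams_):
    counter = Counter(ngrams_)
    size = len(ngrams_)
    return [counter.get(ngram, 0) / size for ngram in ngramlist]

sample = ['a b', 'b c', 'a b', 'a b']    # made-up tokenized text
fast = count_ngrams(sample)
slow = [sum(int(ngram == ngram_candidate)
            for ngram_candidate in sample) / len(sample)
        for ngram in ngramlist]
assert fast == slow == [0.75, 0.25, 0.0]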
