Replace Words on the Basis of Bigram Frequency, Python

Problem description

I have a Series-type object to which I have to apply a function that uses bigrams to correct a word when it occurs next to another one. I created a list of bigrams, sorted it by frequency (highest first), and called it fdist.

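import nltk

bigrams = [b for l in text2 for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
freq = nltk.FreqDist(bigrams)  # computes frequency of occurrence
# keys() is sorted by frequency in NLTK 2 only; NLTK 3's FreqDist is a Counter,
# so use freq.most_common() there to get a frequency-ranked list
fdist = freq.keys()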
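
The ranked bigram list described above can be sketched without NLTK. This is a minimal illustration that uses collections.Counter in place of nltk.FreqDist; the function name ranked_bigrams and the sample sentences are hypothetical and not part of the original question.

```python
from collections import Counter


def ranked_bigrams(lines):
    """Return the bigrams found in `lines`, most frequent first."""
    counts = Counter()
    for line in lines:
        words = line.split()
        # Pair each word with its successor: (w0, w1), (w1, w2), ...
        counts.update(zip(words[:-1], words[1:]))
    # most_common() orders the (bigram, count) pairs by descending count
    return [bigram for bigram, _ in counts.most_common()]


# Hypothetical sample corpus, for illustration only
sample = ["lets go home", "lets go towards the east", "we go towards the west"]
fdist_sketch = ranked_bigrams(sample)
```

Returning a list of (first, second) tuples keeps the result compatible with a loop such as `for i, j in fdist`.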

Next, I created a function that accepts each line (a sentence, i.e. one element of the list) and uses the bigrams to decide whether to correct it.

def bigram_corr(line):  # takes one line (sentence) as input
    words = line.split()  # split the line into words
    # generate 2 words at a time: words 1,2 followed by 2,3 then 3,4 and so on
    for word1, word2 in zip(words[:-1], words[1:]):
        for i, j in fdist:  # iterate over bigrams
            # if the 2nd words match and the 1st word is within an edit distance
            # of 2, replace the word with the one from the most frequent bigram
            if (word2 == j) and (jf.levenshtein_distance(word1, i) < 3):
                word1 = i  # replace
                return word1  # return word

The problem is that only a single word is returned for an entire sentence, e.g. "Lts go twards the east" is replaced by "lets". It looks like the further iterations aren't running.
The for loop over word1, word2 works this way: "Lts go" in the 1st iteration, where "Lts" should eventually be replaced by "lets", because "lets" occurs more frequently with "go";

"go twards" in the 2nd iteration;

"twards the" in the 3rd iteration, and so on.

There is a minor error which I can't figure out, please help.

Solution

Sounds like you're doing word1 = i with the expectation that this will modify the contents of words. But this won't happen. If you want to modify words, you'll have to do so directly. Use enumerate to keep track of word1's index.
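
A small sketch of the difference, using a hypothetical two-word list rather than the asker's data: assigning to the loop variable only rebinds a local name, while assigning through an index changes the list itself.

```python
words = ["Lts", "go"]

for word in words:
    word = "lets"  # rebinds the local name only; the list is untouched
after_rebinding = list(words)

for idx, word in enumerate(words):
    if word == "Lts":
        words[idx] = "lets"  # writes into the list at position idx
after_index_assignment = list(words)
```

After the first loop the list still holds "Lts"; only the second loop changes it.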

As 2rs2ts pointed out, you're returning early. If you want the inner loop to terminate once you find the first good replacement, break instead of returning. Then return at the end of the function.

def bigram_corr(line):  # takes one line (sentence) as input
    words = line.split()  # split the line into words
    for idx, (word1, word2) in enumerate(zip(words[:-1], words[1:])):
        for i, j in fdist:  # iterate over bigrams
            # if the 2nd words match and the 1st word is within an edit distance
            # of 2, replace the word with the one from the most frequent bigram
            if (word2 == j) and (jf.levenshtein_distance(word1, i) < 3):
                words[idx] = i
                break
    return " ".join(words)
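
For trying the approach without any third-party dependency, here is a self-contained sketch. It makes several assumptions that are not in the original post: the plain-Python levenshtein helper stands in for jf.levenshtein_distance, the ranked bigram list is passed as a parameter instead of being read from a global fdist, and the ranked data is invented for the demo.

```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def bigram_corr_sketch(line, ranked):
    """Correct the first word of each bigram using a frequency-ranked list."""
    words = line.split()
    for idx, (word1, word2) in enumerate(zip(words[:-1], words[1:])):
        for first, second in ranked:  # most frequent bigram first
            if word2 == second and levenshtein(word1, first) < 3:
                words[idx] = first  # modify the list, not the loop variable
                break               # stop at the first (most frequent) match
    return " ".join(words)


# Hypothetical ranked bigrams, most frequent first
ranked = [("lets", "go"), ("towards", "the"), ("go", "towards"), ("the", "east")]
corrected = bigram_corr_sketch("Lts go twards the east", ranked)
```

With this sample data, corrected is "lets go towards the east". The pairs come from slices taken before any replacement, so the pair ("go", "twards") finds no match, and "twards" is only fixed when it appears as the first word of ("twards", "the").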
