Error when stripping punctuation from corpus


Problem Description

Thank you in advance for your help. I'm trying to write a script that will look at a corpus, find all trigrams, and print them along with their relative frequencies to a csv file. I have gotten pretty far but keep running into one problem. The tokenizer treats contractions as two words because of the apostrophe, so it splits doesn't into doesn t, which messes up the trigram count. I am trying to solve that by removing all punctuation from the raw variable, which I believe is just one long string containing all of the text from my corpus, with this line:

    raw = raw.translate(None, string.punctuation)

But that gives me an error that says: NameError: name 'string' is not defined

But I didn't think string had to be defined to be used like that. Does that mean raw is not a string? How can I solve this?

#imports implied by the rest of the script; note that `string` itself is
#never imported, which is exactly what triggers the NameError
import csv
import nltk
from decimal import Decimal
from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

#this imports the text files in the folder into corpus called speeches
corpus_root = '/Users/root'
speeches = PlaintextCorpusReader(corpus_root, '.*\.txt')
print "Finished importing corpus"
tokenizer = RegexpTokenizer(r'\w+')
raw = speeches.raw().lower()
raw = raw.translate(None, string.punctuation)  #raises NameError: `string` is not imported
finalwords = raw.encode('ascii', 'xmlcharrefreplace')
tokens = tokenizer.tokenize(finalwords)
tgs = nltk.trigrams(tokens)
fdist = nltk.FreqDist(tgs)
minscore = 40
numwords = len(finalwords)  #note: this counts characters, not words
print "Words in corpus:"
print numwords
c = csv.writer(open("TPNngrams.csv", "wb"))
for k, v in fdist.items():
    if v > minscore:
        rf = Decimal(v)/Decimal(numwords)
        firstword, secondword, thirdword = k
        trigram = firstword + " " + secondword + " " + thirdword
        results = trigram, v, rf
        c.writerow(results)
        print firstword, secondword, thirdword, v, rf

print "All done."

Recommended Answer

Another option, if you want to keep the apostrophes in the words:

You don't necessarily have to split the words at the apostrophes. Just try changing the regular expression in your tokenizer to include apostrophes. Instead of:

tokenizer = RegexpTokenizer(r'\w+')

Try:

tokenizer = RegexpTokenizer(r"[\w']+")

Or take a look at this response here; it might be better:

Regular expression to match words and words with apostrophes
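
As a quick sanity check of the corrected pattern (the sample sentence below is just for illustration):

    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r"[\w']+")
    print tokenizer.tokenize("it doesn't split doesn't anymore")
    #prints: ['it', "doesn't", 'split', "doesn't", 'anymore']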
