Error when stripping punctuation from corpus
Question
Thank you in advance for your help. I'm trying to write a script that will look at a corpus, find all trigrams and print them along with their relative frequencies into a csv file. I have gotten pretty far but keep running into one problem. It treats contractions as two words because of the apostrophe, so it splits doesn't into doesn t, which messes up the trigram count. I am trying to solve that problem by removing all punctuation from the raw variable, which I believe is just one long string containing all of the text from my corpus, with this line:
raw = raw.translate(None, string.punctuation)
But that gives me an error that says: NameError: name 'string' is not defined
But I didn't think string had to be defined when used like that? Does that mean raw is not a string? How can I solve this?
# this imports the text files in the folder into a corpus called speeches
corpus_root = '/Users/root'
speeches = PlaintextCorpusReader(corpus_root, '.*\.txt')
print "Finished importing corpus"

tokenizer = RegexpTokenizer(r'\w+')
raw = speeches.raw().lower()
raw = raw.translate(None, string.punctuation)
finalwords = raw.encode['ascii','xmlcharrefreplace']
tokens = tokenizer.tokenize(finalwords)
tgs = nltk.trigrams(tokens)
fdist = nltk.FreqDist(tgs)

minscore = 40
numwords = len(finalwords)
print "Words in corpus:"
print numwords

c = csv.writer(open("TPNngrams.csv", "wb"))
for k,v in fdist.items():
    if v > minscore:
        rf = Decimal(v)/Decimal(numwords)
        firstword, secondword, thirdword = k
        trigram = firstword + " " + secondword + " " + thirdword
        results = trigram,v,rf
        c.writerow(results)
        print firstword, secondword, thirdword, v, rf
print "All done."
Answer
Another option, if you want to keep the apostrophes in the words: you don't necessarily have to split them out. Just try changing the regular expression on your tokenizer to include apostrophes:
tokenizer = RegexpTokenizer(r'\w+')
Try:
tokenizer = RegexpTokenizer(r"[\w']+")
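The effect of the widened pattern can be checked with the standard-library re module alone, since RegexpTokenizer essentially applies a findall with its pattern (a minimal illustration; the sample string is made up):

```python
import re

text = "it doesn't split the contraction"

# \w does not match the apostrophe, so doesn't breaks into two tokens
print(re.findall(r'\w+', text))
# ['it', 'doesn', 't', 'split', 'the', 'contraction']

# a character class that includes the apostrophe keeps doesn't whole
print(re.findall(r"[\w']+", text))
# ['it', "doesn't", 'split', 'the', 'contraction']
```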
Or take a look at this response here; it might be better:
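As for the NameError itself: string is a standard-library module, so string.punctuation only works after an import string at the top of the script. Note also that raw.translate(None, string.punctuation) is the Python 2 form of str.translate; under Python 3 the deletions go through str.maketrans instead (a minimal sketch, with a made-up sample string):

```python
import string

raw = "Doesn't this work? Yes, it does!"

# Python 3: build a table that maps every punctuation character
# (apostrophe included) to None, then strip them all in one pass
table = str.maketrans('', '', string.punctuation)
print(raw.lower().translate(table))  # -> doesnt this work yes it does
```

Be aware that this also removes the apostrophes, turning doesn't into doesnt, which is exactly why adjusting the tokenizer pattern is the gentler fix if you want contractions kept intact.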