NLP process for combining common collocations
Question
I have a corpus that I'm using the tm package on in R (and also mirroring the same script in NLTK in Python). I'm working with unigrams, but would like a parser of some kind to combine commonly co-located words as if they were one word. That is, I'd like to stop seeing "New" and "York" separately in my data set when they occur together, and instead see this particular pair represented as "New York" as if it were a single word, alongside other unigrams.
What is this process called, of transforming meaningful, common n-grams onto the same footing as unigrams? Is it not a thing? Finally, what would the tm_map call look like for this?
mydata.corpus <- tm_map(mydata.corpus, fancyfunction)
And/or in Python?
Answer
I recently had a similar question and played around with collocations.
This was the solution I chose to identify pairs of collocated words:
import nltk
from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder

text = <a long text read in as a string>

tokenized_text = word_tokenize(text)

# BigramAssocMeasures takes no arguments; the tokens go to the finder instead
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokenized_text)

# Score each bigram by raw frequency, then sort with the highest-scoring first
scored = finder.score_ngrams(bigram_measures.raw_freq)
scored = sorted(scored, key=lambda s: s[1], reverse=True)
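Once you have the high-scoring pairs, you still need to merge them back into single tokens so they sit alongside the other unigrams, which is what the question asks for. A minimal sketch using NLTK's MWETokenizer (the collocation list here is hard-coded for illustration; in practice you would feed it the top-scoring pairs from score_ngrams):

```python
from nltk.tokenize import MWETokenizer

# Example pairs standing in for the top-scoring collocations found above
collocations = [("New", "York"), ("ice", "cream")]
tokenizer = MWETokenizer(collocations, separator="_")

# Re-tokenize: matched pairs become single tokens joined by the separator
tokens = ["I", "love", "New", "York", "and", "ice", "cream"]
merged = tokenizer.tokenize(tokens)
# merged == ["I", "love", "New_York", "and", "ice_cream"]
```

With the pairs joined by an underscore, "New_York" now counts as one word in any downstream unigram analysis.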