What is the difference between a corpus and a lexicon in NLTK (Python)?

Question

Can someone tell me the difference between corpora, a corpus, and a lexicon in NLTK?

What is the movie dataset?

What is WordNet?

Recommended answer

Corpora is the plural of corpus.

Corpus basically means a body, and in the context of Natural Language Processing (NLP) it means a body of text (source: https://www.google.com.sg/search?q=corpora).

A lexicon is a vocabulary, a list of words, a dictionary (source: https://www.google.com.sg/search?q=lexicon).

In NLTK, any lexicon is treated as a corpus, since a list of words is also a body of text. For example, a list of stopwords can be found in the NLTK corpus API:

>>> from nltk.corpus import stopwords
>>> # Output from the NLTK release used in the original answer; the list may differ in other releases.
>>> print(stopwords.words('english'))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
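
As a small illustration of treating a lexicon as a plain word list, here is a minimal sketch that filters stopwords out of a token list. The function name `remove_stopwords` and the sample sentence are made up for illustration, and the sample stopwords are a hand-picked subset of the list printed above. With NLTK and its stopwords data installed, the result of `stopwords.words('english')` could be passed in instead.

```python
def remove_stopwords(tokens, stopword_list):
    """Return the tokens that are not in the stopword list (case-insensitive)."""
    stop_set = {word.lower() for word in stopword_list}  # set gives O(1) lookups
    return [token for token in tokens if token.lower() not in stop_set]


# Hypothetical sample: a small subset of the English stopword list shown above.
SAMPLE_STOPWORDS = ['i', 'me', 'my', 'the', 'a', 'an', 'and', 'of', 'is', 'in', 'on']

tokens = "The cat is on the mat".split()
filtered = remove_stopwords(tokens, SAMPLE_STOPWORDS)
print(filtered)  # ['cat', 'mat']
```

Converting the list to a set first is only a convenience for speed; the lexicon itself is still just a list of words.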


The movie review dataset in NLTK (canonically known as the Movie Reviews Corpus) is a text dataset of 2,000 movie reviews labeled with sentiment polarity (source: http://www.nltk.org/book/ch02.html).

It is often used in tutorials that introduce NLP and sentiment analysis; see http://www.nltk.org/book/ch06.html and the related question "nltk NaiveBayesClassifier training for sentiment analysis".
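
To give a rough idea of what such tutorials do with the reviews, here is a hedged sketch of the feature-extraction step: each tokenized review becomes a dict of word-presence features paired with its label. The helper name `document_features`, the toy reviews, and the vocabulary are hypothetical stand-ins; in a real run the tokens would come from `movie_reviews.words(fileid)` for each file in `movie_reviews.fileids(category)`.

```python
def document_features(document_words, vocabulary):
    """Map a tokenized review to {'contains(word)': bool} features."""
    present = {word.lower() for word in document_words}
    return {"contains({})".format(word): (word in present) for word in vocabulary}


# Hypothetical toy data standing in for the Movie Reviews Corpus.
toy_reviews = [
    ("a great and moving film".split(), "pos"),
    ("a dull and boring plot".split(), "neg"),
]
vocabulary = ["great", "boring", "film"]

# A list of (features, label) pairs, which is the input shape that
# nltk.NaiveBayesClassifier.train() accepts.
train_set = [(document_features(words, vocabulary), label)
             for words, label in toy_reviews]

for features, label in train_set:
    print(label, features)
```

Two reviews are far too few to train anything meaningful; the point is only to show the shape of the data before it is handed to a classifier.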

WordNet is a lexical database for the English language (it's like a lexicon/dictionary with word-to-word relations) (source: https://wordnet.princeton.edu/).

In NLTK, it incorporates the Open Multilingual WordNet (http://compling.hss.ntu.edu.sg/omw/), which lets you query words in other languages.

Since it is also a list of words (in this case with many other things included, such as relations, lemmas, and POS tags), it is also accessed through nltk.corpus in NLTK.

The canonical idiom for using WordNet in NLTK is:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]


The easiest way to learn the NLP jargon and the basics is to go through the tutorials in the NLTK book: http://www.nltk.org/book/
