NLTK words corpus does not contain "okay"?


Question

Does the NLTK words corpus not contain "okay", "ok", or "Okay"?

>>> from nltk.corpus import words
>>> words.words().__contains__("check")
True

>>> words.words().__contains__("okay")
False

>>> len(words.words())
236736

Any ideas?

Recommended answer

TL;DR

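from nltk.corpus import words
from nltk.corpus import wordnet

# wordnet.words() yields lemma names lazily in current NLTK releases,
# so convert it to a list before concatenating.
manywords = words.words() + list(wordnet.words())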



In long

According to the docs, nltk.corpus.words is a list of words taken from http://en.wikipedia.org/wiki/Words_(Unix)

On Unix, you can list the dictionary directory:

ls /usr/share/dict/

And read the README:

$ cd /usr/share/dict/
/usr/share/dict$ cat README
#   @(#)README  8.1 (Berkeley) 6/5/93
# $FreeBSD$

WEB ---- (introduction provided by jaw@riacs) -------------------------

Welcome to web2 (Webster's Second International) all 234,936 words worth.
The 1934 copyright has lapsed, according to the supplier.  The
supplemental 'web2a' list contains hyphenated terms as well as assorted
noun and adverbial phrases.  The wordlist makes a dandy 'grep' victim.

     -- James A. Woods    {ihnp4,hplabs}!ames!jaw    (or jaw@riacs)

Country names are stored in the file /usr/share/misc/iso3166.


FreeBSD Maintenance Notes ---------------------------------------------

Note that FreeBSD is not maintaining a historical document, we're
maintaining a list of current [American] English spellings.

A few words have been removed because their spellings have depreciated.
This list of words includes:
    corelation (and its derivatives)    "correlation" is the preferred spelling
    freen               typographical error in original file
    freend              archaic spelling no longer in use;
                    masks common typo in modern text

--

A list of technical terms has been added in the file 'freebsd'.  This
word list contains FreeBSD/Unix lexicon that is used by the system
documentation.  It makes a great ispell(1) personal dictionary to
supplement the standard English language dictionary.

Since it's a fixed list of 234,936 words, there are bound to be words that are not in it.

If you need to extend your word list, you can add the words from WordNet using nltk.corpus.wordnet.words().
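As a rough sketch of that idea, the word lists can be merged into a set instead of one long list, which removes duplicates and makes membership tests fast. The helper name build_vocabulary and the tiny stand-in lists are hypothetical and are not part of NLTK; lowercasing is an assumption made here so that "Okay" and "okay" are treated as the same entry.

```python
from itertools import chain


def build_vocabulary(*sources, lowercase=True):
    """Merge any number of word iterables (lists or iterators) into one frozenset."""
    merged = chain.from_iterable(sources)
    if lowercase:
        return frozenset(word.lower() for word in merged)
    return frozenset(merged)


# Tiny stand-ins for words.words() and wordnet.words(); not real corpus data.
unix_like = ["check", "Okapi"]
wordnet_like = iter(["okay", "ok", "check"])

vocab = build_vocabulary(unix_like, wordnet_like)
print("okay" in vocab)
print(len(vocab))
```

With NLTK installed and both corpora downloaded, the same helper could be called as build_vocabulary(words.words(), wordnet.words()). Whether a particular word ends up in the result depends on the corpus versions you have. Note that WordNet writes multiword lemmas with underscores, so those entries will not match whitespace-separated tokens directly.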

Most likely, all you need is a large enough corpus of text, e.g. a Wikipedia dump: tokenize it and extract all the unique words.
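A minimal sketch of that last step, assuming the text has already been extracted to plain lines. The regex tokenizer below is a simple stand-in chosen so the example has no dependencies (a real pipeline might use an NLTK tokenizer instead), and count_words and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Letters, optionally joined by apostrophes (e.g. "that's"); a deliberately simple rule.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def count_words(lines):
    """Count lowercased tokens across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(token.lower() for token in TOKEN_RE.findall(line))
    return counts


# Stand-in for lines read from a large text dump.
sample = ["Okay, that's ok.", "Is it OK? Okay!"]
counts = count_words(sample)
vocabulary = set(counts)  # all unique words seen in the text

print(sorted(vocabulary))
```

Keeping the counts rather than only the set makes it possible to drop very rare tokens later, since these are often typos in a large crawl.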
