在NLTK中使用我自己的语料库而不是movie_reviews语料库进行分类 [英] Using my own corpus instead of movie_reviews corpus for Classification in NLTK
问题描述
我使用以下代码,并通过使用电影评论进行分类NLTK/Python中的语料库
import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
输出:
0.655
Most Informative Features
bad = True neg : pos = 2.0 : 1.0
script = True neg : pos = 1.5 : 1.0
world = True pos : neg = 1.5 : 1.0
nothing = True neg : pos = 1.5 : 1.0
bad = False pos : neg = 1.5 : 1.0
我想在nltk中创建自己的文件夹而不是movie_reviews
,并将我自己的文件放入其中.
如果数据的结构与NLTK中的movie_review
语料库完全相同,则有两种方法可以破解"您的方法:>
1.将您的语料库目录放入保存nltk.data
首先检查您的nltk.data
保存在哪里:
>>> import nltk
>>> nltk.data.find('corpora/movie_reviews')
FileSystemPathPointer(u'/home/alvas/nltk_data/corpora/movie_reviews')
然后将目录移动到保存nltk_data/corpora
的位置:
# Let's make a test corpus like `nltk.corpus.movie_reviews`
~$ mkdir my_movie_reviews
~$ mkdir my_movie_reviews/pos
~$ mkdir my_movie_reviews/neg
~$ echo "This is a great restaurant." > my_movie_reviews/pos/1.txt
~$ echo "Had a great time at chez jerome." > my_movie_reviews/pos/2.txt
~$ echo "Food fit for the ****" > my_movie_reviews/neg/1.txt
~$ echo "Slow service." > my_movie_reviews/neg/2.txt
~$ echo "README please" > my_movie_reviews/README
# Move it to `nltk_data/corpora/`
~$ mv my_movie_reviews/ nltk_data/corpora/
在您的python代码中:
>>> import string
>>> from nltk.corpus import LazyCorpusLoader, CategorizedPlaintextCorpusReader
>>> from nltk.corpus import stopwords
>>> my_movie_reviews = LazyCorpusLoader('my_movie_reviews', CategorizedPlaintextCorpusReader, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
>>> mr = my_movie_reviews
>>>
>>> stop = stopwords.words('english')
>>> documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
>>> for i in documents:
... print i
...
([u'Food', u'fit', u'****'], u'neg')
([u'Slow', u'service'], u'neg')
([u'great', u'restaurant'], u'pos')
([u'great', u'time', u'chez', u'jerome'], u'pos')
(有关更多详细信息,请参见 https://github.com/nltk/nltk/blob/develop/nltk/corpus/util.py#L21 和中创建了类似的问题在NLTK和Python 和中使用Python NLTK中用于类别分类的语料库
这是可以使用的完整代码:
import string
from itertools import chain
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.corpus import CategorizedPlaintextCorpusReader
import nltk
mydir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
I use following code and I get it form Classification using movie review corpus in NLTK/Python
import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
output:
0.655
Most Informative Features
bad = True neg : pos = 2.0 : 1.0
script = True neg : pos = 1.5 : 1.0
world = True pos : neg = 1.5 : 1.0
nothing = True neg : pos = 1.5 : 1.0
bad = False pos : neg = 1.5 : 1.0
I want to create my own folder instead of movie_reviews
in nltk, and put my own files inside it.
If you have you data in exactly the same structure as the movie_review
corpus in NLTK, there are two ways to "hack" your way through:
1. Put your corpus directory into where you save the nltk.data
First check where is your nltk.data
saved:
>>> import nltk
>>> nltk.data.find('corpora/movie_reviews')
FileSystemPathPointer(u'/home/alvas/nltk_data/corpora/movie_reviews')
Then move your directory to where the location where nltk_data/corpora
is saved:
# Let's make a test corpus like `nltk.corpus.movie_reviews`
~$ mkdir my_movie_reviews
~$ mkdir my_movie_reviews/pos
~$ mkdir my_movie_reviews/neg
~$ echo "This is a great restaurant." > my_movie_reviews/pos/1.txt
~$ echo "Had a great time at chez jerome." > my_movie_reviews/pos/2.txt
~$ echo "Food fit for the ****" > my_movie_reviews/neg/1.txt
~$ echo "Slow service." > my_movie_reviews/neg/2.txt
~$ echo "README please" > my_movie_reviews/README
# Move it to `nltk_data/corpora/`
~$ mv my_movie_reviews/ nltk_data/corpora/
In your python code:
>>> import string
>>> from nltk.corpus import LazyCorpusLoader, CategorizedPlaintextCorpusReader
>>> from nltk.corpus import stopwords
>>> my_movie_reviews = LazyCorpusLoader('my_movie_reviews', CategorizedPlaintextCorpusReader, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
>>> mr = my_movie_reviews
>>>
>>> stop = stopwords.words('english')
>>> documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
>>> for i in documents:
... print i
...
([u'Food', u'fit', u'****'], u'neg')
([u'Slow', u'service'], u'neg')
([u'great', u'restaurant'], u'pos')
([u'great', u'time', u'chez', u'jerome'], u'pos')
(For more details, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/util.py#L21 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/init.py#L144)
2. Create your own CategorizedPlaintextCorpusReader
If you have no access to nltk.data
directory and you want to use your own corpus, try this:
# Let's say that your corpus is saved on `/home/alvas/my_movie_reviews/`
>>> import string; from nltk.corpus import stopwords
>>> from nltk.corpus import CategorizedPlaintextCorpusReader
>>> mr = CategorizedPlaintextCorpusReader('/home/alvas/my_movie_reviews', r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
>>> stop = stopwords.words('english')
>>> documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
>>>
>>> for doc in documents:
... print doc
...
([u'Food', u'fit', u'****'], 'neg')
([u'Slow', u'service'], 'neg')
([u'great', u'restaurant'], 'pos')
([u'great', u'time', u'chez', u'jerome'], 'pos')
Similar questions has been asked on Creating a custom categorized corpus in NLTK and Python and Using my own corpus for category classification in Python NLTK
Here's the full code that will work:
import string
from itertools import chain
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.corpus import CategorizedPlaintextCorpusReader
import nltk
mydir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
这篇关于在NLTK中使用我自己的语料库而不是movie_reviews语料库进行分类的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!