Efficient Term Document Matrix with NLTK
Question
I am trying to create a term document matrix with NLTK and pandas. I wrote the following function:
def fnDTM_Corpus(xCorpus):
    '''create a Term Document Matrix from an NLTK corpus'''
    import pandas as pd
    fd_list = []
    for fileid in xCorpus.fileids():
        fd_list.append(nltk.FreqDist(xCorpus.words(fileid)))
    DTM = pd.DataFrame(fd_list, index=xCorpus.fileids())
    DTM.fillna(0, inplace=True)
    return DTM.T
To run it:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Data/'
newcorpus = PlaintextCorpusReader(corpus_root, '.*')
x = fnDTM_Corpus(newcorpus)
It works well for a few small files in the corpus, but gives me a MemoryError when I try to run it with a corpus of 4,000 files (of about 2 kB each).
Am I missing something?
I am using 32-bit Python (on Windows 7, 64-bit OS, Quad Core CPU, 8 GB RAM). Do I really need to use 64-bit Python for a corpus of this size?
Answer
Thanks to Radim and Larsmans. My objective was to have a DTM like the one you get in R's tm package. I decided to use scikit-learn, partly inspired by this blog entry. This is the code I came up with.
I am posting it here in the hope that others will find it useful.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def fn_tdm_df(docs, xColNames=None, **kwargs):
    '''create a term document matrix as a pandas DataFrame;
    with **kwargs you can pass arguments of CountVectorizer;
    if xColNames is given, the DataFrame gets column names'''
    # initialize the vectorizer
    vectorizer = CountVectorizer(**kwargs)
    x1 = vectorizer.fit_transform(docs)
    # create the DataFrame (terms as rows, documents as columns)
    df = pd.DataFrame(x1.toarray().transpose(),
                      index=vectorizer.get_feature_names())
    if xColNames is not None:
        df.columns = xColNames
    return df
To use it on a list of texts in a directory:
DIR = 'C:/Data/'

def fn_CorpusFromDIR(xDIR):
    '''create a corpus from a directory
    Input:  directory path
    Output: a dictionary with
            the file-derived column names ['ColNames']
            the text of each file in the corpus ['docs']'''
    import os
    files = os.listdir(xDIR)  # list the directory once so docs and names stay aligned
    Res = dict(docs=[open(os.path.join(xDIR, f)).read() for f in files],
               ColNames=['P_' + f[0:6] for f in files])  # a list, not map(), for Python 3
    return Res
Create the DataFrame:
d1 = fn_tdm_df(docs=fn_CorpusFromDIR(DIR)['docs'],
               xColNames=fn_CorpusFromDIR(DIR)['ColNames'],
               stop_words=None, charset_error='replace')
               # note: newer scikit-learn versions renamed charset_error to decode_error