Exactly replicating R text preprocessing in Python
Question
I would like to preprocess a corpus of documents using Python in the same way that I can in R. For example, given an initial corpus, corpus, I would like to end up with a preprocessed corpus that corresponds to the one produced using the following R code:
library(tm)
library(SnowballC)
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("myword", stopwords("english")))
corpus = tm_map(corpus, stemDocument)
Is there a simple or straightforward, preferably pre-built, method of doing this in Python? Is there a way to ensure exactly the same results?
For example, I would like to preprocess
@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!
into
ear pod amaz best sound inear headphon ive ever
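(For reference, a rough pure-Python attempt with NLTK might look like the sketch below. This is only an approximation: NLTK's stopword list, tokenization, and Snowball stemmer are close to, but not identical to, tm/SnowballC, so the output is not guaranteed to match R across a whole corpus.)

import re
from nltk.corpus import stopwords        # requires nltk.download("stopwords")
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
stop = set(stopwords.words("english")) | {"apple"}

def preprocess(text):
    # Mirror the tm pipeline: lowercase, strip punctuation, drop stopwords, stem
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(stemmer.stem(w) for w in text.split() if w not in stop)

print(preprocess("@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!"))
# -> "ear pod amaz best sound inear headphon ive ever" for this tweet,
#    but not guaranteed identical to tm on every document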
Answer
It seems tricky to get things exactly the same between nltk and tm on the preprocessing steps, so I think the best approach is to use rpy2 to run the preprocessing in R and pull the results into Python:
import rpy2.robjects as ro
# Run the tm preprocessing in R and pull each document back as a Python string
preproc = [x[0] for x in ro.r('''
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)''')]
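As a quick sanity check on the round-trip (illustrative only; the actual first element depends on the row order of tweets.csv):

print(preproc[0])
# e.g. "ear pod amaz best sound inear headphon ive ever" if the @Apple tweet comes first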
Then you can load it into scikit-learn. The only thing you'll need to do to get things to match between the CountVectorizer and the DocumentTermMatrix is to remove terms of length less than 3: tm's DocumentTermMatrix keeps only words of at least three characters by default (wordLengths = c(3, Inf)), which is what the custom tokenizer below reproduces:
from sklearn.feature_extraction.text import CountVectorizer
def mytokenizer(x):
    return [y for y in x.split() if len(y) > 2]
# Full document-term matrix
cv = CountVectorizer(tokenizer=mytokenizer)
X = cv.fit_transform(preproc)
X
# <1181x3289 sparse matrix of type '<type 'numpy.int64'>'
# with 8980 stored elements in Compressed Sparse Column format>
# Sparse terms removed
cv2 = CountVectorizer(tokenizer=mytokenizer, min_df=0.005)
X2 = cv2.fit_transform(preproc)
X2
# <1181x309 sparse matrix of type '<type 'numpy.int64'>'
# with 4669 stored elements in Compressed Sparse Column format>
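If you want to inspect which terms survived the frequency cutoff, the fitted vectorizer's vocabulary_ attribute maps each term to its column index (a small sketch using the objects defined above):

# Terms kept by min_df=0.005, in column order
terms = sorted(cv2.vocabulary_, key=cv2.vocabulary_.get)
print(len(terms))  # 309 -- should line up with the R sparse matrix below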
Let's verify that this matches the R output. Note that min_df=0.005 in the CountVectorizer plays the same role as removeSparseTerms(dtm, 0.995) below: both keep only terms appearing in at least roughly 0.5% of the documents:
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)
dtm = DocumentTermMatrix(corpus)
dtm
# A document-term matrix (1181 documents, 3289 terms)
#
# Non-/sparse entries: 8980/3875329
# Sparsity : 100%
# Maximal term length: 115
# Weighting : term frequency (tf)
sparse = removeSparseTerms(dtm, 0.995)
sparse
# A document-term matrix (1181 documents, 309 terms)
#
# Non-/sparse entries: 4669/360260
# Sparsity : 99%
# Maximal term length: 20
# Weighting : term frequency (tf)
As you can see, the number of stored elements and terms exactly match between the two approaches now.
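For a stronger check than matching dimensions, you can compare the vocabularies themselves. The sketch below assumes the rpy2 session from the first snippet is still open, so corpus still exists in R's global environment:

# Pull the term list from R and compare it with the sklearn vocabulary
r_terms = set(ro.r('''
dtm = DocumentTermMatrix(corpus)
Terms(removeSparseTerms(dtm, 0.995))
'''))
py_terms = set(cv2.vocabulary_)
print(r_terms == py_terms)   # True if the two pipelines agree term-for-term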