Exactly replicating R text preprocessing in Python


Question


I would like to preprocess a corpus of documents using Python in the same way that I can in R. For example, given an initial corpus, corpus, I would like to end up with a preprocessed corpus that corresponds to the one produced using the following R code:

library(tm)
library(SnowballC)

corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("myword", stopwords("english")))
corpus = tm_map(corpus, stemDocument)


Is there a simple or straightforward — preferably pre-built — method of doing this in Python? Is there a way to ensure exactly the same results?

For example, I would like to turn

@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!

into

ear pod amaz best sound inear headphon ive ever
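For comparison, the closest pure-Python pipeline would be something along these lines with NLTK. This is a rough sketch only: NLTK's stopword list, punctuation handling, and stemmer do not line up exactly with tm's, which is exactly the problem being asked about (raw_docs is a stand-in name for a list of document strings):

import re
from nltk.corpus import stopwords       # requires nltk.download('stopwords')
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
stop = set(stopwords.words("english")) | {"myword"}

def preprocess(doc):
    doc = doc.lower()
    doc = re.sub(r"[^\w\s]", "", doc)   # crude analogue of removePunctuation
    return " ".join(stemmer.stem(w) for w in doc.split() if w not in stop)

docs = [preprocess(d) for d in raw_docs]  # raw_docs: hypothetical input list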

Answer


It seems tricky to get things exactly the same between nltk and tm on the preprocessing steps, so I think the best approach is to use rpy2 to run the preprocessing in R and pull the results into Python:

import rpy2.robjects as ro

# Run the tm preprocessing in R; iterating over the returned corpus
# yields one preprocessed document per element.
preproc = [x[0] for x in ro.r('''
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)''')]
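As a quick sanity check, assuming tweets.csv has the @Apple tweet from the example above as its first row:

print(preproc[0])
# 'ear pod amaz best sound inear headphon ive ever'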


Then, you can load it into scikit-learn -- the only thing you'll need to do to get things to match between the CountVectorizer and the DocumentTermMatrix is to remove terms of length less than 3:

from sklearn.feature_extraction.text import CountVectorizer
def mytokenizer(x):
    return [y for y in x.split() if len(y) > 2]

# Full document-term matrix
cv = CountVectorizer(tokenizer=mytokenizer)
X = cv.fit_transform(preproc)
X
# <1181x3289 sparse matrix of type '<type 'numpy.int64'>'
#   with 8980 stored elements in Compressed Sparse Column format>

# Sparse terms removed
cv2 = CountVectorizer(tokenizer=mytokenizer, min_df=0.005)
X2 = cv2.fit_transform(preproc)
X2
# <1181x309 sparse matrix of type '<type 'numpy.int64'>'
#   with 4669 stored elements in Compressed Sparse Column format>
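If you want to inspect the vocabulary that survived the cutoff, the fitted vectorizer exposes it directly. Note the method name depends on your scikit-learn version:

# List the terms kept after the min_df cutoff
terms = cv2.get_feature_names_out()  # older scikit-learn: cv2.get_feature_names()
print(len(terms))   # 309
print(terms[:5])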


Let's verify this matches with R:

tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)
dtm = DocumentTermMatrix(corpus)
dtm
# A document-term matrix (1181 documents, 3289 terms)
# 
# Non-/sparse entries: 8980/3875329
# Sparsity           : 100%
# Maximal term length: 115 
# Weighting          : term frequency (tf)

sparse = removeSparseTerms(dtm, 0.995)
sparse
# A document-term matrix (1181 documents, 309 terms)
# 
# Non-/sparse entries: 4669/360260
# Sparsity           : 99%
# Maximal term length: 20 
# Weighting          : term frequency (tf)
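A minimal sketch of automating that comparison from the Python side, assuming it runs in the same rpy2 session as above (so that corpus is still defined in R):

# Rebuild both matrices inside the embedded R session, then compare
# dimensions and stored-entry counts against the scipy matrix.
ro.r('dtm = DocumentTermMatrix(corpus); sparse = removeSparseTerms(dtm, 0.995)')
assert tuple(ro.r('dim(sparse)')) == X2.shape        # (1181, 309)
assert int(ro.r('length(sparse$v)')[0]) == X2.nnz    # 4669 stored entries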


As you can see, the number of stored elements and terms exactly match between the two approaches now.

