Removing non-English text from Corpus in R using tm()

Question

I am using the tm and wordcloud packages for some basic data mining in R, but I am running into difficulties because there are non-English characters in my dataset (even though I've tried to filter out other languages based on background variables).

Let's say that some of the lines in my TXT file (saved as UTF-8 in TextWrangler) look like this:

Special
satisfação
Happy
Sad
Potential für

I then read my txt file into R:

words <- Corpus(DirSource("~/temp", encoding = "UTF-8"),readerControl = list(language = "lat"))

This produces a warning message:

Warning message:
In readLines(y, encoding = x$Encoding) :
  incomplete final line found on '/temp/file.txt'

But since it's a warning, not an error, I continue to push forward.
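Note: this warning usually only means that the file does not end with a newline character, and the lines are still read. A minimal sketch of that behaviour, using a hypothetical temporary file in place of the asker's file.txt:

```r
# Hypothetical temp file standing in for the asker's file.txt
tmp <- tempfile(fileext = ".txt")

# cat() does not append a trailing newline, so the last line is "incomplete"
cat("Happy\nSad", file = tmp)

# Capture the warning text instead of letting it print
msg <- tryCatch(readLines(tmp), warning = function(w) conditionMessage(w))

# warn = FALSE silences the warning; both lines are still returned
lines_read <- readLines(tmp, warn = FALSE)

unlink(tmp)
```

Adding a final newline to the file in the text editor should also make the warning go away.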

words <- tm_map(words, stripWhitespace)
words <- tm_map(words, tolower)

This produces an error:

Error in FUN(X[[1L]], ...) : invalid input 'satisfa��o' in 'utf8towcs'
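Note: one plausible cause of this error (an assumption, since the original file is not available) is that the text was not actually stored as UTF-8, so R meets byte sequences that are invalid in that encoding. The sketch below uses a hypothetical vector x whose second element holds Latin-1 bytes, flags it with base R's validUTF8(), and re-encodes it with iconv():

```r
# Hypothetical input: the second word is stored as Latin-1 bytes
# (0xE7 = c-cedilla, 0xE3 = a-tilde), which is not valid UTF-8
x <- c("Happy", "satisfa\xe7\xe3o")

# validUTF8() (base R >= 3.3.0) reports which elements are valid UTF-8
ok <- validUTF8(x)

# Re-encode to UTF-8, assuming Latin-1 was the true source encoding
fixed <- iconv(x, from = "latin1", to = "UTF-8")
```

If the check flags elements in real data, it may be worth confirming the file's encoding before removing any words.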

I'm open to finding ways to filter out the non-English characters either in TextWrangler or R; whatever is the most expedient. Thanks for your help!

Recommended answer

Here's a method to remove words with non-ASCII characters before making a corpus:

# remove words with non-ASCII characters
# assuming you read your txt file in as a vector, e.g.
# dat <- readLines('~/temp/dat.txt')
dat <- "Special,  satisfação, Happy, Sad, Potential, für"
# convert string to vector of words
dat2 <- unlist(strsplit(dat, split = ", "))
# flag words with non-ASCII characters: iconv() replaces every byte it
# cannot convert to ASCII with the marker "dat2", which grepl() then finds
dat3 <- grepl("dat2", iconv(dat2, "latin1", "ASCII", sub = "dat2"))
# subset original vector of words to exclude words with non-ASCII char
# (a logical index also behaves correctly when no word is flagged,
# whereas dat2[-integer(0)] would return an empty vector)
dat4 <- dat2[!dat3]
# convert vector back to a string
dat5 <- paste(dat4, collapse = ", ")
# make corpus
library(tm)
words1 <- Corpus(VectorSource(dat5))
inspect(words1)

A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
Special, Happy, Sad, Potential
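As a variation on the answer above, the same filtering can be sketched without a marker string by checking each word's Unicode code points directly. This is only an illustration: is_ascii is a hypothetical helper name, and the words are written with \u escapes so the example does not depend on the encoding of the script file.

```r
# Same words as in the answer, already split into a vector
words <- c("Special", "satisfa\u00e7\u00e3o", "Happy", "Sad",
           "Potential", "f\u00fcr")

# Hypothetical helper: TRUE when every code point in the word is below 128
is_ascii <- function(w) {
  cp <- utf8ToInt(enc2utf8(w))
  # NA code points mean the bytes were not valid UTF-8; treat as non-ASCII
  !anyNA(cp) && all(cp < 128L)
}

# Keep only the pure-ASCII words
keep <- vapply(words, is_ascii, logical(1), USE.NAMES = FALSE)
ascii_words <- words[keep]

# Collapse back to one string, ready for VectorSource() as in the answer
ascii_text <- paste(ascii_words, collapse = ", ")
```

Like the original answer, this drops whole words rather than individual characters, so accented English loanwords would also be removed.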
