R tm package: invalid input in 'utf8towcs'
Question
I'm trying to use the tm package in R to perform some text analysis. I tried the following:
require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <- tm_map(dataSet, tolower)
Error in FUN(X[[6L]], ...) : invalid input 'RT @noXforU Erneut riesiger (Alt-)�lteppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'
The problem is that some characters are not valid. I'd like to exclude the invalid characters from the analysis, either from within R or before importing the files for processing.
I tried using iconv to convert all files to UTF-8 and to drop anything that can't be converted, as follows:
find . -type f -exec iconv -t utf-8 "{}" -c -o tmpConverted/"{}" \;
as pointed out in "Batch convert latin-1 files to utf-8 using iconv".
But I still get the same error.
I'd appreciate any help.
Recommended answer
None of the above answers worked for me. The only way I found to work around this problem was to remove all non-graphical characters, that is, everything not matched by the [:graph:] class (http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html).
The code is simple:
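library(stringr)  # str_replace_all() comes from the stringr package
usableText <- str_replace_all(tweets$text, "[^[:graph:]]", " ")  # replace each non-graphical character with a space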
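If losing characters is undesirable, another option is to re-encode the text rather than strip it. The sketch below is only an illustration under one assumption that the question does not confirm: that the offending bytes are Latin-1 (which would fit the German "Ölteppich" in the error message). The sample string and the variable names are hypothetical. It uses base R only: validUTF8() to detect the problem and iconv() with an explicit source encoding to repair it.

```r
# Hypothetical sample: "Ölteppich" with its first letter stored as the
# single Latin-1 byte 0xD6, which is not valid UTF-8 on its own.
raw <- "(Alt-)\xd6lteppich im Golf von Mexiko"

# validUTF8() (base R >= 3.3.0) reports whether a string is valid UTF-8;
# strings that fail this check are the ones that trigger 'utf8towcs' errors.
is_valid <- validUTF8(raw)

# Assumption: the bytes are Latin-1. Declaring that as the source encoding
# lets iconv() re-encode the character instead of discarding it.
fixed <- iconv(raw, from = "latin1", to = "UTF-8")

# The re-encoded string is valid UTF-8, so tolower() can process it.
lowered <- tolower(fixed)
```

With a tm corpus, the same iconv() call could presumably be applied to each document through tm_map() and content_transformer(). If the files turn out to be in a different encoding, the from argument would need to change accordingly.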