R tm package invalid input in 'utf8towcs'

Views: 38 Published: 2021/12/28 16:38:15 Tags: r utf-8 iconv text-mining


Question

I'm trying to use the tm package in R to perform some text analysis. I tried the following:

require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <- tm_map(dataSet, tolower)
Error in FUN(X[[6L]], ...) : invalid input 'RT @noXforU Erneut riesiger (Alt-)�lteppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'

The problem is that some characters are not valid. I'd like to exclude the invalid characters from the analysis, either from within R or before importing the files for processing.

I tried using iconv to convert all files to UTF-8 and to drop anything that can't be converted, as follows:

find . -type f -exec iconv -t utf-8 "{}" -c -o tmpConverted/"{}" \;

as pointed out here: Batch convert latin-1 files to utf-8 using iconv

But I still get the same error.

I'd appreciate any help.

Recommended answer

None of the above answers worked for me. The only way to work around this problem was to remove all non-graphical characters (see the regex documentation at http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html).

The code is as simple as this:

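library(stringr)  # provides str_replace_all()
# Replace every character outside the [:graph:] class with a space
usableText <- str_replace_all(tweets$text, "[^[:graph:]]", " ")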
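
To illustrate what the pattern matches, here is a minimal sketch on made-up input. The vectors `sample_text` and `suspect` are hypothetical stand-ins (the answer's own data lived in `tweets$text`), and the `validUTF8()` check is an extra diagnostic from base R that the answer did not use. It is included only as one possible way to locate the entries that may trigger the error.

```r
library(stringr)  # str_replace_all() comes from stringr

# Hypothetical sample standing in for tweets$text from the answer.
sample_text <- c("RT @user\tfirst line\nsecond line", "plain text")

# Replace every character outside the [:graph:] class with a space.
# Tabs, newlines and ordinary spaces all fall outside that class.
usable_text <- str_replace_all(sample_text, "[^[:graph:]]", " ")
print(usable_text)

# Assumption: base R's validUTF8() is used here purely as a diagnostic.
# It reports, per element, whether the bytes form valid UTF-8.
# "\xD6" is a lone Latin-1 byte, which is not valid UTF-8 on its own.
suspect <- c("valid text", "Alt-\xD6lteppich")
print(validUTF8(suspect))
```

Because ordinary spaces are also outside `[:graph:]`, they are replaced by spaces too, so word boundaries survive the cleanup. The sample above contains only valid text. How the replacement behaves on strings with invalid bytes may depend on the data and the installed package versions, so it is worth checking the result on a few known problem entries before processing a whole corpus.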
