How to identify/delete non-UTF-8 characters in R

Views: 187 Published: 2020/7/13 2:35:28 Tags: r utf-8 stata


Question

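When I import a Stata dataset into R (using the foreign package), the import sometimes contains characters that are not valid UTF-8. This is unpleasant enough by itself, but it breaks everything as soon as I try to convert the object to JSON (using the rjson package).

How can I identify invalid UTF-8 characters in a string and then delete them?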

Solution

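Another solution uses iconv and its sub argument. According to the documentation, sub is a character string; if it is not NA (here I set it to ''), it is used to replace any non-convertible bytes in the input.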

x <- "fa\xE7ile"
Encoding(x) <- "UTF-8"
iconv(x, "UTF-8", "UTF-8", sub = '')  ## replace any invalid UTF-8 byte with ''
## [1] "faile"
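The question also asks how to identify the offending strings, not only how to strip them. A minimal sketch, assuming base R 3.3.0 or later (where validUTF8() is available); the vector x below is made-up sample data:

```r
# Sample strings; the second one holds a raw Latin-1 byte (0xE7).
x <- c("facile", "fa\xE7ile", "plain ascii")

# validUTF8() returns TRUE for each element that is valid UTF-8.
bad <- !validUTF8(x)

# Positions of the elements that need attention.
which(bad)

# Alternative check: with the default sub = NA, iconv() is documented to
# return NA for elements it cannot convert.
bad_iconv <- is.na(iconv(x, "UTF-8", "UTF-8"))
```

Knowing which elements are invalid makes it possible to inspect them before deciding whether to delete bytes or re-encode.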

Note that if we declare the right source encoding, the character is converted instead of being dropped:

x <- "fa\xE7ile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8", sub = '')
xx
## [1] "façile"
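The two approaches can be combined so that strings that are already valid UTF-8 are left untouched and only the invalid ones are re-encoded. This is a sketch under the assumption that the invalid strings are really Latin-1; that guess has to be checked against the actual data, and the sample vector is made up:

```r
# Mixed input: one valid UTF-8 string, one string with a raw Latin-1 byte.
x <- c("d\u00e9j\u00e0", "fa\xE7ile")

# Flag only the elements that are not valid UTF-8.
bad <- !validUTF8(x)

# Assumption: the flagged elements are Latin-1, so convert just those.
x[bad] <- iconv(x[bad], "latin1", "UTF-8")

# Every element should now be valid UTF-8.
all(validUTF8(x))
```

Converting only the flagged elements avoids mangling strings that were already correct, which a blanket latin1-to-UTF-8 conversion would do.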
