How to identify/delete non-UTF-8 characters in R

Question

When I import a Stata dataset into R (using the foreign package), the import sometimes contains characters that are not valid UTF-8. This is unpleasant enough by itself, but it breaks everything as soon as I try to convert the object to JSON (using the rjson package).

How can I identify invalid UTF-8 characters in a string and then delete them?

Solution

Another solution uses iconv and its sub argument: a character string that, if not NA (here I set it to ''), is used to replace any non-convertible bytes in the input.

x <- "fa\xE7ile"
Encoding(x) <- "UTF-8"
iconv(x, "UTF-8", "UTF-8", sub = '') ## replace any non-UTF-8 bytes with ''
## [1] "faile"
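Before deleting anything, it can help to see which bytes are the problem. A minimal sketch, assuming the documented special value sub = "byte" of iconv, which replaces each non-convertible byte with its hex code in angle brackets instead of dropping it. The variable name marked is only for illustration, and the exact behaviour may depend on the platform's iconv implementation.

```r
x <- "fa\xE7ile"

# sub = "byte" marks each non-convertible byte as <xx> (hex) rather than deleting it
marked <- iconv(x, "UTF-8", "UTF-8", sub = "byte")
print(marked)
```

The marked string shows exactly where the offending byte sits, which makes it easier to guess the real source encoding before choosing to delete or convert.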

Note that if we instead declare the correct source encoding, the character is converted rather than dropped:

x <- "fa\xE7ile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8", sub = '')
xx
## [1] "façile"
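The question also asks how to identify the bad strings, not only how to delete them. A rough sketch, assuming R >= 3.3.0 so that base R's validUTF8() is available. The helper strip_invalid_utf8 and the data frame df are made up for illustration and stand in for an imported Stata dataset; they are not part of the original answer.

```r
# Hypothetical helper: drop bytes that are not valid UTF-8,
# touching only the elements that actually need it.
strip_invalid_utf8 <- function(v) {
  bad <- !is.na(v) & !validUTF8(v)   # validUTF8() requires R >= 3.3.0
  v[bad] <- iconv(v[bad], "UTF-8", "UTF-8", sub = "")
  v
}

# Made-up data standing in for an imported dataset
df <- data.frame(id = 1:3,
                 word = c("facile", "fa\xE7ile", "plain"),
                 stringsAsFactors = FALSE)

# Identify: which elements of the column are not valid UTF-8?
bad_rows <- which(!validUTF8(df$word))

# Delete: clean every character column of the data frame
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], strip_invalid_utf8)
```

Only character columns are cleaned here. If the import produced factors, the same helper would have to be applied to the factor levels instead, which this sketch does not do.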
