How to identify/delete non-UTF-8 characters in R
Problem description
When I import a Stata dataset into R (using the foreign package), the import sometimes contains characters that are not valid UTF-8. This is unpleasant enough by itself, but it breaks everything as soon as I try to convert the object to JSON (using the rjson package).
How can I identify the invalid UTF-8 characters in a string and then delete them?
One solution uses iconv and its sub argument: a character string that, if not NA (here set to ''), is used to replace any non-convertible bytes in the input.
x <- "fa\xE7ile"
Encoding(x) <- "UTF-8"
iconv(x, "UTF-8", "UTF-8",sub='') ## replace any non UTF-8 by ''
[1] "faile"
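If you want to identify the offending strings before deleting anything, base R (since 3.3.0) ships validUTF8(), which checks whether each element's bytes form valid UTF-8. A minimal sketch combining it with the iconv fix above:

```r
# Flag invalid elements, then scrub only those with iconv().
x <- c("ok", "fa\xE7ile")   # second element carries a raw latin1 byte (0xE7)
validUTF8(x)                # identifies the bad element: TRUE FALSE
bad <- !validUTF8(x)
x[bad] <- iconv(x[bad], "UTF-8", "UTF-8", sub = "")
x                           # [1] "ok"    "faile"
```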
Note that we keep the accented character if we choose the right source encoding instead:
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8",sub='')
[1] "façile"
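For the original use case (a data frame imported with foreign, then passed to rjson), the same iconv fix can be applied to every character column in one pass. A sketch, assuming a hypothetical data frame df standing in for the Stata import:

```r
# Hypothetical stand-in for a data frame from foreign::read.dta()
df <- data.frame(name = "fa\xE7ile", n = 1L, stringsAsFactors = FALSE)

# Scrub invalid bytes from all character columns before JSON conversion
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], iconv,
                     from = "UTF-8", to = "UTF-8", sub = "")
df$name   # [1] "faile"
```

After this pass, rjson::toJSON(df) no longer trips over invalid bytes (at the cost of silently dropping them; use the latin1 conversion shown above if the source encoding is known).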