Removing non-ASCII characters from data files
Question
I've got a bunch of csv files that I'm reading into R and including in a package/data folder in .rdata format. Unfortunately the non-ASCII characters in the data fail the check. The tools package has two functions to check for non-ASCII characters (showNonASCII and showNonASCIIfile) but I can't seem to locate one to remove/clean them.
Before I explore other UNIX tools, it would be great to do this all in R so I can maintain a complete workflow from raw data to final product. Are there any existing packages/functions to help me get rid of the non-ASCII characters?
Recommended answer
To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:
x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
Encoding(x) <- "latin1" # (just to make sure)
x
# [1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(x, "latin1", "ASCII", sub="")
# [1] "Ekstrm" "Jreskog" "bichen Zrcher"
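Since the question is about csv data that ends up in a data frame, the same iconv() call can be applied column by column. The following is only a minimal sketch: strip_non_ascii() is a hypothetical helper name (not part of base R or tools), and it assumes the text columns are plain character vectors encoded as latin1.

```r
# Hypothetical helper (not from any package): remove non-ASCII characters
# from every character column of a data frame, leaving other columns alone.
# Assumes the input text is latin1; change `from` if your files differ.
strip_non_ascii <- function(df, from = "latin1") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], iconv, from = from, to = "ASCII", sub = "")
  df
}

# Small stand-in for data read from a csv file
dat <- data.frame(
  name = c("Ekstr\xf8m", "J\xf6reskog"),
  n = c(1, 2),
  stringsAsFactors = FALSE
)
Encoding(dat$name) <- "latin1"

clean <- strip_non_ascii(dat)
clean$name
# [1] "Ekstrm"  "Jreskog"
```

Factor columns are not touched by this sketch; if your csv files were read with stringsAsFactors = TRUE, you would need to convert those columns (or their levels) first.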
To locate non-ASCII characters, or to find if there were any at all in your files, you could likely adapt the following ideas:
## Do *any* lines contain non-ASCII characters?
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
# [1] TRUE
## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
# [1] 1 2 3
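As a possible variation on the sentinel-string idea above, iconv() returns NA for any element it cannot convert when sub is left at its default of NA, so the offending lines can be found without a marker string. This is a sketch under the same latin1 assumption; the example vector is made up for illustration.

```r
# Illustrative input: elements 1 and 3 contain non-ASCII characters
x <- c("Ekstr\xf8m", "plain ascii", "Z\xfcrcher")
Encoding(x) <- "latin1"

# With the default sub = NA, non-convertible elements come back as NA
non_ascii <- is.na(iconv(x, "latin1", "ASCII"))
which(non_ascii)
# [1] 1 3

# sub = "byte" shows each offending byte as a hex escape, which helps
# when deciding what to replace it with
iconv(x[non_ascii], "latin1", "ASCII", sub = "byte")
# [1] "Ekstr<f8>m" "Z<fc>rcher"
```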