Issue with Non-ASCII Characters in R
Question
I am loading data into R with free-text fields that contain a variety of non-ASCII/double-byte characters. Depending on the function I use to load the data or the format in which the data is stored (.csv or .xlsx), the characters appear differently.
Specifically, if I use read.csv with a .csv file or read_excel with a .xlsx file, the characters appear something like: Orientaci�n m�s.
Meanwhile, if I use read_csv with a .csv file, they appear like this: Orientaci�n m�s
Is there a file format/data loading combination that fixes this issue? Or is there some way to decode the data in either format once it is already loaded? I have explored a variety of methods, including changing the encoding arguments where applicable and the decoder package, but I can't get anything to work.
Any thoughts?
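For reference, "changing the encoding arguments" usually means something like the sketch below. It assumes a Latin-1 encoded file, which is built here from raw bytes purely for illustration; the real file's encoding may differ, and this only helps when the original bytes are still intact.

```r
# Build a small Latin-1 encoded CSV so the example is self-contained.
# 0xf3 is "ó" and 0xe1 is "á" in ISO-8859-1 / windows-1252.
path <- tempfile(fileext = ".csv")
bytes <- c(charToRaw("text\nOrientaci"), as.raw(0xf3),
           charToRaw("n m"), as.raw(0xe1), charToRaw("s\n"))
writeBin(bytes, path)

# Base R: tell read.csv which encoding the strings in the file use.
df <- read.csv(path, encoding = "latin1", stringsAsFactors = FALSE)

# Normalise to UTF-8 so comparisons do not depend on the session locale.
text_utf8 <- enc2utf8(df$text)
print(text_utf8)

# readr equivalent, if the package is installed:
# readr::read_csv(path, locale = readr::locale(encoding = "latin1"))
```

If the text still shows a placeholder after naming the correct encoding, the damage most likely happened before the file reached R.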
Per the comments below, I have tried the following:
readr::guess_encoding("file with issue.csv")
# A tibble: 2 x 2
  encoding   confidence
  <chr>           <dbl>
1 UTF-8            1
2 ISO-8859-1       0.52
readr::guess_encoding("file without issue.csv")
guess_encoding("Goal_Details.csv")
# A tibble: 2 x 2
  encoding     confidence
  <chr>             <dbl>
1 UTF-8              1
2 windows-1252       0.51
iconv(x, "ISO-8859-1", "windows-1252")
Here x corresponds to the string/field with the issue, but it still doesn't fix the problem.
Any thoughts?
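A note on the iconv attempt: ISO-8859-1 and windows-1252 are both single-byte encodings that use the same byte values for ó and á, so converting between them would not change those characters. When the original bytes are still present, the more usual pattern is to convert from the encoding the bytes are really in to UTF-8. A minimal sketch, assuming Latin-1 input built from raw bytes for illustration:

```r
# Latin-1 bytes for "Orientación más", built from raw values so the
# example does not depend on how this script itself is saved.
x <- rawToChar(c(charToRaw("Orientaci"), as.raw(0xf3),
                 charToRaw("n m"), as.raw(0xe1), charToRaw("s")))

# Convert from the encoding the bytes are really in to UTF-8.
fixed <- iconv(x, from = "latin1", to = "UTF-8")

# Sanity check: the result should now be valid UTF-8.
stopifnot(validUTF8(fixed))
print(fixed)
```

This cannot help once the accented bytes have already been replaced by a placeholder, because there is nothing left to convert.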
Recommended Answer
Upon further investigation, the answer is that the '�' is already the decoded result. At some point upstream the original characters were not decoded correctly, so Windows defaulted to a placeholder that basically says "I don't know what this is", and it does that for every non-ASCII character.
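One way to confirm this kind of damage is to look at the raw bytes rather than the decoded text. The sketch below assumes the placeholder is the Unicode replacement character U+FFFD, which is stored in UTF-8 as the bytes EF BF BD; has_replacement_char is a hypothetical helper, and the byte vectors stand in for the contents of the real file.

```r
# Hypothetical helper: TRUE if the raw bytes contain the UTF-8 encoding
# of U+FFFD (EF BF BD), i.e. the placeholder is stored in the data itself.
has_replacement_char <- function(bytes) {
  n <- length(bytes)
  if (n < 3) return(FALSE)
  i <- seq_len(n - 2)
  any(bytes[i] == as.raw(0xef) &
      bytes[i + 1] == as.raw(0xbf) &
      bytes[i + 2] == as.raw(0xbd))
}

# Stand-ins for bytes read from disk.
damaged <- charToRaw("Orientaci\uFFFDn m\uFFFDs")
intact  <- charToRaw("Orientaci\u00f3n m\u00e1s")

print(has_replacement_char(damaged))
print(has_replacement_char(intact))
```

On a real file the bytes could be obtained with readBin(path, what = "raw", n = file.size(path)). If the placeholder bytes are in the file itself, no choice of reader or encoding argument in R can bring the original characters back.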
For example, once this point is reached there is no way to distinguish between á and ¿. Crosswalks (lookup tables) are available for these types of characters, but they wouldn't work here, because the replacement would have to happen at the language level, which is an entirely different issue.
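The loss of information can be illustrated with a toy example. lossy_decode below is a hypothetical stand-in for whatever step produced the placeholder, not the actual Windows behaviour: it replaces every non-ASCII byte with the same "?" marker.

```r
# Hypothetical stand-in for a lossy decoder: every byte above 0x7f
# (i.e. every non-ASCII byte) is replaced by the same "?" placeholder.
lossy_decode <- function(bytes) {
  bytes[as.integer(bytes) > 127L] <- charToRaw("?")
  rawToChar(bytes)
}

# In Latin-1, 0xe1 is "á" and 0xbf is "¿".
with_a_acute  <- c(charToRaw("m"), as.raw(0xe1), charToRaw("s"))
with_inverted <- c(charToRaw("m"), as.raw(0xbf), charToRaw("s"))

# Both inputs collapse to the same string, so the original byte
# cannot be recovered from the output alone.
print(lossy_decode(with_a_acute))
print(identical(lossy_decode(with_a_acute), lossy_decode(with_inverted)))
```

Because two different inputs map to one output, no lookup table can invert the mapping; only context (the surrounding word and its language) can suggest what the character was.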
Essentially, one would either have to replace or remove the '�' and then run a spell checker in multiple languages.
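The replace-or-remove step might look like the sketch below. It assumes the placeholder is U+FFFD (adjust repl if the data contains a literal "?" instead) and uses made-up sample values; the spell-checking step is not shown.

```r
repl <- "\uFFFD"  # assumption: the placeholder is U+FFFD

x <- c("Orientaci\uFFFDn m\uFFFDs", "plain ascii")

# Flag the values that need manual or spell-checker review.
needs_review <- grepl(repl, x, fixed = TRUE)

# Option 1: remove the placeholder entirely.
removed <- gsub(repl, "", x, fixed = TRUE)

# Option 2: replace it with a visible ASCII marker for later correction.
marked <- gsub(repl, "_", x, fixed = TRUE)

print(data.frame(needs_review, removed, marked))
```

Keeping a marker (option 2) preserves the position of each lost character, which tends to make later correction easier than working from the shortened strings.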