Issue with Non-ASCII Characters in R

Problem description

I am loading data into R with free-text fields that contain a variety of non-ASCII/double-byte characters. Depending on the function I use to load the data, or the format in which the data is stored (.csv or .xlsx), the characters appear differently.

Specifically, if I use read.csv with a .csv file or read_excel with a .xlsx file, the characters appear something like: Orientaci�n m�s.

Meanwhile, if I use read_csv with a .csv file, they appear like this: Orientaci�n m�s.

Is there a file format/data load combination that fixes this issue? Or is there some way to decode the data in either format once it is already loaded? I have explored a variety of methods, including changing the encoding arguments where applicable and using the decoder package, but I can't get anything to work.

Any ideas?

Per comments below I have tried the following:

readr::guess_encoding("file with issue.csv")
# A tibble: 2 x 2
  encoding   confidence
  <chr>           <dbl>
1 UTF-8            1   
2 ISO-8859-1       0.52

readr::guess_encoding("file without issue.csv")
guess_encoding("Goal_Details.csv")
# A tibble: 2 x 2
  encoding     confidence
  <chr>             <dbl>
1 UTF-8              1   
2 windows-1252       0.51

iconv(x,"ISO-8859-1","windows-1252")

Here x corresponds to the string/field with the issue, but it still doesn't fix the problem.
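One possible reason the iconv call changes nothing: ISO-8859-1 and windows-1252 assign the same characters to every byte outside the range 0x80 to 0x9F, so converting between them leaves most text byte-for-byte identical. The sketch below illustrates this under the assumption that the field already holds U+FFFD stored as UTF-8; the string x is a hypothetical stand-in, not the real data.

```r
# Hypothetical stand-in for the affected field: it already contains U+FFFD,
# which UTF-8 stores as the three bytes ef bf bd.
x <- "Orientaci\uFFFDn m\uFFFDs"

# Same call as in the question: treat the bytes as ISO-8859-1 and convert
# them to windows-1252.
y <- iconv(x, "ISO-8859-1", "windows-1252")

# Compare the underlying bytes before and after the conversion.
bytes_unchanged <- identical(charToRaw(x), charToRaw(y))
print(bytes_unchanged)
```

If the bytes come back unchanged, the conversion cannot have repaired anything, whatever the console displays.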

Any ideas?

Recommended answer

Upon further investigation, the answer is that the '�' is already the decoded result: it is U+FFFD, the Unicode replacement character. At some point the original characters could not be decoded, so Windows falls back to basically saying "I don't know what this is", and it does that for any non-ASCII character.
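A quick way to check whether a loaded value is in this state is to inspect its code points. This is a minimal sketch, assuming the string is valid UTF-8; x is a hypothetical example value rather than the asker's data.

```r
# Hypothetical example value containing the replacement character.
x <- "Orientaci\uFFFDn m\uFFFDs"

# Unicode code points of each character in the string.
cps <- utf8ToInt(x)

# U+FFFD is code point 65533; count how many characters were replaced.
has_replacement <- any(cps == 65533L)
n_replacement <- sum(cps == 65533L)

print(has_replacement)
print(n_replacement)
```

If has_replacement is TRUE, the original bytes are gone from that value and no re-encoding of the loaded string will bring them back; the fix has to happen upstream, where the data was first decoded.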

For example, there is no distinguishing between á and ¿ once this point is reached. Crosswalks (character mapping tables) are available for these types of characters, but they wouldn't work here, because the replacement would need to happen at the language level, which is an entirely different issue.
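The loss of information can be illustrated with a small simulation. The helper simulate_lossy_decode below is hypothetical and only mimics the effect described above; it is not what Windows or any reader function literally does.

```r
# Hypothetical helper: replaces every non-ASCII code point with U+FFFD
# (65533) to mimic a decoding step that gave up on those characters.
simulate_lossy_decode <- function(s) {
  cps <- utf8ToInt(s)
  cps[cps > 127L] <- 65533L
  intToUtf8(cps)
}

# Two different inputs: one contains a-acute (U+00E1), the other an
# inverted question mark (U+00BF).
a <- simulate_lossy_decode("m\u00e1s")
b <- simulate_lossy_decode("m\u00bfs")

# After the lossy step the two results are the same string.
same_after_loss <- identical(a, b)
print(same_after_loss)
```

Because both inputs collapse to the same output, no lookup table can map the output back to a single original character.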

Essentially, one would either have to replace or remove the '�' and then run a spell checker in multiple languages.
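The replace-or-remove step might look like the following in base R. This is a sketch under assumptions: the vector x and the "?" placeholder are illustrative choices, and the spell-checking step is not shown.

```r
# Hypothetical example data: one affected value and one clean value.
x <- c("Orientaci\uFFFDn m\uFFFDs", "plain text")
replacement_char <- "\uFFFD"

# Flag the values that contain at least one replacement character.
needs_review <- grepl(replacement_char, x, fixed = TRUE)

# Option 1: remove the replacement characters entirely.
removed <- gsub(replacement_char, "", x, fixed = TRUE)

# Option 2: swap them for a visible ASCII placeholder.
placeholder <- gsub(replacement_char, "?", x, fixed = TRUE)

print(needs_review)
print(removed)
print(placeholder)
```

Flagging the affected rows first keeps the later spell-check pass limited to the values that actually lost characters.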
