Importing data with special characters in R


Question

The following picture shows the data before I import it into R (viewed in Notepad) and after importing.

I use the following command to import it into R:

Data <- read.csv("data.csv", stringsAsFactors = FALSE, header = TRUE, quote = "")

As can be seen, special characters such as æ are replaced with something like Ã¦ (line 19 on the left, line 18 on the right). Is there a way to import the CSV file as it is, using R?

Recommended answer

Your problem is an encoding issue, and there are two aspects to it. First, what Notepad++ saves may not be in the encoding you expect. Second, read.csv() may be reading the file under a different encoding. The latter is especially likely here: using Notepad++ suggests you are on Windows, where you may be unable to use UTF-8 as the system locale for R.

So, taking each issue in turn:


  1. Getting Notepad++ to save your file in a specific encoding. You can set the encoding for new files in Notepad++'s preferences. I always use UTF-8, but since your texts are Danish, Latin-1 should work too.

To verify the encoding of your texts, you may wish to use the file utility supplied with RTools (https://cran.r-project.org/bin/windows/Rtools/index.html). Run from the command line, it reports the probable encoding of your file, although it is not perfect. (OS X and Linux users already have this utility without installing anything extra.)
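If the file utility is not to hand, a rough check can also be done from within R. What follows is a minimal sketch rather than part of the original answer: looks_like_utf8 is a hypothetical helper name, and it relies on base R's validUTF8(). It only says whether the bytes form valid UTF-8. A pure-ASCII file passes too, and a FALSE result does not reveal which legacy encoding was used.

```r
# Hypothetical helper: TRUE if every line of the file is valid UTF-8.
looks_like_utf8 <- function(path) {
  # Open in binary mode so the bytes are read without any re-encoding.
  con <- file(path, open = "rb")
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE)
  all(validUTF8(lines))
}
```

Calling looks_like_utf8("data.csv") on the file from the question would then suggest whether "UTF-8" is a plausible encoding to pass to read.csv().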

  2. Setting the encoding when importing the .csv file into R. When you import the file using read.csv(), specify encoding = "UTF-8" or encoding = "latin1", matching how the file was saved. You might also want to check what your system encoding is and match that. You can check it with Sys.getlocale() (and set it with Sys.setlocale()). On my system, for instance:

> Sys.getlocale()
[1] "en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8"

You could of course set this to Windows-1252, but you might then have trouble with portability if you use it on other platforms. UTF-8 is the best solution to this.
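To make the import step concrete, here is a hedged sketch of one way to end up with UTF-8 strings whatever the system locale is. It is a variation on the advice above, not the only way to do it: read_csv_as_utf8 is a hypothetical helper name, the default from = "latin1" is an assumption about how the file was saved, and the conversion uses base R's iconv() before the text is handed to read.csv().

```r
# Hypothetical helper: read a CSV saved in a known encoding and
# return its character columns as UTF-8.
read_csv_as_utf8 <- function(path, from = "latin1", ...) {
  # Read the raw lines in binary mode, with no implicit re-encoding.
  con <- file(path, open = "rb")
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE)

  # Convert from the declared file encoding (an assumption) to UTF-8.
  lines <- iconv(lines, from = from, to = "UTF-8")

  # Parse the converted text instead of the file on disk.
  read.csv(text = lines, stringsAsFactors = FALSE, ...)
}

# Self-contained demonstration: a tiny Latin-1 file containing "æble".
demo_path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("name,value\n"), as.raw(0xE6), charToRaw("ble,1\n")),
         demo_path)
demo <- read_csv_as_utf8(demo_path, from = "latin1")
```

If the file is already UTF-8, pass from = "UTF-8". When the locale itself is UTF-8, the shorter read.csv("data.csv", fileEncoding = "latin1") should have a similar effect, because fileEncoding asks R to re-encode the file into the native encoding while reading.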
