RStudio not picking the encoding I'm telling it to use when reading a file

Problem description

I'm trying to read the following UTF-8 encoded file in R, but whenever I read it, the unicode characters are not encoded correctly:

The script I'm using to process the file is as follows:

defaultEncoding <- "UTF8"
detalheVotacaoMunicipioZonaTypes <- c("character", "character", "factor", "factor", "factor", "factor", "factor",
                                                     "factor", "factor", "factor", "factor", "factor", "numeric", 
                                                     "numeric", "numeric", "numeric", "numeric", "numeric",
                                                     "numeric", "numeric", "numeric", "numeric", "numeric", 
                                                     "numeric", "character", "character")

readDetalheVotacaoMunicipioZona <- function( fileName ) {
  fileConnection = file(fileName,encoding=defaultEncoding)
  contents <- readChar(fileConnection, file.info(fileName)$size)  
  close(fileConnection)
  contents <- gsub('"', "", contents)

  columnNames <- c("data_geracao", "hora_geracao", "ano_eleicao", "num_turno", "descricao_eleicao", "sigla_uf", "sigla_ue",
                   "codigo_municipio", "nome_municipio", "numero_zona", "codigo_cargo", "descricao_cargo", "qtd_aptos", 
                   "qtd_secoes", "qtd_secoes_agregadas", "qtd_aptos_tot", "qtd_secoes_tot", "qtd_comparecimento",
                   "qtd_abstencoes", "qtd_votos_nominais", "qtd_votos_brancos", "qtd_votos_nulos", "qtd_votos_legenda", 
                   "qtd_votos_anulados", "data_ult_totalizacao", "hora_ult_totalizacao")

  read.csv(text=contents, 
           colClasses=detalheVotacaoMunicipioZonaTypes,
           sep=";", 
           col.names=columnNames, 
           fileEncoding=defaultEncoding,
           header=FALSE)
}



I read the file, passing UTF-8 as the encoding, remove all quotes (even the numbers are quoted, so I need to clean them up), and then feed the contents to read.csv. It reads and processes the file correctly, but it does not seem to use the encoding information I'm giving it.

What should I do to make it use UTF-8 to read this file?

I'm using RStudio on OSX if it makes any difference.

Recommended answer

This problem is caused by the wrong locale being set, whether inside RStudio or command-line R:


  1. If the problem only happens in RStudio and not in command-line R, go to RStudio -> Preferences: General, tell us what 'Default text encoding:' is set to, click 'Change', and try Windows-1252, UTF-8 or ISO8859-1 ('latin1') (or 'Ask' if you always want to be prompted). Screenshot attached at bottom. Let us know which one worked!

  2. If the problem also happens in command-line R, do the following:

Run locale -m on your Mac and tell us whether it supports CP1252 or ISO8859-1 ('latin1'). Dump the list of supported locales (locale -a) if you need to. (You might as well tell us your version of MacOS while you're at it.)

For each of those two encodings, try switching to the corresponding locale:

# first try Windows CP1252, although that's almost surely not supported on Mac:
Sys.setlocale("LC_ALL", "pt_PT.1252") # Do not omit the "LC_ALL" first argument, or the call will fail.
Sys.setlocale("LC_ALL", "pt_PT.CP1252") # the name might need to be 'CP1252'

# next try ISO8859-1 ('latin1'), this works for me:
Sys.setlocale("LC_ALL", "pt_PT.ISO8859-1")

# Try "pt_PT.UTF-8" too...

# In your program, make sure Sys.setlocale() worked: add this assertion before attempting read.csv:
stopifnot(Sys.getlocale('LC_CTYPE') == "pt_PT.ISO8859-1")


That should work. Strictly speaking, the Sys.setlocale() call should go in your ~/.Rprofile so it runs at startup, not inside your R session or source code. However, Sys.setlocale() can fail, so be aware of that. Also, assert Sys.getlocale() inside your setup code early and often, as I do. (Really, read.csv should figure out whether the encoding it uses is compatible with the locale, and warn or error if not.)
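
Because Sys.setlocale() reports failure only by returning an empty string and raising a warning, one way to act on this is a small helper that tries candidate locales in order and tells you which one was accepted. This is only a sketch: set_first_locale is a hypothetical helper name, not part of base R, and the candidate names are the ones discussed above, which may or may not exist on your system.

```r
# Hypothetical helper (not part of base R): try each locale name in turn
# and return the first one the operating system accepts, or NA if none is.
set_first_locale <- function(candidates, category = "LC_CTYPE") {
  for (loc in candidates) {
    # On failure Sys.setlocale() returns "" and issues a warning.
    result <- suppressWarnings(Sys.setlocale(category, loc))
    if (nzchar(result)) {
      return(loc)
    }
  }
  NA_character_
}

# Candidate names are assumptions; adjust them to what your system lists.
chosen <- set_first_locale(c("pt_PT.UTF-8", "pt_PT.ISO8859-1"))
if (is.na(chosen)) {
  message("None of the candidate locales is available on this system")
} else {
  message("Locale set to: ", chosen)
}
```

Placed in ~/.Rprofile, something along these lines would make a failed locale change visible at startup instead of surfacing later as garbled characters.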

Let us know which fix worked! I'm trying to document this more generally so we can figure out the correct enhancement.
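
When reporting back, it may help to include what R itself believes the locale is, and whether the file's bytes really are valid UTF-8. A minimal sketch, assuming R 3.3.0 or later (for validUTF8); is_utf8_file is a hypothetical helper name, and the demo file is a temporary one created only for illustration.

```r
# Hypothetical helper: TRUE if the raw bytes of the file form valid UTF-8.
# Note that a pure-ASCII file also counts as valid UTF-8.
is_utf8_file <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  validUTF8(rawToChar(bytes))
}

# What R believes about the current session locale.
print(Sys.getlocale("LC_CTYPE"))
print(l10n_info()[["UTF-8"]])  # TRUE when the session locale is UTF-8

# Demo: a temporary file holding the UTF-8 bytes for "S", a-tilde, "o".
demo_path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x53, 0xc3, 0xa3, 0x6f)), demo_path)
print(is_utf8_file(demo_path))
```

If the check comes back FALSE for your real file, the file is probably in another encoding such as latin1 or Windows-1252, and telling R it is UTF-8 will not help.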


Screenshot of the RStudio Preferences menu for changing the default text encoding:
