How does R handle Unicode / UTF-8?


Question

If I write

`Δ` <- function(a,b)   (a-b)/a 

then I can include U+0394 as long as it's enclosed in backticks. (By contrast, Δ <- function(a,b) (a-b)/a fails with unexpected input in "�".) So apparently R parses UTF-8 or Unicode or something like that. The assignment goes well, and so does the evaluation of, e.g.,

`Δ`(1:5, 9:13)

And I can also evaluate Δ(1:5, 9:13).

Finally, if I define something like winsorise <- function(x, λ=.05) { ... }, then λ (U+03BB) doesn't need to be "introduced to" R with a backtick. I can then call winsorise(data, .1) with no problems.

The only mention of Unicode I can find in R's documentation is over my head. Could someone who understands it better explain what's going on "under the hood" when R needs the ` to understand assignment to ♔, but can parse ♔(a,b,c) once assigned?

Recommended answer

I can't speak to what's going on under the hood regarding function calls vs. function arguments, but this email from Prof. Ripley from 2008 may shed some light (excerpt below):

R passes around, prints and plots UTF-8 character data pretty well, but it translates to the native encoding for almost all character-level manipulations (and not just on Windows). ?Encoding spells out the exceptions [...]

Windows has no UTF-8 locales, but rather expects to work with UCS-2 strings. R (being written in standard C) would not work internally with UCS-2 without extensive changes.
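As a rough illustration of the declared encodings that ?Encoding talks about, the sketch below builds a string from a \u escape, converts it to Latin-1 with iconv(), and compares the two. The example word and the variable names are arbitrary choices for this sketch, not something taken from the email.

```r
# A string containing a \u escape is stored in UTF-8 with that encoding declared.
x <- "fa\u00e7ile"
Encoding(x)

# Convert to Latin-1; iconv() marks the result with its new encoding.
y <- iconv(x, from = "UTF-8", to = "latin1")
Encoding(y)

# Same characters, different byte representations.
nchar(x, type = "chars")   # 6
nchar(x, type = "bytes")   # 7 (the cedilla takes two bytes in UTF-8)
nchar(y, type = "bytes")   # 6 (one byte per character in Latin-1)

# Comparison translates between encodings, so the strings compare equal.
x == y

# Pure ASCII strings never carry a declared encoding.
Encoding("abc")            # "unknown"
```

enc2native() and enc2utf8() perform similar conversions to and from the session's native encoding; what enc2native() returns for a character the locale cannot represent depends on the platform.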

The R documentation for ?Quotes explains how you can sometimes use out-of-locale characters anyway (emphasis added):

Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.

The definition of a letter depends on the current locale, but only ASCII digits are considered to be digits.

Such identifiers are also known as syntactic names and may be used directly in R code. Almost always, other names can be used provided they are quoted. The preferred quote is the backtick (`), and deparse will normally use it, but under many circumstances single or double quotes can be used (as a character constant will often be converted to a name). One place where backticks may be essential is to delimit variable names in formulae: see formula.
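The rules quoted above can be checked from R itself. The sketch below relies on make.names(), which rewrites a string into a syntactic name, so a string that comes back unchanged was already syntactic. The helper is_syntactic is a hypothetical name introduced only for this example.

```r
# Hypothetical helper: TRUE when the string is already a syntactic name.
is_syntactic <- function(x) make.names(x) == x

is_syntactic("winsorise")   # TRUE
is_syntactic("2fast")       # FALSE: starts with a digit
is_syntactic("_tmp")        # FALSE: starts with an underscore
is_syntactic(".5x")         # FALSE: period followed by a digit
is_syntactic("if")          # FALSE: reserved word

# Whether a non-ASCII letter counts as a letter depends on the current locale,
# so no result is claimed for this call.
is_syntactic("\u0394")

# Non-syntactic names can still be used when quoted with backticks.
`my var` <- 42
`my var` + 1                # 43
```

Because the definition of a letter is locale-dependent, the Δ check may give different answers on different machines, which would be consistent with the behaviour described in the question.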

There is another way to get at such characters, which is to use a Unicode escape sequence (like \u0394 for Δ). This is usually a bad idea if you're using that character for anything other than text on a plot (i.e., don't do this for variable or function names; cf. this quote from the R 2.7 release notes, when much of the current UTF-8 support was added):

If a string presented to the parser contains a \uxxxx escape invalid in the current locale, the string is recorded in UTF-8 with the encoding declared. This is likely to throw an error if it is used later in the session, but it can be printed, and used for e.g. plotting on the windows() device. So "\u03b2" gives a Greek small beta and "\u2642" a 'male sign'. Such strings will be printed as e.g. <U+2642> except in the Rgui console (see below).
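A minimal sketch of the two escapes mentioned in the release notes, used only as text. The variable names and the plot title are made up for illustration, and the plot call is guarded so it only runs in an interactive session.

```r
beta_label <- "\u03b2"   # Greek small beta
male_label <- "\u2642"   # male sign

# Recover the code points to confirm what the escapes produced.
sprintf("U+%04X", utf8ToInt(beta_label))   # "U+03B2"
sprintf("U+%04X", utf8ToInt(male_label))   # "U+2642"

# Escapes keep the source file ASCII-only while still labelling a plot.
# Whether the glyph actually renders depends on the graphics device and font.
plot_title <- paste("Estimate of", beta_label)
if (interactive()) {
  plot(1:10, main = plot_title)
}
```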

I think this addresses most of your questions, though I don't know why there is a difference between the function name and function argument examples you gave; hopefully someone more knowledgeable can chime in on that. FYI, on Linux all of these different ways of assigning and calling a function work without error (because the system locale is UTF-8, so no translation need occur):

Δ <- function(a,b) (a-b)/a         # no error
`Δ` <- function(a,b) (a-b)/a       # no error
"Δ" <- function(a,b) (a-b)/a       # no error
"\u0394" <- function(a,b) (a-b)/a  # no error
Δ(1:5, 9:13)        # -8.00 -4.00 -2.67 -2.00 -1.60
`Δ`(1:5, 9:13)      # same
"Δ"(1:5, 9:13)      # same
"\u0394"(1:5, 9:13) # same

sessionInfo()

# R version 3.1.2 (2014-10-31)
# Platform: x86_64-pc-linux-gnu (64-bit)

# locale:
# LC_CTYPE=en_US.UTF-8    LC_NUMERIC=C                LC_TIME=en_US.UTF-8
# LC_COLLATE=en_US.UTF-8  LC_MONETARY=en_US.UTF-8     LC_MESSAGES=en_US.UTF-8
# LC_PAPER=en_US.UTF-8    LC_NAME=C                   LC_ADDRESS=C
# LC_TELEPHONE=C          LC_MEASUREMENT=en_US.UTF-8  LC_IDENTIFICATION=C

# attached base packages:
# stats  graphics  grDevices  utils  datasets  methods  base  
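To see which situation a given session is in, one could check the locale and probe the parser directly. This is only a sketch: parses is a hypothetical helper written for this example, and no result is claimed for the unquoted Δ case because it depends on the locale.

```r
# Is the current session running in a UTF-8 locale?
l10n_info()[["UTF-8"]]
Sys.getlocale("LC_CTYPE")

# Hypothetical helper: TRUE when the code string parses without error.
parses <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

parses("x <- 1")            # TRUE
parses("1 <- <-")           # FALSE: syntax error

# Backtick-quoted and unquoted forms of the name from the question.
parses("`\u0394` <- 1")
parses("\u0394 <- 1")       # locale-dependent
```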
