How does R handle Unicode / UTF-8?

Question

If I write

`Δ` <- function(a,b)   (a-b)/a 

then I can include U+0394 so long as it's enclosed in backticks. (By contrast, Δ <- function(a,b) (a-b)/a fails with unexpected input in "�".) So apparently R parses UTF-8 or Unicode or something like that. The assignment goes well, and so does the evaluation of, e.g.,

`Δ`(1:5, 9:13)

I can also evaluate Δ(1:5, 9:13) without the backticks.

Finally, if I define something like winsorise <- function(x, λ=.05) { ... } then λ (U+03BB) doesn't need to be "introduced to" R with a backtick. I can then call winsorise(data, .1) with no problems.

The only mention of Unicode I can find in R's documentation is over my head. Could someone who understands it better explain to me what's going on "under the hood" when R needs the ` to understand assignment to ♔, but can parse ♔(a,b,c) once assigned?

Recommended answer

I can't speak to what's going on under the hood regarding function calls vs. function arguments, but this email from Prof. Ripley from 2008 may shed some light (excerpt below):

R passes around, prints and plots UTF-8 character data pretty well, but it translates to the native encoding for almost all character-level manipulations (and not just on Windows). ?Encoding spells out the exceptions [...]
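To make the distinction between "stored as UTF-8" and "translated to the native encoding" more concrete, here is a minimal sketch using base R's Encoding(), enc2utf8() and enc2native(). The variable names are illustrative only, and what enc2native() returns depends on the locale of the session, so no output is shown for it.

```r
# A string entered with a Unicode escape (Greek capital delta).
x <- "\u0394"

Encoding(x)                # the encoding mark R has declared for this string

# One character, but two bytes when encoded as UTF-8.
nchar(x, type = "chars")   # 1
nchar(x, type = "bytes")   # 2
charToRaw(enc2utf8(x))     # ce 94

# Translation to the session's native encoding. In a UTF-8 locale this is
# a no-op; in a locale that cannot represent the character, the result is
# typically a substitute such as "<U+0394>".
x_native <- enc2native(x)
```

Comparing x_native with x in a given session is a quick way to see whether that session's locale can represent the character at all.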

And from the documentation the OP linked to:

Windows has no UTF-8 locales, but rather expects to work with UCS-2 strings. R (being written in standard C) would not work internally with UCS-2 without extensive changes.

The R documentation for ?Quotes explains how you can sometimes use out-of-locale characters anyway (emphasis added):

Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.

The definition of a letter depends on the current locale, but only ASCII digits are considered to be digits.

Such identifiers are also known as syntactic names and may be used directly in R code. Almost always, other names can be used provided they are quoted. The preferred quote is the backtick (`), and deparse will normally use it, but under many circumstances single or double quotes can be used (as a character constant will often be converted to a name). One place where backticks may be essential is to delimit variable names in formulae: see formula.
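The rules quoted above can be checked from within R. The sketch below uses base R's make.names(), which rewrites a string into a syntactic name, to build a small helper; is_syntactic() is a hypothetical name introduced here for illustration, not a base R function. Because "letter" is locale-dependent, the result for a name such as "Δ" may differ between sessions, so only ASCII examples are shown.

```r
# Hypothetical helper: a name is syntactic if make.names() leaves it unchanged.
is_syntactic <- function(x) make.names(x) == x

is_syntactic("abc")     # TRUE
is_syntactic("a.b_c")   # TRUE:  letters, period and underscore are allowed
is_syntactic("1abc")    # FALSE: starts with a digit
is_syntactic("_x")      # FALSE: starts with an underscore
is_syntactic(".2way")   # FALSE: period followed by a digit
is_syntactic("if")      # FALSE: reserved word

# A non-syntactic name can still be used as long as it is quoted.
`my var` <- 3
`my var` + 1            # 4
```

The same quoting is what makes the backticked assignment in the question work regardless of whether the current locale treats Δ as a letter.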

There is another way to get at such characters, which is to use a Unicode escape sequence (like \u0394 for Δ). This is usually a bad idea if you're using that character for anything other than text on a plot (i.e., don't do this for variable or function names; cf. this quote from the R 2.7 release notes, when much of the current UTF-8 support was added):

If a string presented to the parser contains a \uxxxx escape invalid in the current locale, the string is recorded in UTF-8 with the encoding declared. This is likely to throw an error if it is used later in the session, but it can be printed, and used for e.g. plotting on the windows() device. So "\u03b2" gives a Greek small beta and "\u2642" a 'male sign'. Such strings will be printed as e.g. <U+2642> except in the Rgui console (see below).
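As a small illustration of the escapes mentioned in the release notes, the following sketch converts between escaped strings and their code points with base R's utf8ToInt() and intToUtf8(). The variable names are illustrative only; how the strings themselves print depends on the console and locale, so only the numeric results are annotated.

```r
beta <- "\u03b2"   # Greek small letter beta
male <- "\u2642"   # male sign

# Code points as integers.
utf8ToInt(beta)    # 946  (0x03B2)
utf8ToInt(male)    # 9794 (0x2642)

# Format a code point in the usual U+XXXX notation.
code_label <- sprintf("U+%04X", utf8ToInt(male))
code_label         # "U+2642"

# And back again: build the character from its code point.
beta_again <- intToUtf8(0x03b2)
beta_again == beta # TRUE
```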

I think this addresses most of your questions, though I don't know why there is a difference between the function name and function argument examples you gave; hopefully someone more knowledgeable can chime in on that. FYI, on Linux all of these different ways of assigning and calling a function work without error (because the system locale is UTF-8, so no translation need occur):

Δ <- function(a,b) (a-b)/a         # no error
`Δ` <- function(a,b) (a-b)/a       # no error
"Δ" <- function(a,b) (a-b)/a       # no error
"\u0394" <- function(a,b) (a-b)/a  # no error
Δ(1:5, 9:13)        # -8.00 -4.00 -2.67 -2.00 -1.60
`Δ`(1:5, 9:13)      # same
"Δ"(1:5, 9:13)      # same
"\u0394"(1:5, 9:13) # same

sessionInfo()

# R version 3.1.2 (2014-10-31)
# Platform: x86_64-pc-linux-gnu (64-bit)

# locale:
# LC_CTYPE=en_US.UTF-8    LC_NUMERIC=C                LC_TIME=en_US.UTF-8
# LC_COLLATE=en_US.UTF-8  LC_MONETARY=en_US.UTF-8     LC_MESSAGES=en_US.UTF-8
# LC_PAPER=en_US.UTF-8    LC_NAME=C                   LC_ADDRESS=C
# LC_TELEPHONE=C          LC_MEASUREMENT=en_US.UTF-8  LC_IDENTIFICATION=C

# attached base packages:
# stats  graphics  grDevices  utils  datasets  methods  base  
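Since the behaviour above hinges on whether the session runs in a UTF-8 locale, it can help to check that directly rather than read it off the sessionInfo() output. This is a minimal sketch using base R's l10n_info() and Sys.getlocale(); the variable names are illustrative, and the values naturally differ between systems.

```r
# Localization details for the current session.
info <- l10n_info()

# TRUE if the session's locale uses UTF-8, FALSE otherwise.
is_utf8_locale <- info[["UTF-8"]]
is_utf8_locale

# The character-type locale, e.g. "en_US.UTF-8" on the Linux system above.
ctype <- Sys.getlocale("LC_CTYPE")
ctype
```

If is_utf8_locale is FALSE, unquoted names such as Δ may be rejected by the parser, which is consistent with the error reported in the question.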
