Print unicode character string in R

Problem description

I entered a text string in a .csv file that includes a Unicode escape, \U00B5 g/dL, and then read the file into an R data frame:

test=read.csv("test.csv")

\U00B5 should produce the micro sign, µ. R reads the text into the data frame exactly as typed (\U00B5), but when I print the string it shows as \\U00B5 g/dL.
Typing the escape directly into R code, on the other hand, works fine:

varname <- c("a", "b", "c")
labels <- c("A \U00B5 g/dL", "B \U00B5 g/dL", "C \U00B5 g/dL")
df <- data.frame(varname, labels)
test <- data.frame(varname, labels)
test
#  varname   labels
#  1       a A µ g/dL
#  2       b B µ g/dL
#  3       c C µ g/dL

I wonder how I could get rid of the escape character \ in this case and have R print the symbol. Or is there another way to print the symbol in R?

Thank you very much for your help!

Recommended answer

Well, first understand that in an R string literal the backslash "\" introduces an escape sequence, which is how you write characters that are awkward to type directly, including ones outside the standard ASCII range. Because the backslash has this special role, a literal backslash must itself be escaped when you write a string in R:

a <- "\" # error
a <- "\\" # ok.

The "\U" is a special indicator for Unicode escaping. Note that there are no backslashes or U's in the string itself when you use this escape; it is just a shortcut for a specific character. For example:

a <- "\U00B5"
cat(a)
# µ
grep("U",a)
# integer(0)
nchar(a)
# [1] 1

This is very different from the string

a <- "\\U00B5"
cat(a)
# \U00B5
grep("U",a)
# [1] 1
nchar(a)
# [1] 6
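
One way to make the difference concrete is to look at the code points each string actually contains. This is a small sketch using base R's utf8ToInt(); the variable names real_char and literal are only illustrative.

```r
real_char <- "\U00B5"   # parsed by R into the single character µ
literal   <- "\\U00B5"  # six characters: backslash, U, 0, 0, B, 5

# One code point: 181 is hexadecimal B5, the micro sign
utf8ToInt(real_char)
# [1] 181

# Six code points, all plain ASCII
utf8ToInt(literal)
# [1] 92 85 48 48 66 53
```

The second string is what ends up in the data frame when the file contains the text \U00B5, which is why print() shows it with a doubled backslash.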

Normally, when you import a text file, non-ASCII characters are stored in whatever encoding the file uses (UTF-8 and Latin-1 are the most common), and those encodings have special bytes to represent such characters. It's not "normal" for a text file to contain an ASCII escape sequence standing in for a Unicode character. This is why R doesn't attempt to convert the text \U00B5 to a Unicode character: it assumes that if you had wanted a Unicode character, you would have just used it directly.
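
The contrast can be reproduced with two small temporary files. This is only a sketch: the file contents and the names escaped_file and utf8_file are made up for illustration, and it assumes the second file is saved as UTF-8.

```r
# Two temporary CSV files: one holds the literal text \U00B5,
# the other holds the real micro sign stored as UTF-8 bytes.
escaped_file <- tempfile(fileext = ".csv")
utf8_file    <- tempfile(fileext = ".csv")

writeLines(c("labels", "A \\U00B5 g/dL"), escaped_file)
# useBytes = TRUE writes the UTF-8 bytes as they are, without re-encoding
writeLines(c("labels", "A \u00B5 g/dL"), utf8_file, useBytes = TRUE)

escaped <- read.csv(escaped_file, stringsAsFactors = FALSE)
# encoding = "UTF-8" tells R to mark the strings it reads as UTF-8
utf8 <- read.csv(utf8_file, encoding = "UTF-8", stringsAsFactors = FALSE)

nchar(escaped$labels)  # the escape is counted as six separate characters
# [1] 13
nchar(utf8$labels)     # the micro sign is counted as one character
# [1] 8
```

If you control how the .csv file is produced, storing the actual character in a known encoding avoids the need for any unescaping step.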

The easiest way to re-interpret your ASCII escape sequences would be to use the stringi package. For example:

library(stringi)
a <- "\\U00B5"
stri_unescape_unicode(gsub("\\U","\\u",a, fixed=TRUE))

(The only catch is that we need to convert "\U" to the more common "\u" so that the function properly recognizes the escape.) You can apply this to your imported data with

test$labels <- stri_unescape_unicode(gsub("\\U","\\u",test$labels, fixed=TRUE))
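
If adding stringi as a dependency is not an option, roughly the same unescaping can be sketched in base R. The helper name unescape_unicode and its regular expression are assumptions made for this illustration, not part of the original answer, and it only handles \U and \u escapes followed by four to eight hexadecimal digits.

```r
# Hypothetical helper: replace literal "\UXXXX" or "\uXXXX" text with the
# character it names, using only base R.
unescape_unicode <- function(x) {
  x <- as.character(x)
  # A backslash, then U or u, then 4 to 8 hex digits (matched greedily,
  # so hex-looking letters right after the escape would be swallowed too)
  pattern <- "\\\\[Uu][0-9A-Fa-f]{4,8}"
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    hex <- substring(hits, 3)  # drop the leading backslash and U/u
    vapply(hex, function(h) intToUtf8(strtoi(h, 16L)),
           character(1), USE.NAMES = FALSE)
  })
  x
}

labels <- c("A \\U00B5 g/dL", "no escape here")
unescape_unicode(labels)
# [1] "A µ g/dL"       "no escape here"
```

Applied to the imported data this would be test$labels <- unescape_unicode(test$labels), assuming the column is called labels as in the question's example.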
