What's the difference between hex code (\x) and unicode (\u) chars?


Question



From ?Quotes:

\xnn   character with given hex code (1 or 2 hex digits)  
\unnnn Unicode character with given code (1--4 hex digits)

In the case where the Unicode character has only one or two digits, I would expect these characters to be the same. In fact, one of the examples on the ?Quotes help page shows:

"\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21"
## [1] "Hello World!"
"\u48\u65\u6c\u6c\u6f\u20\u57\u6f\u72\u6c\u64\u21"
## [1] "Hello World!"

However, under Linux, when trying to print a pound sign, I see

cat("\ua3")
## £
cat("\xa3")
## �

That is, the \x hex code fails to display correctly. (This behaviour persisted with any locale that I tried.) Under Windows 7 both versions show a pound sign.

If I convert to integer and back then the pound sign displays correctly under Linux.

cat(intToUtf8(utf8ToInt("\xa3")))
## £

Incidentally, this doesn't work under Windows, since utf8ToInt("\xa3") returns NA.

Some \x characters return NA under Windows but throw an error under Linux. For example:

utf8ToInt("\xf0")
## Error in utf8ToInt("\xf0") : invalid UTF-8 string

("\uf0" is a valid character.)

These examples show that there are some differences between \x and \u forms of characters, which seem to be OS-specific, but I can't see any logic in how they are defined.

What is the difference between these two character forms?

Solution

The escape sequence \xNN inserts the raw byte NN into a string, whereas \uNN inserts the UTF-8 bytes for the Unicode code point NN into a UTF-8 string:

> charToRaw('\xA3')
[1] a3
> charToRaw('\uA3')
[1] c2 a3
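
A quick supplementary check (not part of the original answer, but using only base R's nchar()): byte counts confirm that the \x form occupies a single byte while the \u form occupies the two bytes of its UTF-8 encoding.

> nchar('\xA3', type = 'bytes')  # a single raw byte
[1] 1
> nchar('\uA3', type = 'bytes')  # the two UTF-8 bytes c2 a3
[1] 2
> nchar('\uA3', type = 'chars')  # but still one character
[1] 1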

These two types of escape sequence cannot be mixed in the same string:

> '\ua3\xa3'
Error: mixing Unicode and octal/hex escapes in a string is not allowed

This is because the escape sequences also define the encoding of the string. A \uNN sequence explicitly sets the encoding of the entire string to "UTF-8", whereas \xNN leaves it in the default "unknown" (i.e., native) encoding:

> Encoding('\xa3')
[1] "unknown"
> Encoding('\ua3')
[1] "UTF-8"

This becomes important when printing strings, as they need to be converted into the appropriate output encoding (e.g., that of your console). Strings with a defined encoding can be converted appropriately (see enc2native), but those with an "unknown" encoding are simply output as-is:

  • On Linux, your console is probably expecting UTF-8 text, and as 0xA3 is not a valid UTF-8 sequence, it gives you "�".
  • On Windows, your console is probably expecting Windows-1252 text, and as 0xA3 is the correct encoding for "£", that's what you see. (When the string is \uA3, a conversion from UTF-8 to Windows-1252 takes place.)
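
One way to make such a byte portable, as a rough sketch assuming the byte is actually Latin-1, is to re-encode it into UTF-8 explicitly with base R's iconv():

> x <- '\xa3'                              # raw byte, encoding "unknown"
> y <- iconv(x, from = 'latin1', to = 'UTF-8')
> charToRaw(y)                             # now the UTF-8 bytes for U+00A3
[1] c2 a3
> Encoding(y)
[1] "UTF-8"
> cat(y)                                   # prints correctly on a UTF-8 console
£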

If the encoding is set explicitly, the appropriate conversion will take place on Linux:

> s <- '\xa3'
> Encoding(s) <- 'latin1'
> cat(s)
£
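
To see which encoding your console likely expects, and hence which of the two behaviours above you will get, base R exposes the locale settings. A small diagnostic sketch; the output shown is illustrative and will vary per system:

> Sys.getlocale('LC_CTYPE')    # e.g. a UTF-8 locale on Linux
[1] "en_GB.UTF-8"
> l10n_info()$`UTF-8`          # TRUE when the native encoding is UTF-8
[1] TRUE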
