Is this the best way to unescape Unicode escape sequences in Ruby?
Question
I have some text that contains Unicode escape sequences like \u003C. This is what I came up with to unescape it:
string.gsub(/\u(....)/) {|m|[$1].pack("H*").unpack("n*").pack("U*")}
Is it correct? (That is, it seems to work in my tests, but can someone more knowledgeable find a problem with it?)
Recommended answer
Your regex, /\u(....)/, has some problems.
First of all, \u doesn't work the way you think it does: in 1.9 you'll get an error, and in 1.8 it will just match a single u rather than the \u pair that you're looking for. You should use /\\u/ to find the literal \u that you want.
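As a small illustration (the sample string is made up, not taken from the question), /\\u/ matches only the backslash-plus-u pair, while a bare /u/ also matches every ordinary u:

```ruby
# Hypothetical sample text; single quotes keep the backslash literal.
s = 'And u1123 but also \u03bc'

escaped_hits = s.scan(/\\u/)   # only the two-character sequence backslash + u
bare_hits    = s.scan(/u/)     # every u, whether or not a backslash precedes it

p escaped_hits.length  # 1
p bare_hits.length     # 3
p s.index(/\\u/)       # 19, the position of the backslash
```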
Secondly, your (....) group is much too permissive: it allows any four characters through, and that's not what you want. In 1.9 you want (\h{4}) (four hexadecimal digits), but in 1.8 you'd need ([\da-fA-F]{4}), as \h is new in 1.9.
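To see the difference between the two groups, here is a rough sketch with an invented input that contains one real escape and one bogus one. The constant names are only for illustration:

```ruby
# Hypothetical sample input; single quotes keep the backslashes literal.
text = 'caf\u00e9 costs \uZZZZ'

PERMISSIVE = /\\u(....)/            # any four characters after \u
STRICT     = /\\u([\da-fA-F]{4})/   # exactly four hex digits after \u

permissive_hits = text.scan(PERMISSIVE).flatten
strict_hits     = text.scan(STRICT).flatten

p permissive_hits  # ["00e9", "ZZZZ"]
p strict_hits      # ["00e9"]
```

The permissive pattern would hand "ZZZZ" to the conversion step, which is not a hex number at all; the strict one simply leaves that text alone.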
So if you want your regex to work in both 1.8 and 1.9, you should use /\\u([\da-fA-F]{4})/. This gives you the following in 1.8 and 1.9:
>> s = 'Where is \u03bc pancakes \u03BD house? And u1123!'
=> "Where is \\u03bc pancakes \\u03BD house? And u1123!"
>> s.gsub(/\\u([\da-fA-F]{4})/) {|m| [$1].pack("H*").unpack("n*").pack("U*")}
=> "Where is μ pancakes ν house? And u1123!"
Using pack and unpack to mangle the hex number into a Unicode character is probably good enough, but there may be better ways.
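One possibly simpler variant, offered as a sketch rather than as the definitive answer: convert the captured digits with String#hex and encode the resulting code point with a single pack("U"). The method name unescape_unicode is made up for this example.

```ruby
# Hypothetical helper name; not part of the original answer.
def unescape_unicode(str)
  str.gsub(/\\u([\da-fA-F]{4})/) do
    # $1 holds the four hex digits. String#hex turns them into an Integer,
    # and Array#pack("U") encodes that code point as a UTF-8 string.
    [$1.hex].pack("U")
  end
end

puts unescape_unicode('Where is \u03bc pancakes \u03BD house? And u1123!')
```

Note that this sketch converts each \uXXXX on its own, so a character written as two surrogate halves (two consecutive \uXXXX escapes) is not combined into one code point.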