URI.unescape crashes when trying to convert "%C3%9Fą" to "ßą"
Question
I am using URI.unescape to unescape text, but I run into a strange error:
# encoding: utf-8
require('uri')
URI.unescape("%C3%9Fą")
results in
C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:331:in `gsub': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
from C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:331:in `unescape'
from C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:649:in `unescape'
from exe/fail.rb:3:in `<main>'
Why?
Recommended answer
The implementation of URI.unescape is broken for non-ASCII inputs. The 1.9.3 version looks like this:
def unescape(str, escaped = @regexp[:ESCAPED])
  str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(str.encoding)
end
The regex in use is /%[a-fA-F\d]{2}/. So it goes through the string looking for a percent sign followed by two hex digits; in the block, $& will be the matched text ('%C3' for example) and $&[1,2] will be the matched text without the leading percent sign ('C3'). Then we call String#hex to convert that hexadecimal number to a Fixnum (195) and wrap it in an Array ([195]) so that we can use Array#pack to do the byte mangling for us. The problem is that pack gives us a single binary byte:
> puts [195].pack('C').encoding
ASCII-8BIT
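To make those steps concrete, here is a minimal sketch that replays what the block does for a single match. The helper name decode_escape is hypothetical and is not part of the URI library; it only mirrors the expression described above.

```ruby
# Hypothetical helper: repeats the body of the gsub block for one match.
def decode_escape(match)
  hex_digits = match[1, 2]     # "%C3" -> "C3" (drop the leading percent sign)
  byte_value = hex_digits.hex  # "C3"  -> 195
  [byte_value].pack('C')       # 195   -> a one-byte string with no text encoding
end

byte = decode_escape('%C3')
p byte.bytes                                # the single byte value, 195
puts byte.encoding == Encoding::ASCII_8BIT  # the byte is tagged as binary
```

The point of the sketch is the last line: whatever encoding the input had, the string coming out of pack is binary.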
The ASCII-8BIT encoding is also known as "binary" (i.e. plain bytes with no particular encoding). The block then returns that byte, String#gsub tries to insert it into the UTF-8 encoded copy of str that it is working on, and you get your error:
incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
because you can't (in general) just stuff binary bytes into a UTF-8 string. You can often get away with it, though:
URI.unescape("%C3%9F") # Works
URI.unescape("%C3µ") # Fails
URI.unescape("µ") # Works, but nothing to gsub here
URI.unescape("%C3%9Fµ") # Fails
URI.unescape("%C3%9Fpancakes") # Works
One simple fix is to switch the string to binary before trying to decode it:
def unescape(str, escaped = @regexp[:ESCAPED])
  encoding = str.encoding
  str = str.dup.force_encoding('binary')
  str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(encoding)
end
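As a quick way to try the binary approach outside the URI module, the sketch below wraps it in a standalone method. The names ESCAPED_BYTE and unescape_via_binary are made up for illustration, and the regex is copied from the description above instead of being read from @regexp[:ESCAPED].

```ruby
# Hypothetical standalone version of the fix; not part of the URI library.
ESCAPED_BYTE = /%[a-fA-F\d]{2}/

def unescape_via_binary(str, escaped = ESCAPED_BYTE)
  encoding = str.encoding                            # remember the caller's encoding
  binary = str.dup.force_encoding(Encoding::BINARY)  # treat the input as raw bytes
  # Every replacement is binary too, so gsub never mixes encodings.
  decoded = binary.gsub(escaped) { [$&[1, 2].hex].pack('C') }
  decoded.force_encoding(encoding)                   # relabel the bytes as the original encoding
end

# "\u0105" is the character from the question's input, "\u00DF" is the expected sharp s.
result = unescape_via_binary("%C3%9F\u0105")
puts result == "\u00DF\u0105"
puts result.valid_encoding?
```

Working on a dup keeps the caller's string untouched, since force_encoding modifies its receiver in place.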
Another option is to push the force_encoding into the block:
def unescape(str, escaped = @regexp[:ESCAPED])
  encoding = str.encoding
  str.gsub(escaped) { [$&[1, 2].hex].pack('C').force_encoding(encoding) }
end
I'm not sure why the gsub fails in some cases but succeeds in others.
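One possible explanation, offered here as an assumption and not as something the answer confirms: Ruby treats a string that contains only 7-bit ASCII characters as compatible with any ASCII-compatible encoding, so a binary byte can be combined with "pancakes" but not with "µ". Encoding.compatible? makes that difference visible. The constant names below are made up for illustration.

```ruby
# Illustrative constants; none of these names come from the URI library.
BINARY_BYTE = [195].pack('C')  # one raw byte, tagged ASCII-8BIT
ASCII_ONLY  = "pancakes"       # only 7-bit ASCII characters
NON_ASCII   = "\u00B5"         # the micro sign, a multibyte UTF-8 character

# Returns an Encoding when the two strings can be combined, nil otherwise.
p Encoding.compatible?(ASCII_ONLY, BINARY_BYTE).nil?  # false: combinable
p Encoding.compatible?(NON_ASCII, BINARY_BYTE).nil?   # true: not combinable
```

If that is indeed the mechanism, the failing inputs in the list above are the ones where the binary bytes end up next to non-ASCII UTF-8 text.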