URI.unescape crashes when trying to convert "%C3%9Fą" to "ßą"


Question

I am using URI.unescape to unescape text, but unfortunately I run into a weird error:

 # encoding: utf-8
 require('uri')
 URI.unescape("%C3%9Fą")

This results in:

 C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:331:in `gsub': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
    from C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:331:in `unescape'
    from C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:649:in `unescape'
    from exe/fail.rb:3:in `<main>'

Why?

Recommended answer

The implementation of URI.unescape is broken for non-ASCII inputs. The 1.9.3 version looks like this:

def unescape(str, escaped = @regexp[:ESCAPED])
  str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(str.encoding)
end

The regex in use is /%[a-fA-F\d]{2}/, so it goes through the string looking for a percent sign followed by two hex digits. In the block, $& is the matched text ('%C3' for example) and $&[1,2] is the matched text without the leading percent sign ('C3'). Then we call String#hex to convert that hexadecimal number to a Fixnum (195) and wrap it in an Array ([195]) so that we can use Array#pack to do the byte mangling for us. The problem is that pack gives us a single binary byte:

> puts [195].pack('C').encoding
ASCII-8BIT
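The individual steps described above can be traced outside of URI. This is only a sketch: the helper name trace_escape is hypothetical and not part of the uri library.

```ruby
# Hypothetical helper that replays what the gsub block does for one escape.
def trace_escape(match)
  hex_digits = match[1, 2]   # drop the leading '%'
  code = hex_digits.hex      # hexadecimal text to Integer
  byte = [code].pack('C')    # one unsigned 8-bit value packed into a String
  { hex_digits: hex_digits, code: code, bytes: byte.bytes, encoding: byte.encoding }
end

p trace_escape("%C3")
```

The :encoding entry is the binary (ASCII-8BIT) encoding, which is what the rest of the explanation hinges on.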

The ASCII-8BIT encoding is also known as "binary" (i.e. plain bytes with no particular encoding). The block then returns that byte, String#gsub tries to insert it into the UTF-8 encoded copy of str that it is working on, and you get your error:


incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)

because you can't (in general) just stuff binary bytes into a UTF-8 string. You can often get away with it, though:

URI.unescape("%C3%9F")         # Works
URI.unescape("%C3µ")           # Fails
URI.unescape("µ")              # Works, but nothing to gsub here
URI.unescape("%C3%9Fµ")        # Fails
URI.unescape("%C3%9Fpancakes") # Works

One simple fix is to switch the string to binary before trying to decode it:

def unescape(str, escaped = @regexp[:ESCAPED])
  encoding = str.encoding
  str = str.dup.force_encoding('binary')
  str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(encoding)
end
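To try this fix without patching the uri library, the same idea can be written as a standalone method. This is a sketch under assumptions: the name unescape_binary and the ESCAPED constant are hypothetical, and the regex literal stands in for @regexp[:ESCAPED].

```ruby
# Hypothetical stand-in for @regexp[:ESCAPED].
ESCAPED = /%[a-fA-F\d]{2}/

# Hypothetical standalone version of the "switch to binary first" fix.
def unescape_binary(str, escaped = ESCAPED)
  encoding = str.encoding
  # Work on a binary copy so the raw bytes from pack never meet UTF-8 text.
  binary = str.dup.force_encoding(Encoding::BINARY)
  # Relabel the finished byte sequence with the original encoding.
  binary.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(encoding)
end

result = unescape_binary("%C3%9Fą")
puts result
puts result.valid_encoding?
```

Because every piece is binary while gsub runs, no compatibility check can fail; the bytes are only reinterpreted as UTF-8 once the whole string has been assembled.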

Another option is to push the force_encoding into the block:

def unescape(str, escaped = @regexp[:ESCAPED])
  str.gsub(escaped) { [$&[1, 2].hex].pack('C').force_encoding(str.encoding) }
end

I'm not sure why the gsub fails in some cases but succeeds in others.
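One plausible explanation, offered as an assumption rather than something the answer confirms: Ruby treats an ASCII-only string as compatible with any ASCII-compatible encoding, so the inputs that work are exactly those whose unescaped text is pure ASCII. Encoding.compatible? can be used to probe this; the constant names below are made up for illustration.

```ruby
# A lone binary byte, like the one the gsub block returns.
LONE_BYTE = [0xC3].pack('C')

# ASCII-only UTF-8 text: compatible? returns an Encoding (the binary one).
ASCII_ONLY = Encoding.compatible?("%C3%9Fpancakes", LONE_BYTE)

# UTF-8 text containing a non-ASCII character: compatible? returns nil.
NON_ASCII = Encoding.compatible?("%C3µ", LONE_BYTE)

p ASCII_ONLY
p NON_ASCII
```

This lines up with the list of working and failing calls: only the strings that contain a literal non-ASCII character next to an escape raise the error.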
