Converting integers to UTF-8 (Korean)
Question
I'm running Ruby 1.9.2 and trying to fix some broken UTF-8 text input, where the text is literally "\\354\\203\\201\\355\\221\\234\\353\\252\\205", and to change it into its correct Korean, "상표명".
However, after searching for a while and trying a few methods, I still get gibberish out. It's confusing, as the escaped-characters example (the second puts in the code below) works fine.
# encoding: utf-8
puts "상표명" # Target string
# Output: "상표명"
puts "\354\203\201\355\221\234\353\252\205" # Works with escaped characters like this
# Output: "상표명"
# Real input is a string
input = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"
# After some manipulation got it into an array of numbers
puts [354, 203,201,355,221,234,353,252,205].pack('U*').force_encoding('UTF-8')
# Output: ŢËÉţÝêšüÍ (gibberish)
I'm sure this must have been answered somewhere but I haven't managed to find it.
Recommended answer
This is what you want to do to get your UTF-8 Korean text:
s = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"
k = s.scan(/\d+/).map { |n| n.to_i(8) }.pack("C*").force_encoding('utf-8')
# "상표명"
Here's how it works:
- The input string is nice and regular, so we can use scan to pull out the individual numbers.
- Then a map with to_i(8) converts the octal values (as noted by Henning Makholm) to integers.
- Now we need to convert our list of integers to bytes, so we pack('C*') to get a byte string. This string will have the BINARY encoding (AKA ASCII-8BIT).
- We happen to know that the bytes really do represent UTF-8, so we can force the issue with force_encoding('utf-8').
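The steps above can be spelled out one at a time to see the intermediate values. This is only a sketch of the same one-liner; the variable names (digits, byte_values, raw, text) are made up for illustration and are not part of the original answer.

```ruby
# The literal input: backslash characters followed by octal digits.
input = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"

# Step 1: scan pulls out each run of digits as a string.
digits = input.scan(/\d+/)
# => ["354", "203", "201", "355", "221", "234", "353", "252", "205"]

# Step 2: to_i(8) reads each run as an octal number.
byte_values = digits.map { |n| n.to_i(8) }
# => [236, 131, 129, 237, 145, 156, 235, 170, 133]

# Step 3: pack("C*") writes each integer as one raw byte.
raw = byte_values.pack("C*")
raw.encoding == Encoding::BINARY   # => true (BINARY is an alias of ASCII-8BIT)

# Step 4: relabel the same bytes as UTF-8; dup keeps raw unchanged.
text = raw.dup.force_encoding("UTF-8")
text.valid_encoding?               # => true
text.length                        # => 3 characters, built from 9 bytes

puts text                          # prints the Korean target string
```

Note that force_encoding does not convert anything; it only changes the label on the existing bytes, which is why it is safe here only because the bytes are already known to be UTF-8.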
The main thing that you were missing was your pack format: 'U' means "UTF-8 character" and expects an array of Unicode codepoints, each represented by a single integer, whereas 'C' expects an array of bytes, and that's what we had.
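To make the difference between the two pack formats concrete, here is a small comparison on just the first character's three bytes. It is a sketch for illustration: the variable names are hypothetical, and the bytes are written as Ruby octal literals (leading 0) so they match the escapes in the question.

```ruby
# The three UTF-8 bytes of the first character, as octal integer literals.
bytes = [0354, 0203, 0201]

# 'C*' emits one raw byte per integer; relabelled as UTF-8 this is one character.
as_bytes = bytes.pack("C*").force_encoding("UTF-8")
as_bytes.bytesize       # => 3
as_bytes.length         # => 1

# 'U*' treats each integer as a code point and UTF-8-encodes it separately,
# so the same three integers become three unrelated two-byte characters.
as_codepoints = bytes.pack("U*")
as_codepoints.length    # => 3
as_codepoints.bytesize  # => 6

# 'U' is the right format only when you start from the code point itself.
[0xC0C1].pack("U") == as_bytes   # => true
```

The attempt in the question had a second problem on top of the format: the digits were read as decimal (354, 203, ...) rather than octal, so even the integers passed to pack were not the intended byte values.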