Converting integers to UTF-8 (Korean)

Problem description

I'm running Ruby 1.9.2 and trying to fix some broken UTF-8 text input. The text is literally "\\354\\203\\201\\355\\221\\234\\353\\252\\205", and I want to change it into the correct Korean, "상표명".

However, after searching for a while and trying a few methods, I still get gibberish. It's confusing, because the example with escaped characters in the code below works fine.

# encoding: utf-8
puts "상표명" # Target string
# Output: "상표명"

puts "\354\203\201\355\221\234\353\252\205" # Works with escaped characters like this
# Output: "상표명"

# Real input is a string
input = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"

# After some manipulation got it into an array of numbers
puts [354, 203,201,355,221,234,353,252,205].pack('U*').force_encoding('UTF-8')
# Output: ŢËÉţÝêšüÍ (gibberish)

I'm sure this must have been answered somewhere but I haven't managed to find it.

Recommended answer

This is what you want to do to get your UTF-8 Korean text:

s = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"
k = s.scan(/\d+/).map { |n| n.to_i(8) }.pack("C*").force_encoding('utf-8')
# "상표명"

Here's how it works:

  1. The input string is nice and regular, so we can use scan to pull out the individual numbers.
  2. Then a map with to_i(8) converts the octal values (as noted by Henning Makholm) to integers.
  3. Now we need to convert our list of integers to bytes, so we use pack('C*') to get a byte string. This string will have the BINARY encoding (AKA ASCII-8BIT).
  4. We happen to know that the bytes really do represent UTF-8, so we can force the issue with force_encoding('utf-8').
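
The steps above can be unrolled so that each intermediate value is visible. This is only a sketch of the same one-liner; the variable names (digits, bytes, raw, raw_encoding) are made up for illustration and are not part of the original answer.

```ruby
# encoding: utf-8
# Illustrative breakdown of the one-liner; variable names are hypothetical.
s = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"  # literal backslashes followed by digits

digits = s.scan(/\d+/)                 # step 1: ["354", "203", "201", ...]
bytes  = digits.map { |n| n.to_i(8) }  # step 2: read each group as octal
raw    = bytes.pack("C*")              # step 3: one byte per integer
raw_encoding = raw.encoding            # remember the tag before relabelling it
k = raw.force_encoding("utf-8")        # step 4: same bytes, now tagged as UTF-8

puts bytes.inspect      # [236, 131, 129, 237, 145, 156, 235, 170, 133]
puts raw_encoding       # the BINARY (ASCII-8BIT) encoding
puts k.valid_encoding?  # true
puts k
```

Note that force_encoding changes the encoding tag on the receiver itself and does not transcode any bytes, which is why raw_encoding is captured before the call.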

The main thing you were missing was the pack format. 'U' means "UTF-8 character" and expects an array of Unicode code points, each represented by a single integer. 'C' expects an array of bytes, and that's what we had.
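
To make the difference between the two formats concrete, here is a small comparison using only the first three bytes of the input (the ones that encode "상"). It is a minimal sketch, and the names as_codepoints and as_bytes are invented for the example.

```ruby
# encoding: utf-8
# Hypothetical comparison of the 'U' and 'C' pack formats on the same integers.
bytes = [0354, 0203, 0201]  # octal literals: 236, 131, 129

# 'U' treats every integer as a Unicode code point and encodes each one separately.
as_codepoints = bytes.pack("U*")

# 'C' treats every integer as a single byte; the three bytes together form one character.
as_bytes = bytes.pack("C*").force_encoding("utf-8")

puts as_codepoints.length   # 3
puts as_bytes.length        # 1
puts as_bytes == [0xC0C1].pack("U")  # true: 'U' is the right format for a real code point
```

So 'U' is not wrong in itself. It simply answers a different question: it would be the right choice if the input held code points such as 0xC0C1 rather than the individual bytes of the UTF-8 encoding.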
