Convert escaped unicode (\u008E) to accented character (Ž) in Ruby?

Question

I am having a very difficult time with this:

# contained within:
"MA\u008EEIKIAI"

# should be
"MAŽEIKIAI"

# nature of string
$ p string3
"MA\u008EEIKIAI" 

$ puts string3
MAEIKIAI

$ string3.inspect
"\"MA\\u008EEIKIAI\""

$ string3.bytes
#<Enumerator: "MA\u008EEIKIAI":bytes> 

Any ideas on where to start?

Note: this is not the same as my previous question.

Recommended answer

\u008E means that the Unicode character with code point 8E (in hex) appears at that point in the string. This character is the control character "SINGLE SHIFT TWO" (see the code chart (pdf)). The character Ž is at code point U+017D. However, it is at position 8E in the Windows CP-1252 encoding. Somehow you’ve got your encodings mixed up.
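
To make the mix-up concrete, here is a minimal sketch that compares the two code points. The helper name cp1252_byte_to_utf8 is hypothetical and exists only for illustration; it relies on nothing beyond Ruby's built-in String#force_encoding and String#encode.

```ruby
# Hypothetical helper: decode a single Windows-1252 byte into a UTF-8 string.
def cp1252_byte_to_utf8(byte)
  [byte].pack('C').force_encoding('Windows-1252').encode('UTF-8')
end

control = "\u008E"                    # what the string actually contains
letter  = cp1252_byte_to_utf8(0x8E)   # what was presumably intended

puts format('U+%04X', control.ord)              # code point of the control character
puts format('%s = U+%04X', letter, letter.ord)  # code point of the intended letter
```

The same numeric value, 8E, names a control character when read as a Unicode code point and the letter Ž when read as a CP-1252 byte.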

The easiest way to "fix" this is probably just to open the file containing the string (or the database record or whatever) and edit it to be correct. The real solution will depend on where the string in question came from and how many bad strings you have.

Assuming the string is in UTF-8 encoding, \u008E will consist of the two bytes c2 and 8e. Note that the second byte, 8e, is the same as the encoding of Ž in CP-1252. One way to convert the string would be something like this:

string3.force_encoding('BINARY') # treat the string just as bytes for now
string3.gsub!(/\xC2/n, '')       # remove the C2 byte
string3.force_encoding('CP1252') # give the string the correct encoding
string3.encode!('UTF-8')         # convert in place to the desired encoding
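
The same steps can also be wrapped in a method that leaves its argument untouched. This is only a sketch: the name strip_c2_and_decode_cp1252 is made up for illustration, and it assumes the input is a UTF-8 string mangled in exactly the way described above.

```ruby
# Hypothetical helper: the byte-stripping steps above, without mutating the input.
def strip_c2_and_decode_cp1252(str)
  bytes = str.b                          # binary (ASCII-8BIT) copy of the string
  bytes = bytes.gsub(/\xC2/n, '')        # remove every C2 byte
  bytes.force_encoding('Windows-1252')   # reinterpret what is left as CP-1252
  bytes.encode('UTF-8')                  # return a new UTF-8 string
end

mangled = "MA\u008EEIKIAI"
p mangled.bytes.map { |b| format('%02x', b) }  # the list includes the c2 8e pair
puts strip_c2_and_decode_cp1252(mangled)
```

Returning a new string rather than mutating string3 makes it easier to compare the value before and after the conversion.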

Note that this isn’t a general solution to fix all issues like this. Not all CP-1252 characters, when mangled and expressed in UTF-8 this way, will be amenable to conversion like this. Some will be two bytes c2 xx, where xx is the correct byte (as in this case); others will be c3 yy, where yy is a different byte.
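
The limitation can be seen with a character such as é, whose UTF-8 form is c3 a9: the byte-stripping approach corrupts it. The sketch below also shows one possible alternative, which is not part of the answer above. It assumes the damage came from CP-1252 bytes being read as ISO-8859-1, and it raises an error for text containing characters outside ISO-8859-1 or bytes that CP-1252 leaves undefined. Both method names are hypothetical.

```ruby
# Hypothetical helper: the byte-stripping approach, repeated so this block runs alone.
def strip_c2_and_decode_cp1252(str)
  bytes = str.b.gsub(/\xC2/n, '')        # binary copy with every C2 byte removed
  bytes.force_encoding('Windows-1252')   # reinterpret what is left as CP-1252
  bytes.encode('UTF-8')
end

# Hypothetical alternative, assuming CP-1252 bytes were misread as ISO-8859-1:
# map each character back to its single byte, then decode those bytes as CP-1252.
def reinterpret_latin1_as_cp1252(str)
  str.encode('ISO-8859-1').force_encoding('Windows-1252').encode('UTF-8')
end

puts strip_c2_and_decode_cp1252("caf\u00E9")         # the c3 a9 pair is misread as two characters
puts reinterpret_latin1_as_cp1252("caf\u00E9")       # é survives the round trip
puts reinterpret_latin1_as_cp1252("MA\u008EEIKIAI")  # the control character becomes Ž
```

The round trip works for é because CP-1252 and ISO-8859-1 agree on bytes a0 through ff and differ only in the 80 to 9f range.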
