Ruby converting string encoding from ISO-8859-1 to UTF-8 not working
Problem description
I am trying to convert a string from ISO-8859-1 encoding to UTF-8 but I can't seem to get it to work. Here is an example of what I have done in irb.
irb(main):050:0> string = 'Norrlandsvägen'
=> "Norrlandsvägen"
irb(main):051:0> string.force_encoding('iso-8859-1')
=> "Norrlandsv\xC3\xA4gen"
irb(main):052:0> string = string.encode('utf-8')
=> "NorrlandsvÃ¤gen"
I am not sure why Norrlandsvägen in iso-8859-1 gets converted into NorrlandsvÃ¤gen in utf-8.
I have tried encode, encode!, encode(destinationEncoding, originalEncoding), iconv, force_encoding, and all kinds of weird workarounds I could think of, but nothing seems to work. Can someone please help me or point me in the right direction?
Ruby newbie still pulling my hair out like crazy, but grateful for all the replies here... :)
Background of this question: I am writing a gem that will download an XML file from some websites (which will have iso-8859-1 encoding) and save it in storage, and I would like to convert it to utf-8 first. But words like Norrlandsvägen keep tripping me up. Any help would be greatly appreciated!
[UPDATE]: I realized that running tests like this in the irb console might give me different behavior, so here is what I have in my actual code:
def convert_encoding(string, originalEncoding)
puts "#{string.encoding}" # ASCII-8BIT
string.encode(originalEncoding)
puts "#{string.encoding}" # still ASCII-8BIT
string.encode!('utf-8')
end
but the last line gives me the following error:
Encoding::UndefinedConversionError - "\xC3" from ASCII-8BIT to UTF-8
Thanks to @Amadan's answer below, I noticed that \xC3 actually shows up in irb if you run:
irb(main):001:0> string = 'ä'
=> "ä"
irb(main):002:0> string.force_encoding('iso-8859-1')
=> "\xC3\xA4"
I have also tried assigning the result of string.encode(originalEncoding) to a new variable, but got an even weirder error:
newString = string.encode(originalEncoding)
puts "#{newString.encoding}" # can't even get to this line...
newString.encode!('utf-8')
and the error is Encoding::UndefinedConversionError - "\xC3" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to ISO-8859-1
I am still quite lost in all of this encoding mess but I am really grateful for all the replies and help everyone has given me! Thanks a ton! :)
Recommended answer
You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.
string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]
Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. The string no longer contains ä. It contains two characters, Ã and ¤.
string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]
Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:
string = string.encode('utf-8')
# => "Ã¤"
string.length
# 2
string.bytes
# [195, 131, 194, 164]
What you are missing is that you never had an ISO-8859-1 string to begin with, as you would from your web service; you had gibberish. Fortunately, this only affects your console tests; if you read the response of the website using the proper input encoding, it should all work okay.
For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:
string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"
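The console test above still begins with a UTF-8 literal. A minimal sketch that is closer to the real situation starts from raw bytes instead, the way they would come off the network. It assumes the bytes really are ISO-8859-1; the variable names are illustrative only.

```ruby
# Raw bytes as an ISO-8859-1 source might send them: 0xE4 is "ä" in Latin-1.
# String#b returns a copy tagged ASCII-8BIT (binary).
raw = "Norrlandsv\xE4gen".b

# Step 1: relabel the bytes with their real encoding. No bytes change here.
latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)

# Step 2: transcode to UTF-8. The single byte 0xE4 becomes the pair 0xC3 0xA4.
utf8 = latin1.encode(Encoding::UTF_8)

puts utf8
puts utf8.bytes.inspect
```

Calling encode directly on the binary string would raise Encoding::UndefinedConversionError, because a byte such as 0xE4 has no defined character in ASCII-8BIT.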
EDIT: For your specific problem, this should work:
require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  # VERIFY_NONE disables certificate verification; acceptable for a quick test only
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
# The body typically arrives tagged ASCII-8BIT: label it as ISO-8859-1, then transcode to UTF-8
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
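The same two steps can be wrapped into a replacement for the convert_encoding method from the question. This is only a sketch: convert_to_utf8 is a hypothetical name, and it assumes the caller knows the real source encoding of the bytes.

```ruby
# Hypothetical helper; the name and signature are illustrative, not part of any gem.
# bytes:           the string as received, typically tagged ASCII-8BIT
# source_encoding: the encoding the bytes are actually in, e.g. 'ISO-8859-1'
def convert_to_utf8(bytes, source_encoding)
  # dup so the caller's string is not relabelled in place,
  # then relabel (force_encoding) and transcode (encode)
  bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

# Stand-in for response.body: binary bytes that are really ISO-8859-1.
body = "Norrlandsv\xE4gen".b
converted = convert_to_utf8(body, 'ISO-8859-1')
puts converted
```

The method in the question failed because string.encode(originalEncoding) asks Ruby to transcode from ASCII-8BIT, and also because encode without the bang returns a new string rather than changing the receiver.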