Character Encoding issue in Rails v3/Ruby 1.9.2


Question

I sometimes get the error "invalid byte sequence in UTF-8" when I read the contents of a file. Note that this only happens when the string contains special characters. I have tried opening the file without "r:UTF-8", but I still get the same error.

open(file, "r:UTF-8").each_line { |line| puts line.strip(",") } # line.strip generates the error

File contents:

# encoding: UTF-8
290919,"SE","26","Sk‰l","",59.4500,17.9500,, # this errors out
290956,"CZ","45","HornÌ Bradlo","",49.8000,15.7500,, # this errors out
290958,"NO","02","Svaland","",58.4000,8.0500,, # this works

This is a CSV file I received from an outside source, and I am trying to import it into my DB. It did not come with "# encoding: UTF-8" at the top; I added that line because I read somewhere that it would fix this problem, but it did not. :(

Environment:

  • Rails v3.0.3
  • ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.5.0]

Recommended answer

Ruby has a notion of an external encoding and an internal encoding for each file. This allows you to work with a file as UTF-8 in your program even when the file is stored in a more esoteric encoding. If your default external encoding is UTF-8 (which it typically is if you're on Mac OS X), all of your file I/O is going to be in UTF-8 as well. You can check this using File.open('file').external_encoding. When you open your file and pass "r:UTF-8", you are forcing the same external encoding that Ruby is already using by default.
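To see these defaults on your own machine, a minimal sketch along the following lines may help. The temp file name "encoding_demo" is an assumption made for illustration; everything else is core Ruby.

```ruby
require "tempfile"

# Hypothetical sample file, created only for this illustration.
sample = Tempfile.new("encoding_demo")
sample.write("plain ascii\n")
sample.close

# Encoding assumed for files when none is specified (usually taken from the locale).
default_external = Encoding.default_external

# With no encoding in the mode string, the file handle picks up that default.
implicit_encoding = File.open(sample.path, "r") { |f| f.external_encoding }

# Naming the encoding in the mode string sets it regardless of the default.
explicit_encoding = File.open(sample.path, "r:UTF-8") { |f| f.external_encoding }

puts "default external:      #{default_external}"
puts "opened with \"r\":       #{implicit_encoding}"
puts "opened with \"r:UTF-8\": #{explicit_encoding}"
```

If the first two lines of output already say UTF-8, adding "r:UTF-8" changes nothing, which matches what the question describes.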

Chances are, your source document isn't in UTF-8 and those non-ASCII characters aren't valid UTF-8 byte sequences. (If the bytes happened to be valid UTF-8, you would get either the correct characters and no error, or, if they mapped incorrectly, the wrong characters and still no error.) What you should do is try to determine the encoding of the source document, then have Ruby transcode the document on read, like so:

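# windows-1251 is an example source encoding; lines are transcoded to UTF-8 on read.
File.open(file, "r:windows-1251:utf-8") { |f| f.each_line { |line| puts line.strip } } # String#strip takes no arguments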
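The effect of the two-part mode string can be checked with a small self-contained sketch. It assumes, purely for illustration, a row stored as ISO-8859-1; the real file's encoding still has to be determined, and the temp file name is hypothetical.

```ruby
require "tempfile"

# Hypothetical sample row: "Skål" stored as ISO-8859-1, where å is the single byte 0xE5.
row = "290919,\"SE\",\"26\",\"Sk\u00E5l\"\n".encode("ISO-8859-1")

sample = Tempfile.new("latin1_demo")
sample.close
File.binwrite(sample.path, row)   # write the bytes untouched

# Reading as UTF-8 does not fail at read time, but the resulting string is not
# valid UTF-8, so later string operations can raise "invalid byte sequence in UTF-8".
as_utf8 = File.open(sample.path, "r:UTF-8") { |f| f.read }
puts as_utf8.valid_encoding?

# Declaring the source encoding plus an internal encoding transcodes on read.
transcoded = File.open(sample.path, "r:ISO-8859-1:UTF-8") { |f| f.read }
puts transcoded.encoding
puts transcoded.valid_encoding?
puts transcoded
```

The first part of the mode string after "r:" names the encoding of the bytes on disk, and the second names the encoding your program wants to work in.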

If you need help determining the encoding of the source, give this Python library a whirl. It's based on the automatic charset detection fallback that was in Seamonkey/Mozilla (and is possibly still in Firefox).
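A rough first check can also be done in Ruby itself. The sketch below is not that library and does no real charset detection: it only tests whether the bytes are valid UTF-8 and otherwise reinterprets them under a caller-supplied guess. The helper name to_utf8 and the ISO-8859-1 fallback are assumptions made for illustration.

```ruby
# Hypothetical helper: returns a UTF-8 copy of raw bytes.
# It only checks UTF-8 validity; the fallback encoding is a guess supplied by the caller.
def to_utf8(bytes, fallback = "ISO-8859-1")
  candidate = bytes.dup.force_encoding("UTF-8")
  return candidate if candidate.valid_encoding?

  # Single-byte encodings accept almost any byte sequence, so this step cannot
  # confirm the guess; it can only reinterpret the bytes under it.
  bytes.dup.force_encoding(fallback).encode("UTF-8")
end

latin1 = "Sk\u00E5l".encode("ISO-8859-1").b   # raw bytes, as if read in binary mode
utf8   = "Sk\u00E5l".b

puts to_utf8(latin1)
puts to_utf8(utf8)
```

A wrong fallback produces wrong characters rather than an error, so it is worth checking a few converted rows by eye before importing the whole file.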
