Clean up strange encoding in Ruby
I'm currently playing around a bit with CouchDB.
I'm trying to migrate some blog data from Redis (key-value store) to CouchDB (key-value store).
Seeing as I've probably migrated this data a gazillion times between different blogging engines (everybody has got to have a hobby :) ), there seem to be some encoding snafus.
I'm using CouchREST to access CouchDB from Ruby and I'm getting this:
<JSON::GeneratorError: source sequence is illegal/malformed>
The problem seems to be the body_html part of the object:
<Post:0x00000000e9ee18 @body_html="[.....]Wie Sie bereits wissen, m\xF6chte EUserv k\xFCnftig seine [...]
Those are supposed to be Umlauts ("möchte" and "künftig").
Any idea how to get rid of those problems? I tried some conversions using the Ruby 1.9 encoding features or iconv before inserting, but haven't had any luck yet :(
If I try to e.g. convert that stuff to ISO-8859-1 using the .encode() method of ruby 1.9, this is what happens (different text, same problem):
#<Encoding::UndefinedConversionError: "\xC6\x92" from UTF-8 to ISO-8859-1>
"I try to e.g. convert that stuff to ISO-8859-1"
Close. You actually want to do it the other way around: you've got ISO-8859-1(*), you want UTF-8(**). So str.encode('utf-8', 'iso-8859-1') would be more likely to do the trick.
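As a minimal sketch of that conversion, assuming the stored bytes really are ISO-8859-1 (the helper name latin1_to_utf8 and the sample string are illustrative, not part of the original data or of CouchREST):

```ruby
require 'json'

# Hypothetical helper (an assumption for illustration): interpret the bytes
# of +str+ as ISO-8859-1 and transcode them to UTF-8.
def latin1_to_utf8(str)
  str.encode('UTF-8', 'ISO-8859-1')
end

# Sample bytes modelled on the question: labelled UTF-8, but 0xF6 and 0xFC
# on their own are not valid UTF-8 sequences.
raw = "m\xF6chte EUserv k\xFCnftig seine".dup.force_encoding('UTF-8')
puts raw.valid_encoding?

fixed = latin1_to_utf8(raw)
puts fixed.valid_encoding?
puts fixed

# Once the string is valid UTF-8 it can be serialised as JSON.
puts JSON.generate({ 'body_html' => fixed })
```

The second argument to encode names the encoding the bytes are actually in, so the (wrong) UTF-8 label on the string is ignored during the conversion.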
*: Actually, you might well have Windows code page 1252, which is like ISO-8859-1 but with extra smart quotes and other characters in the range 0x80-0x9F, which ISO-8859-1 uses for control codes. If so, use 'cp1252' instead.
**: Well, you probably do. Working with UTF-8 is the best way forward, so you can store all possible characters. If you really want to keep working in ISO-8859-1/cp1252, then presumably the problem is just that Ruby has mis-guessed the character set in use, and you can fix it by calling str.force_encoding('iso-8859-1').
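The difference between the two calls in the footnotes can be sketched as follows. This is an illustration under the assumption that the input is single-byte Latin text; the sample strings are made up. force_encoding only changes the label on the string and leaves the bytes alone, while encode rewrites the bytes. The last part shows why 'cp1252' matters for bytes in the 0x80-0x9F range.

```ruby
# Raw bytes with no meaningful label (String#b returns a binary copy).
bytes = "k\xFCnftig".b

# force_encoding relabels the string; the bytes are untouched.
relabelled = bytes.dup.force_encoding('ISO-8859-1')
puts relabelled.bytes == bytes.bytes
puts relabelled.valid_encoding?

# encode transcodes: the single byte 0xFC becomes the two UTF-8 bytes 0xC3 0xBC.
utf8 = relabelled.encode('UTF-8')
p utf8.bytes

# Bytes 0x93 and 0x94 are curly quotes in cp1252 but control codes in ISO-8859-1.
quoted = "\x93Hallo\x94".b
as_cp1252 = quoted.dup.force_encoding('cp1252').encode('UTF-8')
as_latin1 = quoted.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts as_cp1252 == "\u201CHallo\u201D"
puts as_latin1 == "\u0093Hallo\u0094"
```

If the converted text shows invisible control characters where quotes or dashes should be, that is a hint the source was cp1252 and not ISO-8859-1.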