ruby 1.9, force_encoding, but check


Question

I have a string I have read from some kind of input.

To the best of my knowledge, it is UTF-8. Okay:

string.force_encoding("utf-8")

But if this string has bytes in it that are not in fact legal UTF-8, I want to know now and take action.

Ordinarily, will force_encoding("utf-8") raise if it encounters such bytes? I believe it will not.

If I were doing an #encode, I could choose from the handy options for what to do with characters that are invalid in the source encoding (or undefined in the destination encoding).

But I'm not doing an #encode, I'm doing a #force_encoding, and it has no such options.

Would it make sense to do

string.force_encoding("utf-8").encode("utf-8")

to get an exception right away? Normally, encoding from UTF-8 to UTF-8 doesn't make any sense. But maybe this is the way to get it to raise right away if there are invalid bytes? Or to use the :replace option etc. to do something different with invalid bytes?

But no, can't seem to make that work either.

Anyone know?

1.9.3-p0 :032 > a = "bad: \xc3\x28 okay".force_encoding("utf-8")
=> "bad: \xC3( okay"
1.9.3-p0 :033 > a.valid_encoding?
=> false
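
Since valid_encoding? already reports the problem, the "know now and take action" part can be sketched with a small wrapper. This is only an illustration: ensure_utf8! is a hypothetical helper name, not something Ruby provides, and the choice of ArgumentError is an assumption.

```ruby
# Hypothetical helper (not part of Ruby): tag the bytes as UTF-8 and
# raise right away if they do not form a valid UTF-8 sequence.
def ensure_utf8!(input)
  str = input.dup.force_encoding("utf-8")
  raise ArgumentError, "invalid UTF-8 byte sequence" unless str.valid_encoding?
  str
end

# \xC3\xA9 is a well-formed two-byte sequence, so this is accepted.
good = ensure_utf8!("okay: \xC3\xA9")

# \xC3 followed by "(" is not well-formed, so this is rejected.
rejected = nil
begin
  ensure_utf8!("bad: \xC3\x28 okay")
rescue ArgumentError => e
  rejected = e.message
end
```

This only detects the problem; it does not say where the bad bytes are or replace them.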

Okay, but how do I find and eliminate those bad bytes? Oddly, this does NOT raise:

1.9.3-p0 :035 > a.encode("utf-8")
 => "bad: \xC3( okay"

If I was converting to a different encoding, it would!

1.9.3-p0 :039 > a.encode("ISO-8859-1")
Encoding::InvalidByteSequenceError: "\xC3" followed by "(" on UTF-8

Or, if I told it to, it would replace the bad byte with a "?":

1.9.3-p0 :040 > a.encode("ISO-8859-1", :invalid => :replace)
=> "bad: ?( okay"

So Ruby has the smarts to know which bytes are bad in UTF-8, and to replace them with something else, but only when converting to a different encoding. I don't want to convert to a different encoding; I want to stay in UTF-8. I might want to raise if there's an invalid byte in there, or I might want to replace invalid bytes with replacement characters.

Isn't there some way to get ruby to do this?

Update: I believe this has finally been added to Ruby in 2.1, with String#scrub present in the 2.1 preview release to do this. So look for that!
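
For readers on Ruby 2.1 or later, here is a minimal sketch of what String#scrub does with the same bad string. The variable names are just for illustration.

```ruby
# Requires Ruby 2.1 or later, where String#scrub is available.
a = "bad: \xC3\x28 okay".dup.force_encoding("utf-8")

replaced = a.scrub        # invalid bytes become U+FFFD by default
marked   = a.scrub("?")   # or any replacement string you pass in
removed  = a.scrub("")    # or drop the bad bytes entirely

# With a block, scrub yields each run of invalid bytes so you can decide
# what to substitute; here they are shown as hex.
hexed = a.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }
```

There is also String#scrub!, which modifies the receiver in place.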

Solution

(update: see https://github.com/jrochkind/scrub_rb)

So I coded up a solution to what I needed here: https://github.com/jrochkind/ensure_valid_encoding/blob/master/lib/ensure_valid_encoding.rb

But only much more recently did I realize this actually IS built into the stdlib, you just need to, somewhat counter-intuitively, pass 'binary' as the "source encoding":

a = "bad: \xc3\x28 okay".force_encoding("utf-8")
a.encode("utf-8", "binary", :undef => :replace)
=> "bad: �( okay"

Yep, that's exactly what I wanted. So turns out this IS built into 1.9 stdlib, it's just undocumented and few people know it (or maybe few people that speak English know it?). Although I saw these arguments used this way on a blog somewhere, so someone else knew it!
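
One caveat worth checking before relying on the 'binary' trick: with binary as the source encoding, every byte of 0x80 or above is undefined, so well-formed non-ASCII characters get replaced along with the genuinely bad bytes. The sketch below compares it with String#scrub (Ruby 2.1+) on a string that mixes a valid "é" with a stray \xC3; the sample string is made up for illustration.

```ruby
# A valid two-byte "é" (\xC3\xA9) followed by a stray \xC3 before "(".
mixed = "caf\xC3\xA9 \xC3\x28".dup.force_encoding("utf-8")

# Treating the bytes as binary replaces every high byte, valid or not.
via_binary = mixed.encode("utf-8", "binary", :undef => :replace)

# scrub (Ruby 2.1+) replaces only the byte that is actually invalid.
via_scrub = mixed.scrub
```

So the trick looks safe when the valid part of the input is pure ASCII, as in the example above, but it loses data as soon as the input contains legitimate multibyte characters.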
