ruby 1.9: invalid byte sequence in UTF-8
I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites.
When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (major speedup). The problem is that I now receive a lot of "invalid byte sequence in UTF-8" errors.
From what I understand, the net/http library doesn't have any encoding-specific options, and the data that comes in is basically not properly tagged.
What would be the best way to actually work with that incoming data? I tried .encode with the replace and invalid options set, but no success so far...
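For reference, the failure can be reproduced without any network access. This is a minimal sketch: the HTML fragment and the `try_scan` helper are made up for illustration, and the stray 0xE9 byte stands in for Latin-1 content that was labelled as UTF-8.

```ruby
# Hypothetical page fragment: plain ASCII plus a lone 0xE9 byte (Latin-1 "e acute"),
# which is not a valid UTF-8 sequence on its own.
html = "<a href=\"/caf\xE9\">menu</a>".dup.force_encoding('UTF-8')

# Illustrative helper: run the same scan as in the question and
# return the error message instead of raising.
def try_scan(str)
  str.scan(/href="(.*?)"/i).flatten
rescue ArgumentError => e
  e.message
end

valid  = html.valid_encoding?  # tells you up front whether the bytes are well-formed
result = try_scan(html)        # the regex engine refuses to match a broken string

puts valid
puts result.inspect
```

Checking `valid_encoding?` before scanning is a cheap way to decide whether a page needs cleaning at all.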
In Ruby 1.9.3 it is possible to use String#encode to "ignore" the invalid UTF-8 sequences. Here is a snippet that will work both in 1.8 (iconv) and 1.9 (String#encode):
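# Ruby 1.8 has no String#encode, so fall back to iconv there.
require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  # 1.9+: replace invalid byte sequences while converting UTF-8 to UTF-8
  file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
  # 1.8: //IGNORE tells iconv to skip bytes it cannot convert
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end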
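To see what the `:invalid => :replace` option does on the String#encode branch, here is a small sketch with a made-up sample string (the names `sample`, `cleaned` and `stripped` are illustrative). One assumption worth flagging: some older interpreters are reported to treat a conversion whose source and target encodings are identical as a no-op, so it seems safer to check `valid_encoding?` afterwards than to trust the call blindly.

```ruby
# Hypothetical sample: "caf", a stray Latin-1 byte (0xE9), then "!".
sample = "caf\xE9!".dup.force_encoding('UTF-8')

# Without :replace, each invalid sequence becomes U+FFFD,
# the Unicode replacement character.
cleaned = sample.encode('UTF-8', 'UTF-8', :invalid => :replace)

# With :replace => '', invalid sequences are dropped instead.
stripped = sample.encode('UTF-8', 'UTF-8', :invalid => :replace, :replace => '')

# Verify rather than assume: if the interpreter skipped the same-encoding
# conversion, the result is still invalid and needs a different approach.
puts cleaned.valid_encoding?
puts stripped.valid_encoding?
```

If either check prints `false`, the conversion was skipped and the string is unchanged.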
If the input is really troublesome, you can instead do a double conversion from UTF-8 to UTF-16 and back to UTF-8:
require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  # 1.9+: drop invalid bytes on the way to UTF-16, then convert back
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  # 1.8: same iconv fallback as above
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
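Applied to the crawler from the question, the round trip could be wrapped up roughly as follows. This is only a sketch for Ruby 1.9 and later: `sanitize_utf8` and `extract_links` are hypothetical helper names, the sample body is invented, and it assumes the pages are meant to be UTF-8 (content in other encodings would lose its non-ASCII characters).

```ruby
# Hypothetical helper: treat the raw bytes as UTF-8 and, if they are not
# well-formed, force a real transcoding step through UTF-16 that drops
# every invalid sequence.
def sanitize_utf8(raw)
  str = raw.dup.force_encoding('UTF-8')
  return str if str.valid_encoding?
  str.encode('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
     .encode('UTF-8', 'UTF-16')
end

# Hypothetical helper: the scan from the question, run on cleaned input.
def extract_links(html)
  sanitize_utf8(html).scan(/href="(.*?)"/i).flatten
end

# Invented sample: one stray 0xE9 byte between two links; the second link
# contains a correctly encoded "e acute" (0xC3 0xA9) that must survive.
body  = "<a href=\"/ok\">x</a> \xE9 <a HREF=\"/caf\xC3\xA9\">y</a>"
links = extract_links(body)

puts links.inspect
```

Checking `valid_encoding?` first keeps the common case cheap, since well-formed pages skip the double conversion entirely.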