ruby 1.9: invalid byte sequence in UTF-8


Problem description


I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites.

When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (major speedup). The problem is that I now get a lot of "invalid byte sequence in UTF-8" errors.

From what I understand, the net/http library doesn't have any encoding-specific options, and the data that comes in is basically not properly tagged with an encoding.

What would be the best way to actually work with that incoming data? I tried .encode with the :replace and :invalid options set, but no success so far...

Solution

In Ruby 1.9.3 it is possible to use String#encode to "ignore" the invalid UTF-8 sequences. Here is a snippet that will work both in 1.8 (iconv) and 1.9 (String#encode):

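require 'iconv' unless String.method_defined?(:encode)  # iconv is only needed on 1.8
if String.method_defined?(:encode)
  # Ruby 1.9: ask encode! to replace invalid bytes (default replacement for UTF-8 is U+FFFD)
  file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
  # Ruby 1.8: //IGNORE tells iconv to drop bytes it cannot convert
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end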
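
To check whether a given page body needs cleaning at all, one option is to re-tag it as UTF-8 and ask valid_encoding? first. The sketch below is not part of the original answer: to_valid_utf8 is a hypothetical helper name, and it assumes the incoming bytes are meant to be UTF-8. It uses String#scrub where that method exists (Ruby 2.1 and later) and otherwise falls back to the same encode call as in the snippet above.

```ruby
# Hypothetical helper (the name is an assumption, not from the original answer).
# Returns a copy of +raw+ tagged as UTF-8, with invalid bytes replaced by U+FFFD.
def to_valid_utf8(raw)
  str = raw.dup.force_encoding('UTF-8')  # re-tag only; the bytes are unchanged
  return str if str.valid_encoding?      # nothing to repair

  if str.respond_to?(:scrub)
    str.scrub                            # Ruby 2.1+: replaces invalid bytes
  else
    # Older Rubies: same call as in the snippet above
    str.encode('UTF-8', 'UTF-8', :invalid => :replace)
  end
end

body  = "caf\xE9 au lait"                # "\xE9" is Latin-1, not valid UTF-8
clean = to_valid_utf8(body)
puts clean.valid_encoding?
```

Strings that are already valid are returned untouched, so the helper can be applied to every response without changing well-formed pages.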

If the input is really troublesome, you can instead do a double conversion from UTF-8 to UTF-16 and back to UTF-8:

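require 'iconv' unless String.method_defined?(:encode)  # iconv is only needed on 1.8
if String.method_defined?(:encode)
  # Ruby 1.9: convert to UTF-16, dropping invalid bytes (:replace => ''), then back to UTF-8
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  # Ruby 1.8: //IGNORE tells iconv to drop bytes it cannot convert
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end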
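
As a rough illustration of how the double conversion fits the crawler from the question, the sketch below cleans a page body and then runs the same .scan for links. The helper name strip_invalid_utf8 and the sample HTML are made up for this example; it only covers the String#encode branch and assumes the bytes are meant to be UTF-8.

```ruby
# Hypothetical helper (the name is an assumption): drops bytes that are not
# valid UTF-8 by converting to UTF-16 and back, as in the snippet above.
def strip_invalid_utf8(raw)
  utf16 = raw.encode('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  utf16.encode('UTF-8', 'UTF-16')
end

# Extracts href values with the regular expression from the question.
def extract_links(raw_html)
  strip_invalid_utf8(raw_html).scan(/href="(.*?)"/i).flatten
end

# Made-up sample: "\xE9" is a stray Latin-1 byte inside otherwise UTF-8 text.
html = "<a href=\"http://example.com/\">caf\xE9</a> <a HREF=\"/about\">x</a>"
puts extract_links(html).inspect
```

Because the invalid bytes are removed rather than replaced, the cleaned text can differ from what the site intended to show; for link extraction that is usually acceptable.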
