Ruby 1.9: invalid byte sequence in UTF-8


Question

I'm writing a crawler in Ruby (1.9) that consumes a lot of HTML from many random sites.
When extracting links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (a major speedup). The problem is that I now get a lot of "invalid byte sequence in UTF-8" errors.
From what I understand, the net/http library doesn't have any encoding-specific options, and the incoming data is basically not properly tagged.
What would be the best way to work with that incoming data? I tried .encode with the :replace and :invalid options set, but with no success so far...

Recommended answer

In Ruby 1.9.3 it is possible to use String#encode to "ignore" invalid UTF-8 sequences. Here is a snippet that works in both 1.8 (iconv) and 1.9 (String#encode):

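# Iconv is only needed on 1.8, where String#encode does not exist.
require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  # 1.9: :invalid => :replace swaps each invalid sequence for the replacement
  # character (U+FFFD for UTF-8); add :replace => '' to drop it instead.
  file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
  # 1.8: //IGNORE tells iconv to skip bytes it cannot convert.
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end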
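
As a quick sanity check, the sketch below compares the two String#encode variants on a made-up sample string and uses String#valid_encoding? to confirm the result. The variable names and the sample input are assumptions for illustration, not part of the original answer, and it is worth running on your own Ruby version to see whether the same-encoding call actually changes the string there.

```ruby
# Made-up sample: "\xE9" is a lone Latin-1 byte, which is not valid UTF-8.
raw = "caf\xE9 au lait".dup.force_encoding('UTF-8')
puts raw.valid_encoding?

# Variant 1: invalid bytes are swapped for the default replacement character.
replaced = raw.encode('UTF-8', 'UTF-8', :invalid => :replace)

# Variant 2: an empty replacement string drops the invalid bytes entirely.
dropped = raw.encode('UTF-8', 'UTF-8', :invalid => :replace, :replace => '')

# If the cleanup took effect, both results report a valid encoding.
puts replaced.valid_encoding?
puts dropped.valid_encoding?
puts dropped
```

If either result still reports an invalid encoding, the double conversion through UTF-16 is the fallback to try.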

Or, if you have really troublesome input, you can do a double conversion from UTF-8 to UTF-16 and back to UTF-8:

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  # 1.9: drop invalid bytes while converting to UTF-16, then convert back.
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  # 1.8: same iconv fallback as above.
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
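
To tie this back to the crawler in the question, here is a minimal sketch that wraps the double conversion in a helper and then runs the same .scan on the cleaned string. The helper name sanitize_utf8 and the sample HTML are hypothetical, chosen only for illustration, and the sketch covers just the String#encode branch (Ruby 1.9 and later).

```ruby
# Hypothetical helper (not from the original answer): returns a copy of +raw+
# with invalid UTF-8 byte sequences dropped via a UTF-16 round trip.
def sanitize_utf8(raw)
  utf16 = raw.encode('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  utf16.encode('UTF-8', 'UTF-16')
end

# Made-up sample page: the "\xE9" byte inside the URL is not valid UTF-8,
# so calling .scan on this string directly would raise ArgumentError.
html = "<a href=\"http://example.com/caf\xE9\">link</a>".dup.force_encoding('UTF-8')

clean = sanitize_utf8(html)

# The regex from the question now runs without an encoding error.
links = clean.scan(/href="(.*?)"/i).flatten
puts links.inspect
```

Returning a new string instead of calling encode! in place keeps the raw response body untouched, which can help when debugging which sites send badly tagged data.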
