How to convert a Net::HTTP response to a certain encoding in Ruby 1.9.1?


Problem description


I have a Sinatra application (http://analyzethis.espace-technologies.com) that does the following:

  1. Retrieve an HTML page (via net/http)
  2. Create a Nokogiri document from the response.body
  3. Extract some info and send it back in the response. The response should be UTF-8 encoded

I ran into a problem while trying to read sites that use the windows-1256 encoding, such as www.filfan.com or www.masrawy.com.

The problem is that the result of the encoding conversion is incorrect, although no errors are raised.

With net/http, response.body.encoding gives ASCII-8BIT, which cannot be converted to UTF-8.

If I do Nokogiri::HTML(response.body) and use CSS selectors to get certain content from the page (say, the content of the title tag), I get a string whose string.encoding returns WINDOWS-1256. I call string.encode("utf-8") and send the response using that, but again the response is not correct.

Any suggestions or ideas about what's wrong in my approach?

Solution

This happens because Net::HTTP does not handle encodings correctly. See http://bugs.ruby-lang.org/issues/2567

Instead of parsing the whole response.body, you can parse response['content-type'], which contains the charset.
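As a rough sketch of that step, the charset can be pulled out of the header value with a regular expression. The helper name `charset_from_content_type` is made up for illustration, and the sample header string stands in for what `response['content-type']` might return; real headers can be missing the charset entirely, in which case this returns nil.

```ruby
# Hypothetical helper (not part of Net::HTTP): extract the charset
# parameter from a Content-Type header value, or nil if there is none.
def charset_from_content_type(content_type)
  return nil if content_type.nil?
  match = content_type.match(/charset\s*=\s*"?([^\s;"]+)"?/i)
  match && match[1]
end

# Assumed sample value; in the application this would be response['content-type'].
header = "text/html; charset=windows-1256"
puts charset_from_content_type(header).inspect
```

When the header carries no charset, some other source (for example a sensible default for the site) has to be chosen by the caller.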

Then use force_encoding() to set the right encoding, for example response.body.force_encoding("UTF-8") if the site is served in UTF-8.
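For a site served in windows-1256, the same idea would probably look like the sketch below: force_encoding only relabels the bytes, so a separate encode call is still needed to actually transcode them to UTF-8. The helper name `body_to_utf8` is hypothetical, and the single byte used as input is a stand-in for a real response.body.

```ruby
# Hypothetical helper: label the raw bytes with their real charset,
# then transcode the result to UTF-8.
def body_to_utf8(raw_body, charset)
  source = Encoding.find(charset) # raises ArgumentError for unknown names
  # dup so the caller's string is not relabelled in place
  raw_body.dup.force_encoding(source).encode(Encoding::UTF_8)
end

# Stand-in for response.body: one byte, tagged ASCII-8BIT as Net::HTTP would.
# 0xE3 is the Arabic letter meem in windows-1256.
raw = "\xE3".dup.force_encoding(Encoding::ASCII_8BIT)
utf8 = body_to_utf8(raw, "windows-1256")

puts utf8.encoding        # UTF-8
puts utf8.valid_encoding? # true
```

Calling encode directly on the ASCII-8BIT string would not work for bytes above 0x7F, because Ruby has no mapping from raw binary to UTF-8; the relabelling step has to come first.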
