Nokogiri, open-uri, and Unicode Characters

Question

I'm using Nokogiri and open-uri to grab the contents of the title tag on a webpage, but I'm having trouble with accented characters. What's the best way to deal with these? Here's what I'm doing:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open(link))
title = doc.at_css("title")

At this point, the title looks like this:

Rag\303\271

Instead of:

Ragù

How can I have Nokogiri return the proper character (e.g. ù in this case)?

Here's an example URL:

http://www.epicurious.com/recipes/food/views/Tagliatelle-with-Duck-Ragu-242037

Recommended answer

When you say "looks like this," are you viewing this value in IRB? IRB escapes characters outside the ASCII range with C-style escapes of the byte sequences that represent them.
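To make the escaping concrete, here is a minimal sketch of how an accented character turns into a backslash sequence. The helper `octal_escape` is hypothetical (it is not part of Ruby or Nokogiri) and only mimics the octal escapes that Ruby 1.8's `inspect` produced; newer Ruby versions typically print valid UTF-8 unescaped.

```ruby
# Hypothetical helper (not part of Ruby or Nokogiri): render every non-ASCII
# byte as a C-style octal escape, the way Ruby 1.8's String#inspect did.
def octal_escape(str)
  str.bytes.map { |b| b < 128 ? b.chr : format('\\%03o', b) }.join
end

title = "Rag\u00F9"          # "Ragù"; ù is U+00F9

puts title.bytes.inspect     # the UTF-8 bytes; ù takes two of them
puts octal_escape(title)     # escaped view, similar to what old IRB showed
puts title                   # puts writes the raw bytes; a UTF-8 terminal shows ù
```

The escaped form and the printed form are the same string; only the display differs.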

If you print them with puts, you'll get them back as you expect, presuming your shell console is using the same encoding as the string in question (apparently UTF-8 in this case, based on the two bytes returned for that character). If you are storing the values in a text file, printing to a file handle should also produce UTF-8 sequences.
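As a rough check of both points, the sketch below inspects the encoding label on a string and writes it to a file, then reads the raw bytes back. It assumes Ruby 1.9 or later (where strings carry an encoding); the literal title and the file name `title.txt` are stand-ins, not values from the question.

```ruby
require 'tmpdir'

title = "Rag\u00F9"                 # stand-in for the scraped title

puts title.encoding                 # encoding label Ruby attached to the string
puts title.valid_encoding?          # whether the bytes are valid for that label

written = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, 'title.txt')   # hypothetical file name
  # Open the handle with an explicit UTF-8 external encoding and print to it.
  File.open(path, 'w:UTF-8') { |f| f.puts(title) }
  written = File.binread(path)         # read back the raw bytes, no decoding
end

p written.bytes                     # the two UTF-8 bytes for ù, then a newline
```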

If you need to translate between UTF-8 and other encodings, the specifics depend on whether you're using Ruby 1.9 or 1.8.6.

For 1.9, see http://blog.grayproductions.net/articles/ruby_19s_string. For 1.8, you probably need to look at Iconv.
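For Ruby 1.9 and later, a minimal sketch of the conversion might look like the following. It uses `String#encode` to transcode and `String#force_encoding` to relabel bytes whose encoding is known from elsewhere; the choice of ISO-8859-1 as the "other" encoding is only an assumption for illustration. (Iconv was removed from the standard library in Ruby 2.0, so it applies only to older versions.)

```ruby
utf8 = "Rag\u00F9"                       # UTF-8 string, ù stored as two bytes

# Transcode: the characters stay the same, the bytes change.
latin1 = utf8.encode('ISO-8859-1')
p latin1.bytes                           # ù becomes a single byte here

# Transcoding back restores the original UTF-8 string.
round_trip = latin1.encode('UTF-8')

# Raw bytes that are known (by assumption) to be ISO-8859-1 but carry no label:
raw = [82, 97, 103, 0xF9].pack('C*')     # binary (ASCII-8BIT) string
relabeled = raw.force_encoding('ISO-8859-1')  # relabel only, bytes untouched
fixed = relabeled.encode('UTF-8')        # now convert the bytes to UTF-8

puts fixed
```

`force_encoding` never changes bytes, so it is the right tool only when the label is wrong; `encode` is the one that actually converts.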

Also, if you need to interact with COM components in Windows, you'll need to tell Ruby to use the correct encoding with something like the following:

require 'win32ole'

WIN32OLE.codepage = WIN32OLE::CP_UTF8

If you're interacting with MySQL, you'll need to set the collation on the table to one that supports the encoding you're working with. In general, it's best to use a UTF-8 collation, even if some of your content comes back in other encodings; you'll just need to convert as necessary.

Nokogiri has some features for dealing with different encodings (probably through Iconv), but I'm a little out of practice with that, so I'll leave the explanation to someone else.
