Nokogiri, open-uri, and Unicode Characters

Question

I'm using Nokogiri and open-uri to grab the contents of the title tag on a webpage, but am having trouble with accented characters. What's the best way to deal with these? Here's what I'm doing:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open(link))
title = doc.at_css("title")

At this point, the title looks like this:

Rag\303\271

instead of:

Ragù

How can I have nokogiri return the proper character (e.g. ù in this case)?

Here's the example URL:

http://www.epicurious.com/recipes/food/views/Tagliatelle-with-Duck-Ragu-242037

Recommended Answer

When you say "looks like this," are you viewing this value in IRB? It's going to escape non-ASCII characters with C-style escapes of the byte sequences that represent them.

If you print them with puts, you'll get them back as you expect, presuming your shell console is using the same encoding as the string in question (apparently UTF-8 in this case, based on the two bytes returned for that character). If you are storing the values in a text file, printing to a file handle should also result in UTF-8 sequences.
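To make the point about escaping concrete, here is a small sketch (not from the original question) that rebuilds the title from the two escaped bytes and compares the escaped view with the printed one. It assumes a Ruby new enough to have String#force_encoding and String#b, and the variable names are only illustrative.

```ruby
# The two escaped bytes from the question, written as octal integer literals.
raw_bytes = [0303, 0271]

# Build "Rag" plus those bytes as a binary string, then label it as UTF-8.
title = ("Rag".bytes + raw_bytes).pack("C*").force_encoding(Encoding::UTF_8)

puts title.valid_encoding?  # true: the two bytes form one well-formed UTF-8 character
puts title.length           # 4 characters
puts title.bytesize         # 5 bytes, because the last character takes two

# inspect on the binary view shows the escaped form; the exact escape
# style (octal or hex) depends on the Ruby version.
p title.b

# puts writes the raw bytes, so a UTF-8 terminal renders the real character.
puts title
```

The string never changes between the two views; only the way it is displayed does.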

If you need to translate between UTF-8 and other encodings, the specifics depend on whether you're in Ruby 1.9 or 1.8.6.

For 1.9, see http://blog.grayproductions.net/articles/ruby_19s_string; for 1.8, you probably need to look at Iconv.
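For the 1.9-and-later case, a minimal sketch of converting with String#encode might look like the following. The choice of ISO-8859-1 as the "other" encoding is just an assumption for illustration. (Iconv was dropped from the standard library in Ruby 2.0, so this is also the route to take on current Rubies.)

```ruby
utf8 = "Rag\u00F9"

# Transcode UTF-8 to ISO-8859-1: the two-byte character becomes one byte.
latin1 = utf8.encode(Encoding::ISO_8859_1)
p latin1.bytes                 # [82, 97, 103, 249]

# Transcoding back gives the original string again.
round_trip = latin1.encode(Encoding::UTF_8)
puts round_trip == utf8        # true

# When the target encoding cannot represent a character, encode raises
# unless told how to substitute it.
ascii = utf8.encode(Encoding::US_ASCII, undef: :replace, replace: "?")
puts ascii                     # Rag?
```

Note the difference from force_encoding, which only relabels the bytes; encode actually rewrites them.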

Also, if you need to interact with COM components in Windows, you'll need to tell Ruby to use the correct encoding with something like the following:

require 'win32ole'

WIN32OLE.codepage = WIN32OLE::CP_UTF8

If you're interacting with MySQL, you'll need to set the collation on the table to one that supports the encoding that you're working with. In general, it's best to use a UTF-8 collation, even if some of your content is coming back in other encodings; you'll just need to convert as necessary.

Nokogiri has some features for dealing with different encodings (probably through Iconv), but I'm a little out of practice with that, so I'll leave explanation of that to someone else.
