ruby (1.8.7): How to get rid of non-printable chars while scraping?

Question

I'm trying to parse an HTML page with Nokogiri, but I'm having trouble with the text: I cannot get rid of unwanted characters. Whenever I obtain a String while parsing, I try to clean it as much as possible by collapsing each run of non-printable characters into a single space. After many modifications, this is the method I use, without success:

def clear_string(str)
  CGI::unescapeHTML(str).gsub(/\s+/mu," ").strip
end

For instance, take this HTML fragment (copy-pasted from http://www.gisa.cat/gisa/servlet/HomeLicitation?licitationID=1061525):

<tr>
    <td><span class="linkred2">Tramitaci&oacute;:</span></td>
    <td>&nbsp;ordinària </td>
</tr>

Some intermediate example outputs shown by Netbeans 7.0, using Nokogiri and clear_string (the method defined above):

row.at("td[1]").text # => "Tramitació:"
row.at("td[2]").text # => " ordinària "
clear_string(row.at("td[2]").text) # => " ordinària"
row.at("td[2]").text.scan(/./mu) # => ["\302\240", "o", "r", "d", "i", "n", "\303\240", "r", "i", "a", " "]

I don't know why strip doesn't remove the leading space. Moreover, the parsing result after applying clear_string is dumped into a YAML file using YAML::dump. For the two texts, the file contains, respectively:

"Tramitaci\xC3\xB3:"
!binary |
  wqBvcmRpbsOgcmlh

The first one seems barely OK, but I don't know how to fix the second case.

Recommended answer

One way to translate characters from one character set to another is to use Iconv. For example, if all you need is to convert UTF-8 to ASCII, you could do something like this:

require 'iconv'

s = "ordinària"
Iconv.conv('ASCII//TRANSLIT', 'UTF8', s)
=> "ordinaria"

The TRANSLIT switch tells Iconv to try to transliterate (approximately match) unconvertible characters. If you instead want to drop unconvertible characters completely, you can use the IGNORE switch:

Iconv.conv('ASCII//IGNORE', 'UTF8', s)
=> "ordinria"
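
Iconv was removed from Ruby's standard library in 2.0, so on newer interpreters the two behaviours above can be approximated with String#encode and String#unicode_normalize (Ruby 2.2+). This is a rough sketch, not a drop-in replacement: the helper names ascii_ignore and ascii_translit are made up for illustration, and stripping combining marks only covers accented letters, not everything TRANSLIT can handle.

```ruby
# Hypothetical helpers for Ruby 2.2+; neither name comes from the original answer.

# Rough counterpart of ASCII//IGNORE: drop every character with no ASCII form.
def ascii_ignore(str)
  str.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

# Rough counterpart of ASCII//TRANSLIT for accented letters: decompose each
# character into base letter + combining mark (NFD), remove the marks, then
# drop whatever is still not ASCII.
def ascii_translit(str)
  ascii_ignore(str.unicode_normalize(:nfd).gsub(/\p{Mn}/, ""))
end

s = "ordin\u00E0ria"    # "ordinària"
puts ascii_ignore(s)    # prints ordinria
puts ascii_translit(s)  # prints ordinaria
```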

Note that with TRANSLIT alone, Iconv raises an exception if it finds a character it cannot convert. To avoid that, you can combine TRANSLIT and IGNORE like so:

Iconv.conv('ASCII//TRANSLIT//IGNORE', 'UTF8', s)
=> "ordinaria"
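
If the goal is only to tidy whitespace while keeping the accents, it may help to look at the first element of the scan output in the question: "\302\240" is the UTF-8 encoding of U+00A0 (no-break space), the character that &nbsp; decodes to. As the question's output shows, \s does not match it, so neither the gsub nor strip touches it. A minimal sketch that turns it into an ordinary space before collapsing whitespace follows; the constant NBSP and the method name squeeze_whitespace are introduced here for illustration, and the method expects text whose entities are already decoded (as Nokogiri's text returns it).

```ruby
# UTF-8 bytes of U+00A0 (no-break space), written with octal escapes so the
# literal should read the same on Ruby 1.8.7 and on newer versions.
NBSP = "\302\240"

# Hypothetical helper: not part of the original question or answer.
def squeeze_whitespace(str)
  text = str.gsub(NBSP, " ")    # turn no-break spaces into ordinary spaces
  text.gsub(/\s+/, " ").strip   # collapse whitespace runs, trim both ends
end

puts squeeze_whitespace("\302\240ordin\303\240ria ")  # prints ordinària
```

Unlike the Iconv conversions, this leaves "à" intact and only normalises the spacing.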
