Ruby (1.8.7): How to get rid of non-printable chars while scraping?
Problem description
I'm trying to parse an HTML page with Nokogiri, but I'm having some issues with text. Mainly, I cannot get rid of unwanted characters. While parsing, whenever I obtain a String I try to clean it as much as possible, converting each run of non-printable characters into a single space. After many modifications, I am using this method, without success:
def clear_string(str)
  CGI::unescapeHTML(str).gsub(/\s+/mu, " ").strip
end
For instance, take this HTML fragment (copy-pasted from http://www.gisa.cat/gisa/servlet/HomeLicitation?licitationID=1061525):
<tr>
<td><span class="linkred2">Tramitació:</span></td>
<td> ordinària </td>
</tr>
Here are some intermediate example outputs shown by NetBeans 7.0, using Nokogiri and clear_string (the method defined above):
row.at("td[1]").text # => "Tramitació:"
row.at("td[2]").text # => " ordinària "
clear_string(row.at("td[2]").text) # => " ordinària"
row.at("td[2]").text.scan(/./mu) # => ["\302\240", "o", "r", "d", "i", "n", "\303\240", "r", "i", "a", " "]
I don't know why strip doesn't get rid of the leading space. Moreover, the parsing result after applying clear_string is dumped into a YAML file using YAML::dump. For the two texts, its contents are, respectively:
"Tramitaci\xC3\xB3:"
!binary |
wqBvcmRpbsOgcmlh
The first one seems barely OK, but I don't know how to fix the second case.
Recommended answer
One way to translate characters from one character set to another is to use Iconv. For example, if all you are looking for is converting UTF-8 to ASCII, you could do something like this:
require 'iconv'
s = "ordinària"
Iconv.conv('ASCII//TRANSLIT', 'UTF8', s)
=> "ordinaria"
The TRANSLIT switch tells Iconv to try to transliterate (approximately match) unconvertible characters. If you instead want to completely ignore unconvertible characters, you can use the IGNORE switch:
Iconv.conv('ASCII//IGNORE', 'UTF8', s)
=> "ordinria"
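For readers on a newer Ruby, where Iconv is no longer part of the standard library, a rough counterpart of the IGNORE behaviour can be sketched with String#encode. This is only an illustration under stated assumptions: it needs Ruby 1.9 or later (String#encode does not exist in 1.8.7), the input is assumed to be UTF-8, and to_ascii_ignore is a hypothetical helper name, not something from the original answer.

```ruby
# Hypothetical helper: drop every character that has no ASCII equivalent.
# Assumes Ruby 1.9+ and a UTF-8 encoded input string.
def to_ascii_ignore(str)
  # :undef => :replace handles characters undefined in US-ASCII,
  # :invalid => :replace handles malformed byte sequences;
  # both are replaced with the empty string, i.e. removed.
  str.encode("US-ASCII", :invalid => :replace, :undef => :replace, :replace => "")
end

puts to_ascii_ignore("ordin\u00E0ria")
```

Unlike TRANSLIT, this does not approximate the removed characters; it only deletes them.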
Note that Iconv will throw an exception with TRANSLIT if it finds something it can't convert. To avoid that, you can combine IGNORE and TRANSLIT like so:
Iconv.conv('ASCII//TRANSLIT//IGNORE', 'UTF8', s)
=> "ordinaria"
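As for the leading character that strip leaves behind: the scan output in the question shows it as "\302\240", which is the UTF-8 encoding of U+00A0 (NO-BREAK SPACE). Neither strip nor \s treats it as whitespace, so it survives clear_string. If the goal is to keep the accented letters and only normalise spacing, one option is to turn that byte sequence into a plain space before collapsing whitespace. The following is a minimal sketch, assuming UTF-8 input; NBSP and normalize_spaces are hypothetical names introduced here for illustration.

```ruby
# U+00A0 NO-BREAK SPACE as UTF-8 bytes (the "\302\240" from the scan output).
NBSP = "\302\240"

# Hypothetical helper: replace no-break spaces with ordinary spaces,
# collapse every whitespace run into one space, then trim both ends.
def normalize_spaces(str)
  str.gsub(NBSP, " ").gsub(/\s+/, " ").strip
end

puts normalize_spaces("\302\240ordin\303\240ria ")
```

This step could be chained after CGI::unescapeHTML inside clear_string, so that entities such as &nbsp; are decoded first and then normalised.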