Nokogiri HTML parsing question

Question

I'm having trouble figuring out why I can't get the keywords to parse properly with Nokogiri. In the following example, extracting the href values of the a tags works, but I can't figure out how to pull the keywords.

This is the code I have so far:

.....

doc = Nokogiri::HTML(open("http://www.cnn.com"))
doc.xpath('//a/@href').each do |node|
# doc.xpath("//meta[@name='Keywords']").each do |node|
  puts node.text

....

This successfully prints every a href value on the page, but when I switch to the commented-out line for keywords, it doesn't show anything. I've tried several variations of this with no luck. I assume that the ".text" call on node is wrong, but I'm not sure.

My apologies for how rough this code is; I'm doing my best to learn here.

Recommended answer

You're correct, the problem is .text. It returns the text between an element's opening tag and its closing tag. Since meta tags are empty elements, this gives you an empty string. You want the value of the "content" attribute instead:

doc.xpath("//meta[@name='Keywords']/@content").each do |attr|
  puts attr.value
end
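
To see the difference without fetching a live page, here is a minimal sketch that parses an inline HTML string and compares .text with the "content" attribute. The markup and the keyword values are made up for illustration; only the Nokogiri calls are real.

require 'nokogiri'

# Hypothetical markup standing in for a real page.
html = <<~HTML
  <html>
    <head>
      <meta name="Keywords" content="news, weather, sports">
    </head>
    <body><a href="/world">World</a></body>
  </html>
HTML

doc = Nokogiri::HTML(html)

# at_xpath returns the first matching node (or nil if nothing matches).
meta = doc.at_xpath("//meta[@name='Keywords']")

# A meta element has no child text, so .text is an empty string.
text_value = meta.text

# The keywords live in the content attribute.
content_value = meta['content']

puts "text:    #{text_value.inspect}"
puts "content: #{content_value.inspect}"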

Since you know that there will be only one meta tag named "Keywords", you don't actually need to loop through the results; you can take the first item directly, like this:

puts doc.xpath("//meta[@name='Keywords']/@content").first.value

Note, however, that this raises an error if the page has no matching meta tag (first returns nil when nothing matches, and nil has no value method), so the first option might be preferable.
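
One way to keep the one-liner while avoiding that error is to combine at_xpath, which returns the first match or nil, with Ruby's safe navigation operator. The following is only a sketch: meta_keywords is a hypothetical helper name, and both HTML strings are invented test inputs.

require 'nokogiri'

# Hypothetical helper: returns the keywords string, or nil when the tag is absent.
def meta_keywords(doc)
  # at_xpath yields the first matching attribute node or nil;
  # &. skips the call to value when the result is nil.
  doc.at_xpath("//meta[@name='Keywords']/@content")&.value
end

with_tag    = Nokogiri::HTML('<html><head><meta name="Keywords" content="a, b"></head></html>')
without_tag = Nokogiri::HTML('<html><head><title>x</title></head></html>')

p meta_keywords(with_tag)     # prints "a, b"
p meta_keywords(without_tag)  # prints nil

Keep in mind that the XPath comparison against 'Keywords' is case-sensitive, so a page that writes name="keywords" would not match this expression.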
