Nokogiri Ruby HTML Parser


Question

I'm running into problems scraping multiple pages with Nokogiri. I need to narrow down the results first, based on which hrefs qualify. Here is a script that gets all of the hrefs I'm interested in. However, I'm having trouble parsing out the article titles so that I can link to them. It would be great to be able to inspect the elements myself so that I get the links I want and, whenever I find one, also grab the title/text describing the article/href, as in

<a href.......>Text Linked to</a>

so that I end up with a hash like {:source => ".....", :url => ".....", :title => "....."}. Here is the script I have so far. It narrows the links down to the ones I want to put in the hash.

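require 'nokogiri'
require 'open-uri'

page = "http://www.huffingtonpost.com/politics/"

# URI.open comes from open-uri; a bare open(page) no longer fetches URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open(page))
links = doc.css('a')
# Collect every href as a string, then drop empty values and duplicates
hrefs = links.map { |link| link['href'].to_s }.reject(&:empty?).uniq.sort

hrefs.each do |h|
    # Skip links to comment sections
    next if h.end_with?("#comments")
    # Keep long URLs ending in "olitics" (same test as comparing the reversed string to "scitilo")
    puts h if h.end_with?("olitics") && h.length > 65
end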

If someone could help, and maybe explain how I can find the hrefs I want first and then parse the link text for the URLs that pass the filter, as I have here, that would be really nice. Also, is it recommended to put these Nokogiri scripts in controllers and send the results to the database that way in Rails? I appreciate it.

Thanks

Solution

I'm not sure I understand your question completely, but I'm going to interpret it as "How do I extract links and access their attributes?"

Simply amend your selector:

links = doc.css('a[href]')

This will give you all <a> elements that have an href attribute. You can then iterate over these and access their attributes: link['href'] returns the URL, and link.text returns the text inside the link.
