Nokogiri Ruby HTML Parser

Views: 132. Published: 2018/6/21 17:02:44. Tags: html ruby-on-rails ruby screen-scraping nokogiri


Problem description


I'm running into problems scraping multiple pages with Nokogiri. I need to narrow down the results by filtering on the qualifying hrefs first, and the script below gets all of the hrefs I'm interested in. However, I'm having trouble parsing out the article titles so that I can link to them. It would be great to be able to inspect the elements manually, so that whenever I find a link I want I can also grab the title/text describing the article/href, as in
<a href.......>Text Linked to</a>
so that I end up with a hash like {:source => ".....", :url => ".....", :title => "....."}. Here is the script I have so far. It narrows the links down to the ones I want to put in the hash.
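    require 'nokogiri'
    require 'open-uri'

    page = "http://www.huffingtonpost.com/politics/"

    # URI.open replaces the bare open(page), which no longer fetches URLs on Ruby 3.0+
    doc = Nokogiri::HTML(URI.open(page))
    links = doc.css('a')
    # Collect every href as a string, de-duplicated, sorted, with blanks dropped
    hrefs = links.map { |link| link.attribute('href').to_s }.uniq.sort.delete_if { |href| href.empty? }

    hrefs.each do |h|
      # Skip comment anchors; keep long URLs that end in "olitics"
      next if h.end_with?("#comments")
      puts h if h.end_with?("olitics") && h.length > 65
    end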
If someone could help, and maybe explain how I can find the hrefs I want first and then parse out the link text for the URLs that pass the filter, as I have here, that would be really nice. Also, is it recommended to put Nokogiri scripts like this in controllers and write to the database from there in Rails? I appreciate it.

Thanks
Solution
I'm not sure I understand your question completely, but I'm going to interpret it as "How do I extract links and access their attributes?"

Simply amend your selector:
links = doc.css('a[href]')
This will give you all <a> elements that have an href attribute. You can then iterate over these and access their attributes.
