DRY search every page of a site with nokogiri

Views: 67 Published: 2020/10/26 22:49:08 Tags: ruby web-scraping web-crawler nokogiri dry

Question

I want to search every page of a site. My thought is to find all links on a page that stay within the domain, visit them, and repeat. I'll also have to implement measures to avoid repeating work.

So it's easy enough to start:

require 'nokogiri'
require 'open-uri'

page = 'http://example.com'
nf = Nokogiri::HTML(URI.open(page))

links = nf.xpath '//a' # find all links on the current page

main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq

"main_links" is now an array of links from the active page that start with "/" (which should be links on the current domain only).

From here I can feed those links into code similar to the above, but I don't know the best way to ensure I don't repeat myself. I'm thinking I start collecting all the visited links as I visit them:

visited_links = [] # array of what has been visited so far
main_links.each do |ml|
  np = Nokogiri::HTML(URI.open(page + ml)) # load this main_link
  visited_links.push(ml) # push the page we're on
  # grab all links on this page pointing to the current domain
  np_links = np.xpath('//a').map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
  main_links.concat(np_links).uniq! # remove duplicates after pushing?
end

I'm still working out this last bit... but does this seem like the proper approach?

Thanks.

Recommended answer

Others have advised you not to write your own web crawler. I agree with this if performance and robustness are your goals. However, it can be a great learning exercise. You wrote this:

[…] but I don't know the best way to ensure I don't repeat myself

Recursion is the key here. Something like the following code:

require 'set'
require 'uri'
require 'nokogiri'
require 'open-uri'

def crawl_site( starting_at, &each_page )
  files = %w[png jpeg jpg gif svg txt js css zip gz]
  starting_uri = URI.parse(starting_at)
  seen_pages = Set.new                          # Keep track of what we've seen

  crawl_page = ->(page_uri) do                  # A re-usable mini-function
    unless seen_pages.include?(page_uri)
      seen_pages << page_uri                    # Record that we've seen this
      begin
        doc = Nokogiri.HTML(URI.open(page_uri)) # Get the page (URI.open is needed on Ruby 3+)
        each_page.call(doc,page_uri)            # Yield page and URI to the block

        # Find all the links on the page
        hrefs = doc.css('a[href]').map{ |a| a['href'] }

        # Make these URIs, throwing out problem ones like mailto:
        uris = hrefs.map{ |href| URI.join( page_uri, href ) rescue nil }.compact

        # Pare it down to only those pages that are on the same site
        uris.select!{ |uri| uri.host == starting_uri.host }

        # Throw out links to files (this could be more efficient with regex)
        uris.reject!{ |uri| files.any?{ |ext| uri.path.end_with?(".#{ext}") } }

        # Remove #foo fragments so that sub-page links aren't differentiated
        uris.each{ |uri| uri.fragment = nil }

        # Recursively crawl the child URIs
        uris.each{ |uri| crawl_page.call(uri) }

      rescue OpenURI::HTTPError # Guard against 404s
        warn "Skipping invalid link #{page_uri}"
      end
    end
  end

  crawl_page.call( starting_uri )   # Kick it all off!
end

crawl_site('http://phrogz.net/') do |page,uri|
  # page here is a Nokogiri HTML document
  # uri is a URI instance with the address of the page
  puts uri
end
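On a site with very long chains of links, deep recursion could in principle run into Ruby's stack limit. One possible variation, which is not part of the answer above, keeps the same "seen" set but replaces the recursive call with an explicit queue. The sketch below is only an illustration: `crawl_iteratively`, `links_for`, and the in-memory `site` hash are hypothetical names, and the hash stands in for real HTTP requests and Nokogiri parsing so the example runs offline.

```ruby
require 'set'
require 'uri'

# Sketch only: `links_for` is a hypothetical callable that returns the hrefs
# found on a page (a real crawler would fetch and parse the page here).
def crawl_iteratively(starting_at, links_for)
  start = URI.parse(starting_at)
  seen  = Set.new([start])  # every URI ever queued
  queue = [start]           # pages discovered but not yet visited
  order = []                # pages in the order they were visited

  until queue.empty?
    page_uri = queue.shift
    order << page_uri
    links_for.call(page_uri).each do |href|
      uri = (URI.join(page_uri, href) rescue nil)  # skip unparseable hrefs
      next if uri.nil? || uri.host != start.host   # stay on the same site
      uri.fragment = nil                           # ignore #foo fragments
      queue << uri if seen.add?(uri)               # enqueue each page only once
    end
  end
  order
end

# A tiny in-memory "site" containing a cycle (assumed data, not a real site).
site = {
  '/'  => ['/a', '/b#top'],
  '/a' => ['/', '/b'],
  '/b' => ['/a', 'http://elsewhere.example/']
}
fake_links = ->(uri) { site.fetch(uri.path, []) }

visited = crawl_iteratively('http://example.com/', fake_links)
puts visited
```

Each of the three pages is visited exactly once despite the cycle between them, and the off-site link is never followed.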

In short:

Keep track of what pages you've seen using a Set. Do this not by href value, but by the full canonical URI.

Use URI.join to turn possibly-relative paths into the correct URI with respect to the current page.

Use recursion to keep crawling every link on every page, but bail out if you've already seen the page.
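The first two points can be tried without any network access. The following is a minimal sketch, assuming a made-up page address and a made-up list of hrefs; `canonical_uri` is a hypothetical helper name, not something from Nokogiri or the standard library.

```ruby
require 'set'
require 'uri'

# Hypothetical helper: resolve an href against the page it was found on and
# drop any #fragment, so that equivalent links compare equal.
def canonical_uri(page_uri, href)
  uri = URI.join(page_uri, href)
  uri.fragment = nil
  uri
rescue URI::Error
  nil # hrefs that cannot be parsed or joined are skipped
end

page  = URI.parse('http://example.com/docs/index.html')  # assumed page address
hrefs = ['/about', '../about', 'guide.html#intro', 'guide.html', '//other.example/x']

seen     = Set.new
new_uris = []
hrefs.each do |href|
  uri = canonical_uri(page, href)
  next if uri.nil? || uri.host != page.host # stay on the same site
  new_uris << uri if seen.add?(uri)         # Set#add? returns nil for duplicates
end

puts new_uris
# Prints:
#   http://example.com/about
#   http://example.com/docs/guide.html
```

Comparing raw href strings would have treated `/about` and `../about` as different pages, and a check for a leading `/` would have accepted the protocol-relative `//other.example/x` link even though it points at another host.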
