DRY search every page of a site with nokogiri

Question

I want to search every page of a site. My thought is to find all links on a page that stay within the domain, visit them, and repeat. I'll have to implement measures to not repeat efforts as well.

So to start, it's pretty easy:

require 'nokogiri'
require 'open-uri'

page = 'http://example.com'
nf = Nokogiri::HTML(open(page))

links = nf.xpath '//a' #find all links on current page

main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq

"main_links" is now an array of links from the active page that start with "/" (which should be links on the current domain only).

From here I can feed and read those links into similar code above, but I don't know the best way to ensure I don't repeat myself. I'm thinking I start collecting all the visited links as I visit them:

main_links.each do |ml|
  visited_links = [] #new array of what is visited
  np = Nokogiri::HTML(open(page + ml)) #load the first main_link
  visited_links.push(ml) #push the page we're on
  np_links = np.xpath('//a').map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq #grab all links on this page pointing to the current domain
  main_links.push(np_links).compact.uniq #remove duplicates after pushing?
end

I'm still working out this last bit... but does this seem like the proper approach?
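
A minimal sketch of that bookkeeping, assuming the page and main_links variables defined above: the visited list lives outside the loop, and a simple work queue replaces pushing back into main_links.

visited_links = []              # shared across all iterations, not reset each time
queue = main_links.dup          # links still to visit

until queue.empty?
  link = queue.shift
  next if visited_links.include?(link)    # skip anything already crawled
  visited_links << link

  doc = Nokogiri::HTML(open(page + link)) # page is 'http://example.com' from above
  new_links = doc.xpath('//a').map{|a| a['href'] if a['href'] =~ /^\//}.compact.uniq
  queue.concat(new_links - visited_links) # only enqueue links not yet visited
end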

Thanks.

Recommended answer

Others have advised you not to write your own web crawler. I agree with this if performance and robustness are your goals. However, it can be a great learning exercise. You wrote this:


[…] but I don't know the best way to ensure I don't repeat myself

Recursion is the key here. Something like the following code:

require 'set'
require 'uri'
require 'nokogiri'
require 'open-uri'

def crawl_site( starting_at, &each_page )
  files = %w[png jpeg jpg gif svg txt js css zip gz]
  starting_uri = URI.parse(starting_at)
  seen_pages = Set.new                      # Keep track of what we've seen

  crawl_page = ->(page_uri) do              # A re-usable mini-function
    unless seen_pages.include?(page_uri)
      seen_pages << page_uri                # Record that we've seen this
      begin
        doc = Nokogiri.HTML(open(page_uri)) # Get the page
        each_page.call(doc,page_uri)        # Yield page and URI to the block

        # Find all the links on the page
        hrefs = doc.css('a[href]').map{ |a| a['href'] }

        # Make these URIs, throwing out problem ones like mailto:
        uris = hrefs.map{ |href| URI.join( page_uri, href ) rescue nil }.compact

        # Pare it down to only those pages that are on the same site
        uris.select!{ |uri| uri.host == starting_uri.host }

        # Throw out links to files (this could be more efficient with regex)
        uris.reject!{ |uri| files.any?{ |ext| uri.path.end_with?(".#{ext}") } }

        # Remove #foo fragments so that sub-page links aren't differentiated
        uris.each{ |uri| uri.fragment = nil }

        # Recursively crawl the child URIs
        uris.each{ |uri| crawl_page.call(uri) }

      rescue OpenURI::HTTPError # Guard against 404s
        warn "Skipping invalid link #{page_uri}"
      end
    end
  end

  crawl_page.call( starting_uri )   # Kick it all off!
end

crawl_site('http://phrogz.net/') do |page,uri|
  # page here is a Nokogiri HTML document
  # uri is a URI instance with the address of the page
  puts uri
end

In short:


  • Keep track of which pages you've seen using a Set. Do this not by href value, but by the full canonical URI.
  • Use URI.join to turn possibly-relative paths into the correct URI with respect to the current page (see the sketch after this list).
  • Use recursion to keep crawling every link on every page, but bail out if you've already seen the page.
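
A small illustration of the first two points, using hypothetical URLs not taken from the answer above: URI.join resolves relative hrefs against the current page, and clearing the fragment makes sub-page anchors compare equal inside the Set.

require 'uri'
require 'set'

page_uri = URI.parse('http://example.com/articles/index.html')

URI.join(page_uri, '../about.html')  #=> #<URI::HTTP http://example.com/about.html>
URI.join(page_uri, '/contact')       #=> #<URI::HTTP http://example.com/contact>

seen = Set.new
a = URI.join(page_uri, 'page2.html#comments')
b = URI.join(page_uri, 'page2.html')
a.fragment = nil                     # drop "#comments" so both point at the same page
seen << a
seen.include?(b)                     #=> true -- already seen, so it would not be crawled again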
