Ruby+Anemone Web Crawler: regex to match URLs ending in a series of digits

Problem description

Suppose I am trying to crawl a website and skip any page whose URL ends like this:

http://HIDDENWEBSITE.com/anonimize/index.php?page=press_and_news&subpage=20060117

I am currently using the Anemone gem in Ruby to build the crawler. I am using the skip_links_like method, but my pattern never seems to match. I am trying to make this as generic as possible, so that it doesn't depend on subpage but only on the trailing =2105925 (the digits).

I have tried /=\d+$/ and /\?.*\d+$/, but neither seems to work.

This is similar to "Skipping web-pages with extension pdf, zip from crawling in Anemone", but I can't make it work with digits instead of extensions.

Also, testing the pattern =\d+$ on http://regexpal.com/ successfully matches http://misc.com/test/index.php?page=news&subpage=20060118
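The same check can be reproduced in plain Ruby, without Anemone, to confirm that the patterns themselves are not the problem. This is only a minimal sketch: the variable names `url` and `patterns` are illustrative, and the URL is the sample one from above.

```ruby
# Plain-Ruby check of the two patterns from the question against the
# sample URL. No crawler is involved; this only tests the regexes.
url = "http://misc.com/test/index.php?page=news&subpage=20060118"

patterns = [/=\d+$/, /\?.*\d+$/]

patterns.each do |pattern|
  # String#=~ returns the match index, or nil when there is no match.
  matched = !(url =~ pattern).nil?
  puts "#{pattern.inspect} matches: #{matched}"
end
```

If both patterns match here, the mismatch is more likely in what string the crawler hands to the pattern than in the pattern itself.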

Here is the entirety of my code. I wonder if anyone can see exactly what's wrong.

require 'anemone'
...
Anemone.crawl(url, :depth_limit => 3, :obey_robots_txt => true) do |anemone|
  anemone.skip_links_like /\?.*\d+$/
  anemone.on_every_page do |page|
    pURL = page.url.to_s
    puts "Now checking: " + pURL
    bestGuess[pURL] = match_freq( manList, page.doc.inner_text )
    puts "Successfully checked"
  end
end

My output looks like this:

...
Now checking: http://MISC.com/about_us/index.php?page=press_and_news&subpage=20110711
Successfully checked
...

Recommended answer

Anemone.crawl(url, :depth_limit => 3, :obey_robots_txt => true, :skip_query_strings => true) do |anemone|
  anemone.on_every_page do |page|
    pURL = page.url.to_s
    puts "Now checking: " + pURL
    bestGuess[pURL] = match_freq( manList, page.doc.inner_text )
    puts "Successfully checked"
  end
end
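One possible reason the original patterns never fire: as far as I can tell from Anemone's source, skip_links_like compares each pattern against the link's path only, not the full URL, so the query string is never seen by the regex. That is an assumption worth verifying against the installed gem version. The sketch below uses only Ruby's standard `uri` library to show the difference between the path and the full URL. `drop_digit_suffixed` is a hypothetical helper, not part of Anemone.

```ruby
require 'uri'

link = URI.parse("http://misc.com/test/index.php?page=news&subpage=20060118")
pattern = /=\d+$/

# URI#path excludes the query string, so the pattern has nothing to match.
path_match = !(link.path =~ pattern).nil?   # tested against "/test/index.php"
# The full URL string does include "?page=news&subpage=20060118".
full_match = !(link.to_s =~ pattern).nil?

puts "path  #{link.path.inspect} matches: #{path_match}"
puts "full  #{link.to_s.inspect} matches: #{full_match}"

# Hypothetical helper: keep only links whose full URL does not end in
# "=<digits>". It takes and returns an Array of URI objects.
def drop_digit_suffixed(links, pattern = /=\d+\z/)
  links.reject { |l| l.to_s =~ pattern }
end

# Assumed usage inside the crawl block (not run here, requires the gem):
#   anemone.focus_crawl { |page| drop_digit_suffixed(page.links) }
```

Note that :skip_query_strings => true, as used in the answer, appears to skip every link that has a query string, which is broader than skipping only URLs ending in digits. A filter on the full URL, such as the helper above, would be the narrower option if other query-string pages still need to be crawled.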
