Nutch crawling not working for a particular URL


Question


I am using Apache Nutch for crawling. When I crawl the page http://www.google.co.in, it crawls the page correctly and produces results. But when I add a parameter to that URL, it does not fetch any results for http://www.google.co.in/search?q=bill+gates.
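The original post does not show the exact command, but judging from the log below (rootUrlDir = urls, depth = 3, topN = 100, solrUrl not set), the crawl was presumably started with the standard Nutch 1.x one-shot crawl command, roughly:

bin/nutch crawl urls -dir crawl -depth 3 -topN 100

The full console output of that run is: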

solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 100
Injector: starting at 2013-05-27 08:01:57
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-05-27 08:02:11, elapsed: 00:00:14
Generator: starting at 2013-05-27 08:02:11
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130527080219
Generator: finished at 2013-05-27 08:02:26, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-05-27 08:02:26
Fetcher: segment: crawl/segments/20130527080219
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.google.co.in/search?q=bill+gates
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-05-27 08:02:33, elapsed: 00:00:07
ParseSegment: starting at 2013-05-27 08:02:33
ParseSegment: segment: crawl/segments/20130527080219
ParseSegment: finished at 2013-05-27 08:02:40, elapsed: 00:00:07
CrawlDb update: starting at 2013-05-27 08:02:40
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130527080219]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-05-27 08:02:54, elapsed: 00:00:13
Generator: starting at 2013-05-27 08:02:54
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2013-05-27 08:03:01
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/home/muthu/workspace/webcrawler/crawl/segments/20130527080219
LinkDb: finished at 2013-05-27 08:03:08, elapsed: 00:00:07
crawl finished: crawl

I have already added this code:

# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*
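This rule comes from Nutch's URL filter configuration (conf/regex-urlfilter.txt in Nutch 1.x; older tutorials use conf/crawl-urlfilter.txt) and rejects every URL containing ?, *, !, @ or =, i.e. most query-string URLs. A minimal sketch of relaxing it so that parameterised URLs such as /search?q=bill+gates pass the filter (file name assumed from the standard distribution):

# conf/regex-urlfilter.txt
# either comment out the query-skipping rule entirely:
# -.*[?*!@=].*
# or keep it but drop '?' and '=' from the character class:
-.*[*!@].*

Note that the injector log above already reports "total number of urls rejected by filters: 0", so in this particular run the URL filter was not what blocked the page.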


Why does this happen? Can Nutch fetch the URL if I add a parameter? Thanks in advance for your help.

Answer


The Nutch crawler obeys robots.txt, and if you look at the robots.txt located at http://www.google.co.in/robots.txt you will find that /search is disallowed from crawling.
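You can verify this from a shell; the exact contents of Google's file change over time, so the output shown here is illustrative rather than a verbatim copy:

curl -s http://www.google.co.in/robots.txt | grep -i search
# expected to show, among other rules, something like:
# Disallow: /search

Because the fetcher honours that rule, the page body is never downloaded, no outlinks are parsed, and the next Generator round finds nothing to fetch, which matches the "0 records selected for fetching, exiting ..." and "Stopping at depth=1" lines in the log above.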
