Nutch does not crawl URLs with query string parameters


Problem description


I am using Nutch 1.9 and trying to crawl using individual commands. As can be seen in the output, when going to the second level the Generator returned 0 records. Has anyone faced this issue? I have been stuck here for the past 2 days and have searched all possible options. Any leads/help would be much appreciated.

#######  INJECT   ######
Injector: starting at 2015-04-08 17:36:20
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 1
Injector: Total new urls injected: 1
Injector: finished at 2015-04-08 17:36:21, elapsed: 00:00:01
####  GENERATE  ###
Generator: starting at 2015-04-08 17:36:22
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20150408173625
Generator: finished at 2015-04-08 17:36:26, elapsed: 00:00:03
crawl/segments/20150408173625
#### FETCH  ####
Fetcher: starting at 2015-04-08 17:36:26
Fetcher: segment: crawl/segments/20150408173625
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
fetching https://ifttt.com/recipes/search?q=SmartThings (queue crawl delay=5000ms)
Using queue mode : byHost
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
Thread FetcherThread has no more work available
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2015-04-08 17:36:33, elapsed: 00:00:06
#### PARSE ####
ParseSegment: starting at 2015-04-08 17:36:33
ParseSegment: segment: crawl/segments/20150408173625
ParseSegment: finished at 2015-04-08 17:36:35, elapsed: 00:00:01
########   UPDATEDB   ##########
CrawlDb update: starting at 2015-04-08 17:36:36
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20150408173625]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2015-04-08 17:36:37, elapsed: 00:00:01
#####  GENERATE  ######
Generator: starting at 2015-04-08 17:36:38
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
#######   EXTRACT  #########
crawl/segments/20150408173625
#### Segments #####
20150408173625


EDIT: So I checked another URL with query params (http://queue.acm.org/detail.cfm?id=988409) and it crawled it fine...


So this means that it is specifically not crawling my original URL: https://ifttt.com/recipes/search?q=SmartThings&ac=true


I have tried crawling URLs without a query string for this ifttt domain, and Nutch crawls them successfully...


I think the issue is with crawling an HTTPS website with query strings. Any help regarding this issue?

Answer

By default, links with query parameters are ignored or filtered out. To enable crawling URLs with parameters, go to conf/regex-urlfilter.txt and comment out the following line by adding # to the beginning of it:

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
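The effect of that default rule can be illustrated outside Nutch. The pattern `[?*!@=]` is a character class, so a single `?`, `*`, `!`, `@`, or `=` anywhere in the URL matches, and the leading `-` in the rule file means "reject". Here is a minimal sketch in Python (not Nutch's actual Java filter, which applies every rule in the file in order):

```python
import re

# The pattern from the default -[?*!@=] rule in conf/regex-urlfilter.txt:
# any URL containing one of these characters is rejected.
QUERY_CHARS = re.compile(r"[?*!@=]")

def passes_default_filter(url: str) -> bool:
    """Return True if the URL survives the query-character rule."""
    return QUERY_CHARS.search(url) is None

# The original URL from the question is rejected: it contains both '?' and '='.
print(passes_default_filter("https://ifttt.com/recipes/search?q=SmartThings"))  # False

# The same path without a query string passes.
print(passes_default_filter("https://ifttt.com/recipes"))  # True
```

Nutch's regex-urlfilter rules are evaluated top-down and the first matching rule wins; the default file ends with an accept-everything rule (`+.`), which is why commenting out the `-[?*!@=]` line is enough to let query-string URLs through.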
