Why does my Scrapy spider not use all of the URLs in the start_urls list?


Problem description

I have almost 300 URLs in my start_urls list, but Scrapy only crawls about 200 of them, not all of the listed URLs. I do not know why, or how to deal with it. I need to scrape more items from the website.

Another thing I do not understand: how can I see the log errors after Scrapy finishes? From the terminal, or do I have to write code to see them? I think logging is enabled by default.
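Logging is indeed enabled by default and goes to the terminal, so errors scroll past with everything else. A minimal sketch, assuming a standard Scrapy project, of redirecting the log to a file so it can be inspected after the crawl finishes (the file name scrapy.log is only an example; the same settings can also be passed on the command line with -s LOG_FILE=scrapy.log):

    # settings.py -- write the crawl log to a file for inspection after the run
    LOG_ENABLED = True        # on by default; shown here for clarity
    LOG_FILE = 'scrapy.log'   # example name; errors can then be found with: grep ERROR scrapy.log
    LOG_LEVEL = 'INFO'        # use 'DEBUG' to also see every request/response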

Thank you for your answers.

Update:

The output is below. I do not know why only 2829 items were scraped; there are actually 600 URLs in my start_urls.

But when I give only 400 URLs in start_urls, it can scrape 6000 items. I expect to scrape almost the whole website of www.yhd.com. Could anyone give any more suggestions?

2014-12-08 12:11:03-0600 [yhd2] INFO: Closing spider (finished)
2014-12-08 12:11:03-0600 [yhd2] INFO: Stored csv feed (2829 items) in myinfoDec.csv        
2014-12-08 12:11:03-0600 [yhd2] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
'downloader/request_bytes': 142586,
'downloader/request_count': 476,
'downloader/request_method_count/GET': 476,
'downloader/response_bytes': 2043856,
'downloader/response_count': 475,
'downloader/response_status_count/200': 474,
'downloader/response_status_count/504': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 12, 8, 18, 11, 3, 607101),
'item_scraped_count': 2829,
'log_count/DEBUG': 3371,
'log_count/ERROR': 1,
'log_count/INFO': 14,
'response_received_count': 474,
'scheduler/dequeued': 476,
'scheduler/dequeued/memory': 476,
'scheduler/enqueued': 476,
'scheduler/enqueued/memory': 476,
'start_time': datetime.datetime(2014, 12, 8, 18, 4, 19, 698727)}
2014-12-08 12:11:03-0600 [yhd2] INFO: Spider closed (finished)

Recommended answer

I finally solved the problem.

First, it did not crawl all of the URLs listed in start_urls because I had a typo in one of them: one of the "http://..." entries was mistakenly written as "ttp://...", with the first 'h' missing. The spider then seems to have stopped looking at the rest of the URLs listed after it. Horrifying.
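As a guard against this kind of silent failure, one option (a sketch of my own, not part of the original fix; the Yhd2Spider class layout, the placeholder URL, and the on_error name are assumptions) is to validate start_urls up front and attach an errback, so a single bad entry is logged instead of derailing the rest. This targets a reasonably recent Scrapy version:

    import logging
    import scrapy

    class Yhd2Spider(scrapy.Spider):
        name = 'yhd2'
        start_urls = ['http://www.yhd.com/...']  # placeholder; the real list of ~600 URLs goes here

        def start_requests(self):
            for url in self.start_urls:
                # Skip obviously malformed entries such as "ttp://..." instead of crashing
                if not url.startswith(('http://', 'https://')):
                    logging.error('Skipping malformed start URL: %s', url)
                    continue
                yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

        def on_error(self, failure):
            # Called on download errors (timeouts, DNS failures, 504s, ...) so they show up in the log
            logging.error('Request failed: %s', failure.request.url)

        def parse(self, response):
            # ... the existing item extraction logic ...
            pass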

Second, I solved the log file problem by clicking through PyCharm's configuration panel, which provides a panel that shows the log file. By the way, my Scrapy project runs inside the PyCharm IDE, and that works great for me; this is not an advertisement.

Thanks for all the comments and suggestions.

