Why does my Scrapy spider not use all of the URLs in the start_urls list?


Problem description

I have almost 300 URLs in my start_urls list, but Scrapy only crawls about 200 of them, not all of the listed URLs. I do not know why, or how to deal with it. I need to scrape more items from the website.

Another thing I do not understand: how can I see the log errors after Scrapy finishes? From the terminal, or do I have to write code to see them? I think logging is enabled by default.
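Logging is indeed enabled by default and goes to the terminal, so errors scroll past with everything else. A minimal sketch, assuming a standard Scrapy project, of redirecting the log to a file so it can be inspected after the crawl finishes (the file name scrapy.log is only an example; the same settings can also be passed on the command line with -s LOG_FILE=scrapy.log):

    # settings.py -- write the crawl log to a file for inspection after the run
    LOG_ENABLED = True        # on by default; shown here for clarity
    LOG_FILE = 'scrapy.log'   # example name; errors can then be found with: grep ERROR scrapy.log
    LOG_LEVEL = 'INFO'        # use 'DEBUG' to also see every request/response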

Thank you for your answers.

Update:

The output is below. I do not know why only 2829 items were scraped; there are actually 600 URLs in my start_urls.

But when I give only 400 URLs in start_urls, it can scrape 6000 items. I expect to scrape almost the whole website of www.yhd.com. Could anyone give any more suggestions?

2014-12-08 12:11:03-0600 [yhd2] INFO: Closing spider (finished)
2014-12-08 12:11:03-0600 [yhd2] INFO: Stored csv feed (2829 items) in myinfoDec.csv        
2014-12-08 12:11:03-0600 [yhd2] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
'downloader/request_bytes': 142586,
'downloader/request_count': 476,
'downloader/request_method_count/GET': 476,
'downloader/response_bytes': 2043856,
'downloader/response_count': 475,
'downloader/response_status_count/200': 474,
'downloader/response_status_count/504': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 12, 8, 18, 11, 3, 607101),
'item_scraped_count': 2829,
'log_count/DEBUG': 3371,
'log_count/ERROR': 1,
'log_count/INFO': 14,
'response_received_count': 474,
'scheduler/dequeued': 476,
'scheduler/dequeued/memory': 476,
'scheduler/enqueued': 476,
'scheduler/enqueued/memory': 476,
'start_time': datetime.datetime(2014, 12, 8, 18, 4, 19, 698727)}
2014-12-08 12:11:03-0600 [yhd2] INFO: Spider closed (finished)

Recommended answer

I finally solved the problem.

First, it did not crawl all of the URLs listed in start_urls because I had a typo in one of them: one of the "http://..." entries was mistakenly written as "ttp://...", with the first 'h' missing. The spider then seems to have stopped looking at the rest of the URLs listed after it. Horrifying.
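As a guard against this kind of silent failure, one option (a sketch of my own, not part of the original fix; the Yhd2Spider class layout, the placeholder URL, and the on_error name are assumptions) is to validate start_urls up front and attach an errback, so a single bad entry is logged instead of derailing the rest. This targets a reasonably recent Scrapy version:

    import logging
    import scrapy

    class Yhd2Spider(scrapy.Spider):
        name = 'yhd2'
        start_urls = ['http://www.yhd.com/...']  # placeholder; the real list of ~600 URLs goes here

        def start_requests(self):
            for url in self.start_urls:
                # Skip obviously malformed entries such as "ttp://..." instead of crashing
                if not url.startswith(('http://', 'https://')):
                    logging.error('Skipping malformed start URL: %s', url)
                    continue
                yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

        def on_error(self, failure):
            # Called on download errors (timeouts, DNS failures, 504s, ...) so they show up in the log
            logging.error('Request failed: %s', failure.request.url)

        def parse(self, response):
            # ... the existing item extraction logic ...
            pass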

Second, I solved the log file problem by clicking through PyCharm's configuration panel, which provides a panel that shows the log file. By the way, my Scrapy project runs inside the PyCharm IDE, and that works great for me; this is not an advertisement.

Thanks for all the comments and suggestions.

