Scrapy gets stuck crawling a long list of urls
Problem description
I am scraping a large list of URLs (roughly 1,000), and after a certain time the crawler gets stuck, crawling 0 pages/min. The problem always occurs at the same spot in the crawl. The list of URLs is retrieved from a MySQL database. I am fairly new to Python and Scrapy, so I don't know where to start debugging, and I fear that due to my inexperience the code itself is also a bit of a mess. Any pointers to where the issue lies are appreciated.
I used to retrieve the entire list of URLs in one go, and the crawler worked fine. However, I had problems writing the results back into the database, and I didn't want to read the whole list of URLs into memory, so I changed the code to iterate through the database one URL at a time. That is when the problem started. I am fairly certain the URL itself isn't the issue: when I start the crawl from the problem URL, it is fetched without trouble, and the crawler gets stuck further down the line at a different but consistent spot.
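Reading URLs one at a time without holding the whole list in memory can be sketched with a generator over a database cursor. This is only an illustration under assumptions: it uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs anywhere, and the table name `pages` and the helpers `iter_urls` and `fetch_url` are hypothetical. Whether rows are really streamed rather than buffered depends on the driver and cursor type in use.

```python
import sqlite3


def iter_urls(conn):
    """Yield (row_id, url) pairs one at a time instead of building a full list."""
    cursor = conn.cursor()
    cursor.execute("SELECT id, url FROM pages ORDER BY id")
    for row_id, url in cursor:  # the cursor is consumed lazily, row by row
        yield row_id, url


def fetch_url(conn, row_id):
    """Look up a single URL with a bound parameter instead of string concatenation."""
    # sqlite3 uses "?" placeholders; MySQL drivers typically use "%s".
    row = conn.execute("SELECT url FROM pages WHERE id = ?", (row_id,)).fetchone()
    return row[0] if row else None


# Hypothetical in-memory data standing in for the real MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO pages (id, url) VALUES (?, ?)",
    [(1, "http://example.com/a"), (2, "http://example.com/b")],
)

pairs = list(iter_urls(conn))
second = fetch_url(conn, 2)
missing = fetch_url(conn, 99)
```

Because each pair carries its own row id, the id can travel with the request rather than living in a shared counter that several callbacks modify.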
The relevant parts of the code are as follows. Note that the script is supposed to be run as a standalone script, which is why I define the necessary settings in the spider itself.
class MySpider(CrawlSpider):
    name = "mySpider"
    item = []
    #spider settings
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'DEPTH_LIMIT': 1,
        'DNS_TIMEOUT': 5,
        'DOWNLOAD_TIMEOUT':5,
        'RETRY_ENABLED': False,
        'REDIRECT_MAX_TIMES': 1
    }

    def start_requests(self):
        while i < n_urls:
            urllist = "SELECT url FROM database WHERE id=" + i
            cursor = db.cursor()
            cursor.execute(urllist)
            urls = cursor.fetchall()
            urls = [i[0] for i in urls] #fetch url from inside list of tuples
            urls = str(urls[0]) #transform url into string from list
            yield Request(urls, callback=self.parse, errback=self.errback)

    def errback(self, failure):
        global i
        sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
        val = ('Error', str(j))
        cursor.execute(sql, val)
        db.commit()
        i += 1

    def parse(self, response):
        global i
        item = myItem()
        item["result"] = response.xpath("//item to search")
        if item["result"] is None or len(item["result"]) == 0:
            sql = "UPDATE db SET, item = %s, scrape_time = now() WHERE id = %s"
            val = ('None', str(i))
            cursor.execute(sql, val)
            db.commit()
            i += 1
        else:
            sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
            val = ('Item', str(i))
            cursor.execute(sql, val)
            db.commit()
            i += 1
The scraper gets stuck showing the following message:
2019-01-14 15:10:43 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET someUrl> from <GET anotherUrl>
2019-01-14 15:11:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 9 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:12:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:13:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:14:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:15:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:16:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Everything works fine up until this point. Any help you could give me is appreciated!
Recommended answer
The reason Scrapy says 0 items is that it only counts data yielded by your callbacks, and you are not yielding anything; you are only inserting into your database.