Horizontally scaling Scrapyd
Problem description
What tool or set of tools would you use to scale Scrapyd horizontally, adding new machines to a Scrapyd cluster dynamically and running N instances per machine if required? It is not necessary for all the instances to share a common job queue, but that would be awesome.
Scrapy-cluster seems promising for the job, but I want a Scrapyd-based solution, so I am open to other alternatives and suggestions.
Recommended answer
I scripted my own load balancer for Scrapyd using its API and a wrapper.
from random import shuffle

from scrapyd_api.wrapper import ScrapydAPI

# `settings` is the project's own settings module; its import was not shown in the original.


class JobLoadBalancer(object):

    @classmethod
    def get_less_occupied(
            cls,
            servers_urls=settings.SERVERS_URLS,
            project=settings.DEFAULT_PROJECT,
            acceptable=settings.ACCEPTABLE_PENDING):
        """Return a client for the server with the fewest pending jobs."""
        free_runner = {'num_jobs': 9999, 'client': None}
        # Shuffle a copy so ties are spread out and the caller's list is not reordered
        servers_urls = list(servers_urls)
        shuffle(servers_urls)
        for url in servers_urls:
            scrapyd = ScrapydAPI(target=url)
            jobs = scrapyd.list_jobs(project)
            num_jobs = len(jobs['pending'])

            if free_runner['num_jobs'] > num_jobs:
                free_runner['num_jobs'] = num_jobs
                free_runner['client'] = scrapyd
            # Optimization: stop looking once a server has an acceptable number of pending jobs
            if free_runner['client'] and free_runner['num_jobs'] <= acceptable:
                break

        return free_runner['client']
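The selection logic can also be tried out without any live Scrapyd servers. The following is only a minimal sketch under stated assumptions: `pick_least_pending` and its `count_pending` callback are hypothetical names, and the callback stands in for `len(scrapyd.list_jobs(project)['pending'])`. It works on a copy of the URL list, so the caller's list keeps its order.

```python
import random


def pick_least_pending(server_urls, count_pending, acceptable=0, rng=random):
    """Return the URL with the fewest pending jobs, or None for an empty list.

    count_pending is a hypothetical callback: url -> number of pending jobs.
    """
    candidates = list(server_urls)  # copy so the caller's list is not reordered
    rng.shuffle(candidates)         # spread ties across servers
    best_url, best_pending = None, None
    for url in candidates:
        pending = count_pending(url)
        if best_pending is None or pending < best_pending:
            best_url, best_pending = url, pending
        # Stop early once a server is at or below the acceptable backlog.
        if best_pending <= acceptable:
            break
    return best_url


# Usage with made-up pending counts in place of real Scrapyd calls.
fake_pending = {
    'http://localhost:6800': 2,
    'http://localhost:6900': 0,
}
chosen = pick_least_pending(list(fake_pending), fake_pending.get)
print(chosen)
```

Because the server list is passed in on every call, a machine added to the list at runtime is considered on the next call; how that list is maintained is left open here.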
Unit test:
from unittest import TestCase  # assumed base class; the original showed only the two methods


class TestFactory(TestCase):

    def setUp(self):
        super(TestFactory, self).setUp()
        # Make sure these servers are running
        settings.SERVERS_URLS = [
            'http://localhost:6800',
            'http://localhost:6900'
        ]
        self.project = 'dummy'
        self.spider = 'dummy_spider'
        self.acceptable = 0

    def test_get_less_occupied(self):
        # Add dummy jobs to the first server so that the second one is chosen
        scrapyd = ScrapydAPI(target=settings.SERVERS_URLS[0])
        scrapyd.schedule(project=self.project, spider=self.spider)
        scrapyd.schedule(project=self.project, spider=self.spider)

        second_server_url = settings.SERVERS_URLS[1]
        scrapyd = JobLoadBalancer.get_less_occupied(
            servers_urls=settings.SERVERS_URLS,
            project=self.project,
            acceptable=self.acceptable)
        self.assertEqual(scrapyd.target, second_server_url)
This code targets an older version of Scrapyd, as it was written more than a year ago.
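If the wrapper's interface has changed since then, the same idea can be expressed against Scrapyd's JSON HTTP API directly. The sketch below uses only the standard library and relies on the `listjobs.json` endpoint, whose response contains `pending`, `running`, and `finished` lists. The function names and the injectable `fetch` parameter are hypothetical, added so the logic can be run without a live server; this is an illustration, not the code from the answer.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url, timeout=5):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


def count_pending(server_url, project, fetch=fetch_json):
    """Count pending jobs reported by one Scrapyd instance via listjobs.json."""
    query = urlencode({'project': project})
    data = fetch('{}/listjobs.json?{}'.format(server_url.rstrip('/'), query))
    if data.get('status') != 'ok':
        raise RuntimeError('Scrapyd error from {}: {!r}'.format(server_url, data))
    return len(data.get('pending', []))


def least_pending_server(server_urls, project, fetch=fetch_json):
    """Return the reachable server with the fewest pending jobs, or None."""
    counts = {}
    for url in server_urls:
        try:
            counts[url] = count_pending(url, project, fetch=fetch)
        except (OSError, ValueError, RuntimeError):
            continue  # skip servers that are down or answer with an error
    return min(counts, key=counts.get) if counts else None


# Usage with a canned response in place of a live server (hypothetical data).
def fake_fetch(url):
    pending = [{'id': 'a', 'spider': 'dummy_spider'}] if ':6800' in url else []
    return {'status': 'ok', 'pending': pending, 'running': [], 'finished': []}


print(least_pending_server(
    ['http://localhost:6800', 'http://localhost:6900'], 'dummy', fetch=fake_fetch))
```

Scheduling on the chosen server would then be a POST to its `schedule.json` endpoint with `project` and `spider` fields. Skipping unreachable servers is a design choice of this sketch, so that an instance that has gone away does not block scheduling on the others.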