Scraping landing pages of a list of domains
Problem description
I have a reasonably long list of websites that I want to download the landing (index.html or equivalent) pages for. I am currently using Scrapy (much love to the guys behind it -- this is a fabulous framework). Scrapy is slower on this particular task than I'd like, and I am wondering if wget or another alternative would be faster given how straightforward the task is. Any ideas?
(Here's what I am doing with Scrapy. Anything I can do to optimize Scrapy for this task?)
So, I have a list of start URLs like:
start_urls = ['google.com', 'yahoo.com', 'aol.com']
And I scrape the text from each response and store this in an XML file. I need to turn off the OffsiteMiddleware to allow for multiple domains.
Scrapy works as expected, but seems slow (about 1000 in an hour, or 1 every 4 seconds). Is there a way to speed this up by increasing the number of CONCURRENT_REQUESTS_PER_SPIDER while running a single spider? Anything else?
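As a side note on the concurrency question: in current Scrapy versions the relevant settings are CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN (the old CONCURRENT_REQUESTS_PER_SPIDER setting has been replaced). A sketch of settings one might try for a many-domain, one-page-per-site crawl; the specific values here are illustrative, not recommendations:

```python
# Illustrative Scrapy settings for a crawl spread over many domains.
# Setting names follow current Scrapy; the values are example guesses.
custom_settings = {
    'CONCURRENT_REQUESTS': 64,            # global cap on in-flight requests
    'CONCURRENT_REQUESTS_PER_DOMAIN': 8,  # parallelism per individual domain
    'DOWNLOAD_TIMEOUT': 15,               # don't let slow hosts stall the sweep
    'RETRY_ENABLED': False,               # one page per site; skip retries
}

print(custom_settings['CONCURRENT_REQUESTS'])
```

Because the requests go to many different domains, raising the global CONCURRENT_REQUESTS is what matters most; the per-domain limit rarely binds when each domain is only hit once.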
Recommended answer
If you want a way to concurrently download multiple sites with Python, you can do so with the standard library, like this:
import threading
import urllib.request

maxthreads = 4
sites = ['google.com', 'yahoo.com', ]  # etc.

class Download(threading.Thread):
    def run(self):
        while sites:
            try:
                site = sites.pop()  # list.pop() is atomic under the GIL
            except IndexError:
                return  # another thread took the last site
            print("start", site)
            urllib.request.urlretrieve('http://' + site, site)
            print("end  ", site)

for x in range(min(maxthreads, len(sites))):
    Download().start()
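The same worker-pool pattern is shorter with concurrent.futures from the standard library. A minimal sketch; the fetch function here only builds the URL so the example runs offline, and the commented-out urlretrieve call marks where the real download would go:

```python
from concurrent.futures import ThreadPoolExecutor

sites = ['google.com', 'yahoo.com', 'aol.com']

def fetch(site):
    # Build the target URL; swap in the real download as needed.
    url = 'http://' + site
    # urllib.request.urlretrieve(url, site)  # actual download step
    return url

# The executor hands each site to one of 4 worker threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, sites))

print(results)
```

pool.map preserves input order and joins the workers when the with-block exits, so there is no need to manage threads or a shared work list by hand.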
You could also check out httplib2 or PycURL to do the downloading for you instead of urllib.
I'm not clear exactly how you want the scraped text as XML to look, but you could use xml.etree.ElementTree from the standard library, or you could install BeautifulSoup (which would be better, as it handles malformed markup).
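For the XML side, a minimal xml.etree.ElementTree sketch; the `<pages>`/`<page>` element names and the `pages` dict are made up for illustration, since the question doesn't specify the desired layout:

```python
import xml.etree.ElementTree as ET

# Hypothetical scraped data: domain -> extracted page text.
pages = {'google.com': 'example text', 'yahoo.com': 'more text'}

# One <page> element per site under a <pages> root.
root = ET.Element('pages')
for site, text in pages.items():
    page = ET.SubElement(root, 'page', domain=site)
    page.text = text

xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```

ET.tostring with encoding='unicode' returns a str; pass a filename to ET.ElementTree(root).write(...) instead to save the result to disk.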