Scraping landing pages of a list of domains


Problem description

I have a reasonably long list of websites that I want to download the landing (index.html or equivalent) pages for. I am currently using Scrapy (much love to the guys behind it -- this is a fabulous framework). Scrapy is slower on this particular task than I'd like, and I am wondering whether wget or another alternative would be faster, given how straightforward the task is. Any ideas?

(Here's what I am doing with Scrapy. Anything I can do to optimize Scrapy for this task?)

So, I have a list of start_urls like:

start_urls = ['google.com', 'yahoo.com', 'aol.com']

And I scrape the text from each response and store this in an XML file. I need to turn off the OffsiteMiddleware to allow for multiple domains.
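
(Roughly what that looks like in the project's settings.py -- a minimal sketch; the OffsiteMiddleware import path below varies between Scrapy versions:)

# settings.py -- disable the offsite filter so requests to any domain go through
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
    # older Scrapy releases use:
    # 'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
}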

Scrapy works as expected, but seems slow (about 1,000 an hour, or roughly one every 4 seconds). Is there a way to speed this up by increasing CONCURRENT_REQUESTS_PER_SPIDER while running a single spider? Anything else?
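
(For context, these knobs live in settings.py. A rough sketch follows -- note that CONCURRENT_REQUESTS_PER_SPIDER exists in older Scrapy releases, while newer ones replace it with CONCURRENT_REQUESTS / CONCURRENT_REQUESTS_PER_DOMAIN:)

# settings.py -- raise download concurrency; setting names depend on the Scrapy version
CONCURRENT_REQUESTS_PER_SPIDER = 64   # older Scrapy
# CONCURRENT_REQUESTS = 64            # newer Scrapy equivalent
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY = 0                    # make sure no artificial per-request delay is set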

Recommended answer

If you want a way to concurrently download multiple sites with Python, you can do so with the standard library like this:

# Python 2 example using only the standard library
import threading
import urllib

maxthreads = 4

sites = ['google.com', 'yahoo.com', ]  # etc.

class Download(threading.Thread):
    def run(self):
        # Each thread keeps pulling domains off the shared list until it is empty.
        global sites
        while sites:
            try:
                site = sites.pop()
            except IndexError:
                break  # another thread took the last site first
            print "start", site
            # Fetch the landing page and save it to a local file named after the domain.
            urllib.urlretrieve('http://' + site, site)
            print "end  ", site

# Start at most maxthreads worker threads.
for x in xrange(min(maxthreads, len(sites))):
    Download().start()

You could also check out httplib2 or PycURL to do the downloading for you instead of urllib.
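
(As a minimal sketch of the httplib2 route -- this assumes httplib2 is installed separately, e.g. with pip:)

import httplib2

h = httplib2.Http('.cache')   # optional on-disk cache directory
resp, content = h.request('http://google.com', 'GET')
print resp.status, len(content)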

I'm not clear exactly how you want the scraped text to look as XML, but you could use xml.etree.ElementTree from the standard library, or you could install BeautifulSoup (which would be better, as it handles malformed markup).
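
(A bare-bones ElementTree sketch; the element and attribute names here are made up, since the desired XML layout isn't specified:)

import xml.etree.ElementTree as ET

root = ET.Element('pages')
page = ET.SubElement(root, 'page', domain='google.com')  # hypothetical structure
page.text = 'scraped text goes here'
ET.ElementTree(root).write('pages.xml', encoding='utf-8')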
