Scrapy approach to scraping multiple URLs

Question

I have a project which requires a great deal of data scraping to be done.

I've been looking at Scrapy, and so far I'm very impressed with it, but I'm looking for the best approach to do the following:

1) I want to scrape multiple URLs and pass in the same variable for each URL to be scraped. For example, let's assume I want to return the top result for the keyword "python" from Bing, Google and Yahoo.

I would want to scrape http://www.google.co.uk/q=python, http://www.yahoo.com?q=python and http://www.bing.com/?q=python (not the actual URLs, but you get the idea).

I can't find a way to specify dynamic URLs using the keyword; the only option I can think of is to generate a file in PHP or something similar that builds the URLs, and then point Scrapy at that file to crawl the links in it.

2) Obviously each search engine has its own markup, so I would need to differentiate between the results to find the corresponding XPath to extract the relevant data from.

3) Lastly, I would like to write the results of the scraped Item to a database (probably Redis), but only once all three URLs have finished scraping; essentially, I want to build up a "profile" from the three search engines and save the output in one transaction.

If anyone has any thoughts on any of these points, I would be very grateful.

Thanks

Answer

1) In the BaseSpider, there is an __init__ method that can be overridden in subclasses. This is where the start_urls and allowed_domains variables are declared. If you have a list of URLs in mind before running the spider, you can insert them dynamically here.
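As a minimal sketch of that pattern (note that BaseSpider has since been renamed scrapy.Spider in current releases, and the query URLs below are the placeholder ones from the question):

```python
import scrapy


class SearchSpider(scrapy.Spider):
    name = "search"

    def __init__(self, keyword="python", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Build start_urls dynamically from the keyword before crawling begins
        # (placeholder query URLs, as in the question).
        self.start_urls = [
            "http://www.google.co.uk/search?q=%s" % keyword,
            "http://www.yahoo.com/search?p=%s" % keyword,
            "http://www.bing.com/search?q=%s" % keyword,
        ]
        self.allowed_domains = ["google.co.uk", "yahoo.com", "bing.com"]
```

Because Scrapy forwards command-line spider arguments to __init__, you could then run scrapy crawl search -a keyword=python to swap the keyword per run.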

For example, in a few of the spiders I have built, I pull preformatted groups of URLs from MongoDB and insert them into the start_urls list in one bulk operation.
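A rough sketch of that kind of seeding, assuming pymongo and a hypothetical scraping.urls collection whose documents carry a url field:

```python
import pymongo
import scrapy


class MongoSeededSpider(scrapy.Spider):
    name = "mongo_seeded"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Pull every preformatted URL from MongoDB in a single query
        # (the database and collection names here are hypothetical).
        client = pymongo.MongoClient("localhost", 27017)
        self.start_urls = [doc["url"] for doc in client["scraping"]["urls"].find()]

    def parse(self, response):
        # Minimal callback so the sketch is runnable.
        yield {"url": response.url}
```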

2) This might be a little more tricky, but you can easily see the crawled URL by looking at the response object (response.url). You should be able to check whether the URL contains 'google', 'bing', or 'yahoo', and then use pre-specified selectors for a URL of that type.
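Continuing the spider sketched above, the parse callback could dispatch on the host; the XPath expressions below are placeholders, since the real ones depend on each engine's current markup:

```python
from urllib.parse import urlparse

import scrapy


class SearchSpider(scrapy.Spider):
    name = "search"
    start_urls = ["http://www.bing.com/search?q=python"]  # placeholder

    def parse(self, response):
        # Route to engine-specific selectors based on the crawled URL.
        host = urlparse(response.url).netloc
        if "google" in host:
            title = response.xpath("//h3/text()").get()      # placeholder XPath
        elif "bing" in host:
            title = response.xpath("//li/h2//text()").get()  # placeholder XPath
        elif "yahoo" in host:
            title = response.xpath("//h3/a/text()").get()    # placeholder XPath
        else:
            return
        yield {"engine": host, "title": title}
```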

3) I am not so sure that #3 is possible, or at least not without some difficulty. As far as I know, the URLs in the start_urls list are not crawled in order, and each arrives in the pipeline independently. I am not sure that, without some serious core hacking, you will be able to collect a group of response objects and pass them into a pipeline together.

However, you might consider serializing the data to disk temporarily and then bulk-saving it to your database later. One of the crawlers I built receives groups of around 10,000 URLs. Rather than making 10,000 single-item database insertions, I store the URLs (and the collected data) as BSON and then insert them into MongoDB later.
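A rough sketch of that pattern as an item pipeline, assuming pymongo and hypothetical scraping.results collection names; for brevity it buffers in memory rather than on disk:

```python
import pymongo


class BufferedMongoPipeline:
    """Buffer scraped items and bulk-insert them when the spider closes."""

    def __init__(self):
        self.buffer = []

    def process_item(self, item, spider):
        # Accumulate items instead of writing one row per item.
        self.buffer.append(dict(item))
        return item

    def close_spider(self, spider):
        # A single insert_many instead of thousands of single-item inserts.
        if self.buffer:
            client = pymongo.MongoClient("localhost", 27017)
            client["scraping"]["results"].insert_many(self.buffer)
            client.close()
```

Enabled through the ITEM_PIPELINES setting, this performs one bulk write at spider close; for very large crawls you would spill the buffered items to a temporary file instead, as described above.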
