Crawl a full domain and load all h1 into one item

Problem description

I am relatively new to Python and Scrapy. What I want to achieve is to crawl a number of websites, mainly company websites: crawl the full domain, extract all the h1, h2 and h3 headings, and create a record that contains the domain name and a string with all the h1/h2/h3 text from that domain. Basically, one domain item with one large string containing all the headers.

I would like the output to be DOMAIN, STRING(h1,h2,h3) - from all the URLs on this domain.

The problem I have is that each URL goes into a separate item. I know I haven't gotten very far, but a hint in the right direction would be very much appreciated. Basically, how do I create an outer loop so that the yield statement keeps going until the next domain is up?

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest
from scrapy.http import Request
from Autotask_Prospecting.items import AutotaskProspectingItem
from Autotask_Prospecting.items import WebsiteItem
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from nltk import clean_html


class MySpider(CrawlSpider):
    name = 'Webcrawler'
    allowed_domains = [ l.strip() for l in open('Domains.txt').readlines() ]
    start_urls = [ l.strip() for l in open('start_urls.txt').readlines() ]


    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(), callback='parse_item'),  
        )

    def parse_item(self, response):
        xpath = HtmlXPathSelector(response)
        loader = XPathItemLoader(item=WebsiteItem(), response=response)
        loader.add_xpath('h1',("//h1/text()"))
        loader.add_xpath('h2',("//h2/text()"))
        loader.add_xpath('h3',("//h3/text()"))
        yield loader.load_item()

Solution

"the yield statement keeps going until the next domain is up"

This cannot be done: requests are processed in parallel, and there is no way to make the crawl handle one domain at a time, serially.

What you can do is write a pipeline that accumulates the headings per domain and yields the entire structure when the spider closes, something like:

# this assumes your Item looks like the following
from scrapy.item import Item, Field

class MyItem(Item):
    domain = Field()
    hs = Field()


import collections

class DomainPipeline(object):

    def __init__(self):
        # one set of headings per domain, filled as items come in
        self.accumulator = collections.defaultdict(set)

    def process_item(self, item, spider):
        self.accumulator[item['domain']].update(item['hs'])
        return item

    def close_spider(self, spider):
        for domain, hs in self.accumulator.items():
            yield MyItem(domain=domain, hs=hs)
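
For the pipeline to actually run it also has to be enabled in the project settings. A minimal sketch, assuming the class above is saved as DomainPipeline in the project's pipelines.py; the dotted path is an assumption based on the project name in the question's imports:

# settings.py -- the dotted path below is an assumption; point it at wherever
# DomainPipeline is actually defined. Very old Scrapy versions expect
# ITEM_PIPELINES to be a plain list of paths rather than a dict.
ITEM_PIPELINES = {
    'Autotask_Prospecting.pipelines.DomainPipeline': 300,
}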

Usage (a quick demonstration of the grouping logic in an interactive shell):

>>> from scrapy.item import Item, Field
>>> class MyItem(Item):
...     domain = Field()
...     hs = Field()
... 
>>> from collections import defaultdict
>>> accumulator = defaultdict(set)
>>> items = []
>>> for i in range(10):
...     items.append(MyItem(domain='google.com', hs=[str(i)]))
... 
>>> items
[{'domain': 'google.com', 'hs': ['0']}, {'domain': 'google.com', 'hs': ['1']}, {'domain': 'google.com', 'hs': ['2']}, {'domain': 'google.com', 'hs': ['3']}, {'domain': 'google.com', 'hs': ['4']}, {'domain': 'google.com', 'hs': ['5']}, {'domain': 'google.com', 'hs': ['6']}, {'domain': 'google.com', 'hs': ['7']}, {'domain': 'google.com', 'hs': ['8']}, {'domain': 'google.com', 'hs': ['9']}]
>>> for item in items:
...     accumulator[item['domain']].update(item['hs'])
... 
>>> accumulator
defaultdict(<type 'set'>, {'google.com': set(['1', '0', '3', '2', '5', '4', '7', '6', '9', '8'])})
>>> for domain, hs in accumulator.items():
...     print MyItem(domain=domain, hs=hs)
... 
{'domain': 'google.com',
 'hs': set(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])}
>>> 
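
For completeness, the spider callback would also have to emit items that carry a domain field along with the collected headings, so the pipeline has something to group on. A rough sketch in the style of the original spider; the XPath union and the urlparse-based domain extraction are my assumptions, not part of the original answer:

from urlparse import urlparse          # Python 2, matching the original code
from scrapy.selector import HtmlXPathSelector

def parse_item(self, response):
    hxs = HtmlXPathSelector(response)
    # collect the text of every h1/h2/h3 on the page in one pass
    hs = hxs.select('//h1/text() | //h2/text() | //h3/text()').extract()
    # key the item by the page's domain so DomainPipeline can merge per domain
    yield MyItem(domain=urlparse(response.url).netloc, hs=hs)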
