Pass input file to scrapy containing a list of domains to be scraped

Problem description

I saw this link [a link] (Pass Scrapy Spider a list of URLs to crawl via .txt file)! That changes the list of start URLs. I want to crawl the webpages for each domain (read from a file) and put the results into a separate file (named after the domain). I have scraped data for one website, but I specified the start URL and allowed_domains in the spider itself. How can I change this using an input file?

Update 1:

Here is the code I tried:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class AppleItem(Item):
    reference_link = Field()
    rss_link = Field()

class AppleSpider(CrawlSpider):

    name = 'apple'
    allowed_domains = []
    start_urls = []

    def __init__(self):
        for line in open('./domains.txt', 'r').readlines():
            self.allowed_domains.append(line)
            self.start_urls.append('http://%s' % line)

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

    def parse_item(self, response):
        sel = HtmlXPathSelector(response)
        rsslinks = sel.select('//a[contains(@href, "pdf")]/@href').extract()
        items = []
        for rss in rsslinks:
            item = AppleItem()
            item['reference_link'] = response.url
            item['rss_link'] = rsslinks
            items.append(item)
        filename = response.url.split("/")[-2]
        open(filename+'.csv', 'wb').write(items)

I get an error when I run this: AttributeError: 'AppleSpider' object has no attribute '_rules'

Recommended answer

You can use the __init__ method of the spider class to read the file and overwrite start_urls and allowed_domains.

Suppose we have a file domains.txt with this content:

example1.com
example2.com
...

Example:

from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = "myspider"
    allowed_domains = []
    start_urls = []

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # One domain per line; strip the trailing newline so the domain
        # names and the generated URLs are well-formed.
        for line in open('./domains.txt', 'r'):
            domain = line.strip()
            if domain:
                self.allowed_domains.append(domain)
                self.start_urls.append('http://%s' % domain)

    def parse(self, response):
        # Here you get the data by parsing the page, then put it into a
        # separate file per domain (adapted from the Scrapy tutorial:
        # http://doc.scrapy.org/en/latest/intro/tutorial.html).
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(your_data)
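
As a side note on the question's traceback: CrawlSpider.__init__ is what compiles the rules list into the private _rules attribute, so overriding __init__ without calling the parent constructor is the likely cause of the AttributeError. Below is a minimal sketch of the question's spider with that call added; it keeps the question's scrapy.contrib imports (deprecated in later Scrapy releases) and writes one CSV per domain with Python's csv module. The host-based filename assumes start URLs of the form http://domain/... :

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
import csv

class AppleSpider(CrawlSpider):
    name = 'apple'
    allowed_domains = []
    start_urls = []
    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

    def __init__(self, *args, **kwargs):
        # CrawlSpider.__init__ compiles `rules` into `_rules`; skipping
        # this call is what raises the AttributeError from the question.
        super(AppleSpider, self).__init__(*args, **kwargs)
        with open('./domains.txt') as f:
            for line in f:
                domain = line.strip()
                if domain:
                    self.allowed_domains.append(domain)
                    self.start_urls.append('http://%s' % domain)

    def parse_item(self, response):
        sel = HtmlXPathSelector(response)
        pdf_links = sel.select('//a[contains(@href, "pdf")]/@href').extract()
        # Append one row per link to "<host>.csv" so that each domain's
        # results land in a separate file, as the question asks.
        host = response.url.split("/")[2]
        with open(host + '.csv', 'ab') as f:
            writer = csv.writer(f)
            for link in pdf_links:
                writer.writerow([response.url, link])

Each crawled page then appends rows of (page URL, PDF link) to the CSV file named after its domain.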
