scrape multiple addresses from multiple files in scrapy


Problem description


I have some JSON files in a directory. In each of these files there is some information I need. The first property I need is the list of links for "start_urls" in scrapy.

Every file is for a different process, so its output must be separate. That means I can't put the links from all the JSON files into start_urls and run them together; I have to run the spider for every file.

How can I do this? Here is my code so far:

import scrapy
from os import listdir
from os.path import isfile, join
import json
class HotelInfoSpider(scrapy.Spider):
    name = 'hotel_info'
    allowed_domains = ['lastsecond.ir']
    # get start urls from links list of every file
    files = [f for f in listdir('lastsecond/hotels/') if
             isfile(join('lastsecond/hotels/', f))]
    with open('lastsecond/hotels/' + files[0], 'r') as hotel_info:
        hotel = json.load(hotel_info)
    start_urls = hotel["links"]

    def parse(self, response):
        print("all good")

Solution

I see two methods:


First:

Run the spider many times with different parameters. This needs less code.

You can create a batch file with many lines, adding the different arguments manually (or generate the commands with a small script, as sketched after the commands below).

The first argument is the output filename, -o result1.csv, which scrapy creates automatically.
The second argument is the input filename with the links, -a filename=process1.csv.

scrapy crawl hotel_info -o result1.csv -a filename=process1.csv
scrapy crawl hotel_info -o result2.csv -a filename=process2.csv
scrapy crawl hotel_info -o result3.csv -a filename=process3.csv
...
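If you don't want to maintain that batch file by hand, a small driver script can generate and run one command per input file. This is only a sketch under the same assumptions as the question (JSON files in lastsecond/hotels/, spider name hotel_info); it uses subprocess, so every crawl gets its own scrapy process and its own output file.

import subprocess
from os import listdir
from os.path import isfile, join

HOTELS_DIR = 'lastsecond/hotels/'  # path taken from the question

files = [f for f in listdir(HOTELS_DIR) if isfile(join(HOTELS_DIR, f))]

for i, input_file in enumerate(files):
    output_file = 'result{}.csv'.format(i)
    # each crawl runs as its own scrapy process, so the outputs stay separate
    subprocess.run(['scrapy', 'crawl', 'hotel_info',
                    '-o', output_file,
                    '-a', 'filename=' + input_file],
                   check=True)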

The spider then only needs to read filename in __init__:

import scrapy
from os.path import isfile, join
import json

class HotelInfoSpider(scrapy.Spider):

    name = 'hotel_info'

    allowed_domains = ['lastsecond.ir']

    def __init__(self, filename, *args, **kwargs): # <-- filename
        super().__init__(*args, **kwargs)

        filename = join('lastsecond/hotels/', filename) 

        if isfile(filename):
            with open(filename) as f:
                data = json.load(f)
                self.start_urls = data['links']

    def parse(self, response):
        print('url:', response.url)

        yield {'url': response.url, 'other': ...}
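
Judging from data['links'] in __init__, each input JSON file is assumed to be an object with a "links" key holding the list of start URLs, e.g. {"links": ["https://lastsecond.ir/...", ...]}.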

You can also use a Python script with CrawlerProcess to run the spider many times.

from scrapy.crawler import CrawlerProcess
from os import listdir
from os.path import isfile, join
import json

# import the spider class from wherever it is defined;
# here it is assumed the spider lives in hotel_info.py
from hotel_info import HotelInfoSpider

files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

for i, input_file in enumerate(files):
    output_file = 'result{}.csv'.format(i)
    c = CrawlerProcess({'FEED_FORMAT': 'csv','FEED_URI': output_file})
    c.crawl(HotelInfoSpider, filename=input_file) #input_file='process1.csv')
    c.start()
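
Note that the Twisted reactor behind CrawlerProcess can be started only once per Python process, so if the second c.start() fails with ReactorNotRestartable you will need to launch each crawl in its own operating-system process instead (as the batch-file/subprocess approach above does).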

Or use scrapy.cmdline.execute():

import scrapy.cmdline
from os import listdir
from os.path import isfile, join
import json

files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

for i, input_file in enumerate(files):
    output_file = 'result{}.csv'.format(i)
    scrapy.cmdline.execute(["scrapy", "crawl", "hotel_info", "-o", output_file, "-a", "filename=" + input_file])


Second:

It needs more code because you have to create a pipeline exporter which uses a different file for each set of results.

You have to use start_requests() and Request(..., meta=...) to create the start requests instead of start_urls; each request carries extra data in meta, which you can use later to save results to different files.

In parse() you have to read this extra value from meta and add it to the item.

In the pipeline exporter you have to read extra from the item and open a different file for each value.

import scrapy
from os import listdir
from os.path import isfile, join
import json

class HotelInfoSpider(scrapy.Spider):

    name = 'hotel_info'

    allowed_domains = ['lastsecond.ir']

    def start_requests(self):

        # get start urls from links list of every file
        files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

        for i, filename in enumerate(files):
            with open('lastsecond/hotels/' + filename) as f:
                data = json.load(f)
                links = data["links"]
                for url in links:
                    yield scrapy.Request(url, meta={'extra': i})

    def parse(self, response):
        print('url:', response.url)
        extra = response.meta['extra']
        print('extra:', extra)

        yield {'url': response.url, 'extra': extra, 'other': ...}
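
Here extra is simply the index of the input file (from enumerate(files)), so every item scraped from the links of one file is written by the pipeline below to the file whose name contains that index (result0.csv, result1.csv, ...).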

pipelines.py

import csv

class MyExportPipeline(object):

    def process_item(self, item, spider):

        # get extra and use it in filename
        filename = 'result{}.csv'.format(item['extra'])

        # open file for appending
        with open(filename, 'a') as f:
            writer = csv.writer(f)

            # write only selected elements - skip `extra`
            row = [item['url'], item['other']]
            writer.writerow(row)

        return item

settings.py

ITEM_PIPELINES = {
   'your_project_name.pipelines.MyExportPipeline': 300,
}
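
With this pipeline enabled you run the spider only once, without -o (for example `scrapy crawl hotel_info`), and the pipeline appends each item to the result file that matches its input file.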
