scrape multiple addresses from multiple files in scrapy

Question
I have some JSON files in a directory. Each of these files contains some information I need. The first property I need is the list of links for "start_urls" in scrapy.
Every file is for a different process, so its output must be kept separate. That means I can't put the links from all the JSON files into start_urls and run them together; I have to run the spider once per file.
How can I do this? Here is my code so far:
import scrapy
from os import listdir
from os.path import isfile, join
import json

class HotelInfoSpider(scrapy.Spider):
    name = 'hotel_info'
    allowed_domains = ['lastsecond.ir']

    # get start urls from links list of every file
    files = [f for f in listdir('lastsecond/hotels/')
             if isfile(join('lastsecond/hotels/', f))]
    with open('lastsecond/hotels/' + files[0], 'r') as hotel_info:
        hotel = json.load(hotel_info)
    start_urls = hotel["links"]

    def parse(self, response):
        print("all good")
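For reference, a minimal sketch of the JSON shape this spider assumes: each file in lastsecond/hotels/ holds a top-level "links" list. The file name and URLs below are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical input file: every JSON file in lastsecond/hotels/
# is assumed to look like this, with a top-level "links" list.
sample = {
    "links": [
        "https://lastsecond.ir/hotels/example-hotel-1",
        "https://lastsecond.ir/hotels/example-hotel-2",
    ]
}

# Write the sample to a temporary file and read it back the same
# way the spider does.
path = os.path.join(tempfile.mkdtemp(), "process1.json")
with open(path, "w") as f:
    json.dump(sample, f)

with open(path) as f:
    start_urls = json.load(f)["links"]

print(len(start_urls))  # 2
```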
I see two methods.
First:
Run the spider many times with different parameters. It needs less code.
You can create a batch file with many lines, adding the different arguments manually.
The first argument is the output filename, -o result1.csv, which scrapy will create automatically.
The second argument is the input filename with the links, -a filename=process1.csv.
scrapy crawl hotel_info -o result1.csv -a filename=process1.csv
scrapy crawl hotel_info -o result2.csv -a filename=process2.csv
scrapy crawl hotel_info -o result3.csv -a filename=process3.csv
...
The spider only needs to read filename in __init__:
import scrapy
from os.path import isfile, join
import json

class HotelInfoSpider(scrapy.Spider):
    name = 'hotel_info'
    allowed_domains = ['lastsecond.ir']

    def __init__(self, filename, *args, **kwargs):  # <-- filename
        super().__init__(*args, **kwargs)
        filename = join('lastsecond/hotels/', filename)
        if isfile(filename):
            with open(filename) as f:
                data = json.load(f)
            self.start_urls = data['links']

    def parse(self, response):
        print('url:', response.url)
        yield {'url': response.url, 'other': ...}
You can also use a Python script with CrawlerProcess to run the spider many times.
from scrapy.crawler import CrawlerProcess
from os import listdir
from os.path import isfile, join

# import your spider class here, e.g.
# from myproject.spiders.hotel_info import HotelInfoSpider

files = [f for f in listdir('lastsecond/hotels/')
         if isfile(join('lastsecond/hotels/', f))]

for i, input_file in enumerate(files):
    output_file = 'result{}.csv'.format(i)

    c = CrawlerProcess({'FEED_FORMAT': 'csv', 'FEED_URI': output_file})
    c.crawl(HotelInfoSpider, filename=input_file)  # input_file='process1.csv'
    # note: Twisted's reactor can only be started once per process, so only
    # the first iteration will run this way; launch each crawl in its own
    # process if you need the whole loop.
    c.start()
Or use scrapy.cmdline.execute():
import scrapy.cmdline
from os import listdir
from os.path import isfile, join

files = [f for f in listdir('lastsecond/hotels/')
         if isfile(join('lastsecond/hotels/', f))]

for i, input_file in enumerate(files):
    output_file = 'result{}.csv'.format(i)
    # note: execute() raises SystemExit when the crawl finishes, so this
    # loop only completes its first iteration in a single process.
    scrapy.cmdline.execute(["scrapy", "crawl", "hotel_info",
                            "-o", output_file,
                            "-a", "filename=" + input_file])
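Because the reactor can only start once and execute() exits the process, one workaround is to launch each crawl in its own OS process. A minimal sketch, assuming the same hypothetical lastsecond/hotels/ layout and result-file naming as above:

```python
import subprocess
from os import listdir
from os.path import isfile, join

def crawl_command(output_file, input_file):
    # Same CLI flags as the batch-file variant above.
    return ["scrapy", "crawl", "hotel_info",
            "-o", output_file,
            "-a", "filename=" + input_file]

def run_all(folder='lastsecond/hotels/'):
    files = sorted(f for f in listdir(folder) if isfile(join(folder, f)))
    for i, input_file in enumerate(files):
        # Each crawl gets a fresh process, so the Twisted reactor's
        # "start once" restriction never applies.
        subprocess.run(crawl_command('result{}.csv'.format(i), input_file),
                       check=True)
```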
Second:
It needs more code because you have to create a pipeline exporter which will use different files to save the results.
You have to use start_requests() and Request(..., meta=...) to create the start_urls requests, carrying extra data in meta that you can use later to save to different files.
In parse() you have to get this extra from meta and add it to the item.
In the pipeline exporter you have to get extra from the item and open a different file.
import scrapy
from os import listdir
from os.path import isfile, join
import json

class HotelInfoSpider(scrapy.Spider):
    name = 'hotel_info'
    allowed_domains = ['lastsecond.ir']

    def start_requests(self):
        # get start urls from links list of every file
        files = [f for f in listdir('lastsecond/hotels/')
                 if isfile(join('lastsecond/hotels/', f))]
        for i, filename in enumerate(files):
            with open('lastsecond/hotels/' + filename) as f:
                data = json.load(f)
            links = data["links"]
            for url in links:
                yield scrapy.Request(url, meta={'extra': i})

    def parse(self, response):
        print('url:', response.url)
        extra = response.meta['extra']
        print('extra:', extra)
        yield {'url': response.url, 'extra': extra, 'other': ...}
pipelines.py
import csv

class MyExportPipeline(object):
    def process_item(self, item, spider):
        # get extra and use it in the filename
        filename = 'result{}.csv'.format(item['extra'])
        # open file for appending
        with open(filename, 'a') as f:
            writer = csv.writer(f)
            # write only selected elements - skip `extra`
            row = [item['url'], item['other']]
            writer.writerow(row)
        return item
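A quick standalone check of the pipeline logic, with hypothetical items and URLs; the pipeline class is repeated here so the snippet runs on its own, outside scrapy:

```python
import csv
import os
import tempfile

class MyExportPipeline(object):
    # Same logic as pipelines.py above, duplicated for a self-contained demo.
    def process_item(self, item, spider):
        filename = 'result{}.csv'.format(item['extra'])
        with open(filename, 'a') as f:
            writer = csv.writer(f)
            writer.writerow([item['url'], item['other']])
        return item

os.chdir(tempfile.mkdtemp())  # write the result files somewhere disposable
pipeline = MyExportPipeline()
items = [
    {'url': 'https://lastsecond.ir/a', 'extra': 0, 'other': 'x'},
    {'url': 'https://lastsecond.ir/b', 'extra': 1, 'other': 'y'},
    {'url': 'https://lastsecond.ir/c', 'extra': 0, 'other': 'z'},
]
for item in items:
    pipeline.process_item(item, spider=None)

# Items with extra=0 landed in result0.csv, extra=1 in result1.csv.
with open('result0.csv') as f:
    rows = list(csv.reader(f))
print(rows)
```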
settings.py
ITEM_PIPELINES = {
    'your_project_name.pipelines.MyExportPipeline': 300,
}