How to loop through multiple URLs to scrape from a CSV file in Scrapy?


Problem Description


My code for scraping data from the Alibaba website:

import scrapy

class IndiamartSpider(scrapy.Spider):
    name = 'alibot'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/showroom/acrylic-wine-box_4.html']


    def parse(self, response):
        Title = response.xpath('//*[@class="title three-line"]/a/@title').extract()
        Price = response.xpath('//div[@class="price"]/b/text()').extract()
        Min_order = response.xpath('//div[@class="min-order"]/b/text()').extract()
        Response_rate = response.xpath('//i[@class="ui2-icon ui2-icon-skip"]/text()').extract()

        for item in zip(Title,Price,Min_order,Response_rate):
            scraped_info = {
                'Title':item[0],
                'Price': item[1],
                'Min_order':item[2],
                'Response_rate':item[3]

            }
            yield scraped_info


Notice the start URL: the spider only scrapes the given URL, but I want this code to scrape all the URLs present in my CSV file. My CSV file contains a large number of URLs. Sample of the data.csv file:

'https://www.alibaba.com/showroom/shock-absorber.html',
'https://www.alibaba.com/showroom/shock-wheel.html',
'https://www.alibaba.com/showroom/shoes-fastener.html',
'https://www.alibaba.com/showroom/shoes-women.html',
'https://www.alibaba.com/showroom/shoes.html',
'https://www.alibaba.com/showroom/shoulder-long-strip-bag.html',
'https://www.alibaba.com/showroom/shower-hair-band.html',
...........


How do I import all the links from the CSV file into the code at once?

Answer


To correctly loop through a file without loading all of it into memory, you should use generators, since both file objects and the start_requests method in Python/Scrapy are generators:

import scrapy


class MySpider(scrapy.Spider):
    name = 'csv'

    def start_requests(self):
        with open('file.csv') as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                # strip the trailing newline so the URL is valid
                yield scrapy.Request(line.strip())
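One caveat: the sample data.csv above wraps each URL in single quotes with a trailing comma, which `Request(line)` would not accept as-is. A minimal sketch of a cleaning step for that exact format (`clean_url` and the `lines` list are illustrative, not part of the original answer):

```python
def clean_url(line: str) -> str:
    """Strip whitespace, the trailing comma, and surrounding quotes
    from one line shaped like the sample data.csv."""
    return line.strip().rstrip(',').strip("'\"")


# Example lines shaped like the sample file:
lines = [
    "'https://www.alibaba.com/showroom/shock-absorber.html',\n",
    "'https://www.alibaba.com/showroom/shoes.html',\n",
]
urls = [clean_url(l) for l in lines if l.strip()]
```

Inside the spider you would then call `yield scrapy.Request(clean_url(line))` instead of passing the raw line.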


To explain further: the Scrapy engine uses start_requests to generate requests as it goes. It will keep generating requests until the concurrent request limit is full (see settings like CONCURRENT_REQUESTS).
Also worth noting: by default Scrapy crawls depth-first - newer requests take priority - so the start_requests loop will be the last to finish.
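If you would rather have the CSV URLs crawled roughly in file order instead of depth-first, Scrapy documents a breadth-first (FIFO) configuration. A settings.py fragment, as described in the Scrapy FAQ:

```python
# settings.py (fragment): switch Scrapy's scheduler from its default
# LIFO (depth-first) queues to FIFO (breadth-first) queues.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```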

