Pass Scrapy Spider a list of URLs to crawl via .txt file


Problem description

I'm a little new to Python and very new to Scrapy.

I've set up a spider to crawl and extract all the information I need. However, I need to pass a .txt file of URLs to the start_urls variable.

For example:

from scrapy.spider import BaseSpider  # BaseSpider is the pre-1.0 Scrapy base class

class LinkChecker(BaseSpider):
    name = 'linkchecker'
    start_urls = []  # Here I want to start crawling a list of URLs read from a text file I pass via the command line.

I've done a little bit of research and keep coming up empty handed. I've seen this type of example (How to pass a user defined argument in scrapy spider), but I don't think that will work for passing a text file.

Recommended answer

Run your spider with the -a option, like:

scrapy crawl myspider -a filename=text.txt

Then, in the spider's __init__ method, read the file and define start_urls:

from scrapy.spider import BaseSpider  # pre-1.0 Scrapy; newer versions use scrapy.Spider

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, filename=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if filename:
            # One URL per line; strip newlines and skip blank lines
            with open(filename, 'r') as f:
                self.start_urls = [line.strip() for line in f if line.strip()]
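
For example, if text.txt contains one URL per line (these URLs are just placeholders):

http://example.com/page1
http://example.com/page2

then scrapy crawl myspider -a filename=text.txt starts the crawl from those two URLs.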

Hope that helps.
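
As a side note, in newer Scrapy versions (1.0+) the same idea is usually written against scrapy.Spider, and you can yield requests lazily from start_requests() instead of building start_urls up front. A minimal sketch, using the same hypothetical filename argument:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, filename=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.filename = filename

    def start_requests(self):
        # Yield one Request per non-blank line in the file
        with open(self.filename) as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        pass  # extraction logic goes here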

