Scrapy read list of URLs from file to scrape?


Problem Description


I've just installed Scrapy and followed their simple dmoz tutorial, which works. I just looked up basic file handling for Python and tried to get the crawler to read a list of URLs from a file, but got some errors. This is probably wrong, but I gave it a shot. Would someone please show me an example of reading a list of URLs into Scrapy? Thanks in advance.

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    f = open("urls.txt")
    start_urls = f

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

Answer

You were pretty close.

f = open("urls.txt")
start_urls = [url.strip() for url in f.readlines()]
f.close()

...better still would be to use a context manager to ensure the file is closed as expected:

with open("urls.txt", "rt") as f:
    start_urls = [url.strip() for url in f.readlines()]
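As a small, self-contained sketch of the file-reading part (independent of Scrapy itself, and assuming a hypothetical `urls.txt` written inline here for illustration), it can also help to skip blank lines so empty strings never end up in `start_urls`:

```python
from pathlib import Path

# Hypothetical urls.txt contents, including a blank line and trailing
# whitespace, so the example is self-contained.
Path("urls.txt").write_text("https://example.com/a\n\nhttps://example.com/b \n")

# Iterating the file object directly yields lines; strip() removes the
# trailing newline, and the `if` clause drops blank lines entirely.
with open("urls.txt", "rt") as f:
    start_urls = [line.strip() for line in f if line.strip()]

print(start_urls)  # ['https://example.com/a', 'https://example.com/b']
```

Iterating over `f` directly is equivalent to `f.readlines()` here but avoids loading the whole file into memory at once.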

