Scrapy read list of URLs from file to scrape?
Question
I've just installed scrapy and followed their simple dmoz tutorial, which works. I then looked up basic file handling for Python and tried to get the crawler to read a list of URLs from a file, but got some errors. This is probably wrong, but I gave it a shot. Would someone please show me an example of reading a list of URLs into scrapy? Thanks in advance.
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    f = open("urls.txt")
    start_urls = f

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
Answer
You were pretty close.
f = open("urls.txt")
start_urls = [url.strip() for url in f.readlines()]
f.close()
...better still would be to use the context manager to ensure the file's closed as expected:
with open("urls.txt", "rt") as f:
start_urls = [url.strip() for url in f.readlines()]
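For completeness, here is a minimal sketch of how that file read might sit inside the spider from the question; it assumes a file named urls.txt in the directory Scrapy is run from, with one URL per line:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]

    # Read one URL per line, stripping trailing newlines/whitespace
    # and skipping blank lines, so start_urls holds only real URLs.
    with open("urls.txt", "rt") as f:
        start_urls = [url.strip() for url in f.readlines() if url.strip()]

    def parse(self, response):
        # Save each page body to a file named after the second-to-last
        # path segment, as in the question's original code.
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as out:
            out.write(response.body)

Note that newer Scrapy versions expose the base class as scrapy.Spider rather than scrapy.spider.BaseSpider; the file-reading idea is the same either way.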