URL text file not found when deployed to Scraping Hub and spider run
Problem Description
My spider relies on a .txt file that contains the URLs the spider visits. I have placed that file in the same directory the spider code is located, and in every directory above it (the Hail Mary approach); the end result is this:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/app/__main__.egg/CCSpider1/spiders/cc_1_spider.py", line 41, in start_requests
    for line in fileinput.input({url_file}):
  File "/usr/local/lib/python2.7/fileinput.py", line 237, in next
    line = self._readline()
  File "/usr/local/lib/python2.7/fileinput.py", line 339, in _readline
    self._file = open(self._filename, self._mode)
IOError: [Errno 2] No such file or directory: 'url_list_20171028Z.txt'
Question

How do I ensure that url_list_20171028Z.txt is always found when I run my spider? This URL text file updates every day (a new one is stamped with the next day's date -- e.g. url_list_20171029Z.txt, etc.).
Thank you for taking a crack at my issue. I am new to Python (started learning in June 2017) and I am taking on this scraping project for fun and as a learning experience. I only started using Scrapy recently (October 2017), so apologies for any blatant simplicity passing over my head.
This project has been uploaded to the Scraping Hub website. This issue pops up when I try to run my spider from the Scraping Hub dashboard. The deployment of the spider was successful, and I made a requirements.txt file to download the Pandas package used in my spider.
The code below is where the URL text file is called. I reworked the default spider created when a new project is started. When I run the spider on my own computer, it operates as desired. Here is the portion of code that calls on the 'url_list_20171028Z.txt' file to get the URLs to pull data from:
# requires: import fileinput, import scrapy, and: from time import gmtime, strftime
def start_requests(self):
    s_time = strftime("%Y%m%d", gmtime())
    url_file = 'url_list_{0}Z.txt'.format(s_time)
    # fileinput.input() takes an iterable of file names; a list is more conventional than a set
    for line in fileinput.input([url_file]):
        url = line.strip()
        yield scrapy.Request(url=url, callback=self.parse)
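For local debugging, a variant of the same file-reading logic that checks for the dated file first and reports the absolute path it searched can make this failure easier to diagnose. This is only a sketch; the helper names are hypothetical, and the file-name pattern is the one from the question:

```python
import os
from time import gmtime, strftime

def todays_url_file():
    # Same naming scheme as in the question: url_list_YYYYMMDDZ.txt (UTC date)
    return 'url_list_{0}Z.txt'.format(strftime("%Y%m%d", gmtime()))

def read_url_file(url_file):
    # Fail with the absolute path, so it is obvious which directory was searched
    if not os.path.isfile(url_file):
        raise IOError("URL file not found: {0}".format(os.path.abspath(url_file)))
    with open(url_file) as f:
        return [line.strip() for line in f if line.strip()]
```

Run locally, this would have shown immediately that the deployed spider's working directory does not contain the text file at all, which is what the answer below addresses.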
Thank you very much for taking the time to help me with this issue. If you need me to add any more information, please let me know. Thanks!
Recommended Answer
You need to declare the files in the package_data section of your setup.py file.
For example, if your Scrapy project has the following structure:
myproject/
    __init__.py
    settings.py
    resources/
        cities.txt
scrapy.cfg
setup.py
You would use the following in your setup.py to include the cities.txt file:
from setuptools import setup, find_packages

setup(
    name='myproject',
    version='1.0',
    packages=find_packages(),
    package_data={
        'myproject': ['resources/*.txt'],
    },
    entry_points={
        'scrapy': ['settings = myproject.settings'],
    },
    zip_safe=False,
)
Note that the zip_safe flag is set to False, as this may be needed in some cases.
Now you can access the cities.txt file content from settings.py like this:
import pkgutil
data = pkgutil.get_data("myproject", "resources/cities.txt")
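Applied to the question's spider, the same pkgutil approach lets start_requests load the dated URL file from inside the deployed egg rather than from the working directory. A minimal sketch, assuming the package is named CCSpider1 (taken from the traceback) and the daily files are bundled under resources/ via package_data as above; the helper names are hypothetical:

```python
import pkgutil
from time import gmtime, strftime

def todays_resource():
    # Today's resource path inside the package, e.g. 'resources/url_list_20171028Z.txt' (UTC date)
    return 'resources/url_list_{0}Z.txt'.format(strftime("%Y%m%d", gmtime()))

def parse_urls(raw):
    # Decode the bytes returned by pkgutil.get_data() and drop blank lines
    return [line.strip() for line in raw.decode('utf-8').splitlines() if line.strip()]

# Inside the spider, start_requests() would then read the bundled file
# instead of touching the filesystem, roughly:
#     raw = pkgutil.get_data("CCSpider1", todays_resource())
#     for url in parse_urls(raw):
#         yield scrapy.Request(url=url, callback=self.parse)
```

Since the daily files match the resources/*.txt glob in package_data, each day's new file is picked up automatically at the next deploy; the project would still need to be redeployed whenever a new day's file is added.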