URL text file not found when deployed to Scraping Hub and spider run

Problem Description

My spider relies on a .txt file that contains the URLs the spider visits. I placed that file in the same directory as the spider code, and in every directory above it (the Hail Mary approach); the end result is this:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/app/__main__.egg/CCSpider1/spiders/cc_1_spider.py", line 41, in start_requests
    for line in fileinput.input({url_file}):
  File "/usr/local/lib/python2.7/fileinput.py", line 237, in next
    line = self._readline()
  File "/usr/local/lib/python2.7/fileinput.py", line 339, in _readline
    self._file = open(self._filename, self._mode)
IOError: [Errno 2] No such file or directory: 'url_list_20171028Z.txt' 

Question

How do I ensure that url_list_20171028Z.txt is always found when I run my spider? This URL text file is updated every day (a new one is stamped with the next day's date -- e.g. url_list_20171029Z.txt, and so on).

Thank you for taking a crack at my issue. I am new to Python (I started learning in June 2017) and I took on this scraping project for fun and as a learning experience. I only started using Scrapy recently (October 2017), so apologies if something blatantly simple is passing over my head.

This project has been uploaded to the Scraping Hub website. The issue pops up when I try to run my spider from the Scraping Hub dashboard. The deployment of the spider was successful, and I made a requirements.txt file to download the Pandas package used in my spider.
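
For reference, that requirements.txt can be as small as a single line naming the extra package the spider imports (a pinned version can be added if reproducible builds are wanted):

pandas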

The code below is where the URL text file is read. I reworked the default spider that is generated when a new project is started. When I run the spider on my own computer, it operates as desired. Here is the portion of code that reads the 'url_list_20171028Z.txt' file to get the URLs to scrape data from:

def start_requests(self):
    # Build the dated filename, e.g. url_list_20171028Z.txt
    s_time = strftime("%Y%m%d", gmtime())
    url_file = 'url_list_{0}Z.txt'.format(s_time)
    # Read each URL from the text file and request it
    for line in fileinput.input({url_file}):
        url = str.strip(line)
        yield scrapy.Request(url=url, callback=self.parse)

Thank you very much for taking the time to help me with this issue. If you need me to add any more information, please let me know! Thank you!

Recommended Answer

You need to declare the files in the package_data section of your setup.py file.

For example, if your Scrapy project has the following structure:

myproject/
  __init__.py
  settings.py
  resources/
    cities.txt
scrapy.cfg
setup.py

You would use the following in your setup.py to include the cities.txt file:

from setuptools import setup, find_packages

setup(
    name='myproject',
    version='1.0',
    packages=find_packages(),
    package_data={
        'myproject': ['resources/*.txt']
    },
    entry_points={
        'scrapy': ['settings = myproject.settings']
    },
    zip_safe=False,
)
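
Adapted to the project from the question, a setup.py might look like the sketch below. This is only an assumption-laden example: it supposes the daily url_list_*Z.txt files are kept in a resources/ directory inside the CCSpider1 package (the package name is taken from the traceback), and that CCSpider1.settings is the project's settings module.

from setuptools import setup, find_packages

setup(
    name='CCSpider1',
    version='1.0',
    packages=find_packages(),
    package_data={
        # Ship every daily URL list found in CCSpider1/resources/ (assumed location)
        'CCSpider1': ['resources/url_list_*Z.txt'],
    },
    entry_points={
        'scrapy': ['settings = CCSpider1.settings'],
    },
    zip_safe=False,
)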

Note that the zip_safe flag is set to False, as this may be needed in some cases.

Now you can access the cities.txt file content from settings.py like this:

import pkgutil

# Read resources/cities.txt from inside the installed myproject package
data = pkgutil.get_data("myproject", "resources/cities.txt")
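
Applied to the spider from the question, start_requests can then read the daily file through pkgutil instead of fileinput, so it no longer depends on the working directory on Scraping Hub. This is a sketch under the same assumptions as above (URL files shipped under CCSpider1/resources/ and declared in package_data); the class name is a placeholder.

import pkgutil
from time import gmtime, strftime

import scrapy


class CcSpider1(scrapy.Spider):
    name = 'cc_1_spider'

    def start_requests(self):
        # Build the dated filename, e.g. url_list_20171028Z.txt
        s_time = strftime('%Y%m%d', gmtime())
        url_file = 'resources/url_list_{0}Z.txt'.format(s_time)
        # Read the file from inside the deployed package rather than from disk
        data = pkgutil.get_data('CCSpider1', url_file)
        for line in data.decode('utf-8').splitlines():
            url = line.strip()
            if url:
                yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Parsing logic from the original spider goes here
        pass

With this, the URL list travels inside the deployed egg together with the spider code, which is why the file can be found on Scraping Hub as well as on the local machine.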
