Request had insufficient authentication scopes (403) when trying to write crawled data to BigQuery from a Scrapy pipeline


Problem Description

I'm trying to build a Scrapy crawler: the spider crawls the data, and then in pipeline.py the data is saved to BigQuery. I built it with Docker, set up a crontab job, and pushed it to a Google Cloud server to run daily.

The problem is that when crontab executes the Scrapy crawler, it fails with: "google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/bigquery/v2/projects/project_name/datasets/dataset_name/tables/table_name: Request had insufficient authentication scopes.".

For more detail: when I attach to the container (docker exec -it .../bin/bash) and run the spider manually (scrapy crawl spider_name), it works like a charm and the data shows up in BigQuery.

I use a service account (JSON key file) with the bigquery.admin role to set GOOGLE_APPLICATION_CREDENTIALS.
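
A quick way to check which credentials the client library actually picks up in a given environment, and whether the variable above is visible there, is a small diagnostic script like the one below (a minimal sketch, not part of the original project; it only assumes google-auth, which google-cloud-bigquery already depends on). Running it once from an interactive shell and once from the cron job makes the two contexts easy to compare, since cron starts jobs with a minimal environment that may not include variables set via ENV in the Dockerfile:

# check_credentials.py - hypothetical diagnostic script, not part of the original project
import os

import google.auth

# Is the key file path visible in this environment?
print("GOOGLE_APPLICATION_CREDENTIALS =",
      os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"))

# google.auth.default() resolves credentials the same way bigquery.Client() does:
# the environment variable first, otherwise the GCE metadata server, whose access
# token may lack the BigQuery scope and thus produce exactly this kind of 403.
credentials, project = google.auth.default()
print("credentials type:", type(credentials).__name__)
print("project:", project)

If the script prints the service account credentials in the interactive shell but something else (or nothing) under cron, the problem is the environment rather than the pipeline code below.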

# spider file is fine

# pipeline.py
from google.cloud import bigquery
import logging
from scrapy.exceptions import DropItem
...

class SpiderPipeline(object):
    def __init__(self):

        # BIGQUERY
        # Setup GOOGLE_APPLICATION_CREDENTIALS in docker file
        self.client = bigquery.Client()
        table_ref = self.client.dataset('dataset').table('data')
        self.table = self.client.get_table(table_ref)

    def process_item(self, item, spider):
        if item['key']:

            # BIGQUERY
            # Column order: key, source, lang, created, previous_price, lastest_price, rating, review_no, booking_no
            rows_to_insert = [(item['key'], item['source'], item['lang'])]
            # insert_rows() returns an empty list on success, or a list of per-row errors
            error = self.client.insert_rows(self.table, rows_to_insert)
            if error == []:
                logging.debug('...Save data to bigquery {}...'.format(item['key']))
                # raise DropItem("Missing %s!" % item)
            else:
                logging.error('[Error upload to Bigquery]: {}'.format(error))

            return item
        raise DropItem("Missing %s!" % item)

In the Dockerfile:

FROM python:3.5-stretch

WORKDIR /app

COPY requirements.txt ./

RUN pip install --trusted-host pypi.python.org -r requirements.txt

COPY . /app

# For Bigquery
# key.json is already in right location
ENV GOOGLE_APPLICATION_CREDENTIALS='/app/key.json'

# Sheduler cron

RUN apt-get update && apt-get -y install cron

# Add crontab file in the cron directory
ADD crontab /etc/cron.d/s-cron

# Give execution rights on the cron job
RUN chmod 0644 /etc/cron.d/s-cron

# Apply cron job
RUN crontab /etc/cron.d/s-cron

# Create the log file to be able to run tail
RUN touch /var/log/cron.log

# Run the command on container startup
CMD cron && tail -f /var/log/cron.log

In the crontab file:

# Run once every day at midnight. Need empty line at the end to run.
0 0 * * * cd /app && /usr/local/bin/scrapy crawl spider >> /var/log/cron.log 2>&1

In short: how do I get the crawler to run from crontab without the 403 error? Thanks a lot for any help.

Recommended Answer

I suggest you load the service account directly in your code instead of relying on the environment variable, like this:

from google.cloud import bigquery
service_account_file_path = "/app/key.json"  # path to your service account key file
client = bigquery.Client.from_service_account_json(service_account_file_path)

The rest of the code can stay the same as the version you already verified is working.
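
Applied to the pipeline from the question, that would look roughly like this (a sketch that simply combines the answer with the original __init__; the key path is the one already set in the Dockerfile):

from google.cloud import bigquery

class SpiderPipeline(object):
    def __init__(self):
        # Load the service account key explicitly, instead of relying on
        # GOOGLE_APPLICATION_CREDENTIALS being visible in cron's environment.
        self.client = bigquery.Client.from_service_account_json('/app/key.json')
        table_ref = self.client.dataset('dataset').table('data')
        self.table = self.client.get_table(table_ref)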
