Request had insufficient authentication scopes (403) when trying to write crawled data to BigQuery from a Scrapy pipeline


Problem description

I'm trying to build a Scrapy crawler: the spider crawls the data, then in pipeline.py the data is saved to BigQuery. I built it with Docker, set up a crontab job, and pushed it to a Google Cloud server to run daily.

The problem is that when crontab executes the Scrapy crawler, it gets "google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/bigquery/v2/projects/project_name/datasets/dataset_name/tables/table_name: Request had insufficient authentication scopes.".

For more detail: when I access the container (docker exec -it ... /bin/bash) and run the crawler manually (scrapy crawl spider_name), it works like a charm and the data shows up in BigQuery.

I use a service account (JSON file) with the bigquery.admin role to set up GOOGLE_APPLICATION_CREDENTIALS.

# spider file is fine

# pipeline.py
from google.cloud import bigquery
import logging
from scrapy.exceptions import DropItem
...

class SpiderPipeline(object):
    def __init__(self):

        # BIGQUERY
        # Setup GOOGLE_APPLICATION_CREDENTIALS in docker file
        self.client = bigquery.Client()
        table_ref = self.client.dataset('dataset').table('data')
        self.table = self.client.get_table(table_ref)

    def process_item(self, item, spider):
        if item['key']:

            # BIGQUERY
            '''Order: key, source, lang, created, previous_price, lastest_price, rating, review_no, booking_no'''
            rows_to_insert = [( item['key'], item['source'], item['lang'])]
            errors = self.client.insert_rows(self.table, rows_to_insert)
            if errors == []:
                logging.debug('...Save data to bigquery {}...'.format(item['key']))
                # raise DropItem("Missing %s!" % item)
            else:
                logging.error('[Error upload to Bigquery]: {}'.format(errors))

            return item
        raise DropItem("Missing %s!" % item)
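As a side note, insert_rows expects each row tuple to match the table's schema column order. A minimal, standalone sketch of the row-building step (using the key/source/lang fields from the pipeline above; the helper function name is my own, not from the original code):

```python
def build_rows(items):
    """Build row tuples in schema order (key, source, lang), skipping
    items without a 'key' -- mirrors the DropItem logic in the pipeline."""
    rows = []
    for item in items:
        if item.get('key'):
            rows.append((item['key'], item['source'], item['lang']))
    return rows

items = [
    {'key': 'a1', 'source': 'site', 'lang': 'en'},
    {'key': '',   'source': 'site', 'lang': 'en'},  # empty key, dropped
]
print(build_rows(items))  # -> [('a1', 'site', 'en')]
```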

In the Dockerfile:

FROM python:3.5-stretch

WORKDIR /app

COPY requirements.txt ./

RUN pip install --trusted-host pypi.python.org -r requirements.txt

COPY . /app

# For Bigquery
# key.json is already in right location
ENV GOOGLE_APPLICATION_CREDENTIALS='/app/key.json'

# Scheduler cron

RUN apt-get update && apt-get -y install cron

# Add crontab file in the cron directory
ADD crontab /etc/cron.d/s-cron

# Give execution rights on the cron job
RUN chmod 0644 /etc/cron.d/s-cron

# Apply cron job
RUN crontab /etc/cron.d/s-cron

# Create the log file to be able to run tail
RUN touch /var/log/cron.log

# Run the command on container startup
CMD cron && tail -f /var/log/cron.log

In the crontab:

# Run once every day at midnight. Need empty line at the end to run.
0 0 * * * cd /app && /usr/local/bin/scrapy crawl spider >> /var/log/cron.log 2>&1
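One likely culprit worth checking: cron runs jobs with a mostly empty environment, so variables set with ENV in the Dockerfile (like GOOGLE_APPLICATION_CREDENTIALS) may not reach the crawler. Without that variable, the BigQuery client can fall back to the machine's default credentials, whose scopes may be insufficient. A possible workaround (assuming the /app/key.json path from the Dockerfile above) is to set the variable directly in the crontab file:

```
GOOGLE_APPLICATION_CREDENTIALS=/app/key.json
0 0 * * * cd /app && /usr/local/bin/scrapy crawl spider >> /var/log/cron.log 2>&1
```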

In conclusion, how can I get crontab to run the crawler without the 403 error? Thanks so much to anyone for the support.

Answer

I suggest you load the service account directly in your code rather than from the environment, like this:

from google.cloud import bigquery

service_account_file_path = "/app/key.json"  # your service account auth file
client = bigquery.Client.from_service_account_json(service_account_file_path)
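Before constructing the client, it can also help to sanity-check that the key file is present and actually looks like a service-account key, since cron failures are often just a missing or wrong path. A minimal sketch, assuming the /app/key.json path from the answer (the helper name is my own):

```python
import json
import os

def looks_like_service_account(path):
    """Return True if `path` exists and parses as a service-account key file."""
    if not os.path.exists(path):
        return False
    with open(path) as f:
        info = json.load(f)
    return info.get("type") == "service_account" and "client_email" in info

# e.g. guard the client construction:
# if looks_like_service_account("/app/key.json"):
#     client = bigquery.Client.from_service_account_json("/app/key.json")
```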

The rest of the code should stay the same, since you've verified it works.
