Request had insufficient authentication scopes (403) when trying to write crawling data to BigQuery from a Scrapy pipeline
Problem description
I'm trying to build a Scrapy crawler: the spider crawls the data, and then in pipeline.py the data is saved to BigQuery. I built it with Docker, set up a crontab job, and pushed it to a Google Cloud server to run daily.
The problem is that when crontab executes the Scrapy crawler, it fails with "google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/bigquery/v2/projects/project_name/datasets/dataset_name/tables/table_name: Request had insufficient authentication scopes.".
In more detail: when I enter the container (docker exec -it .../bin/bash) and run the crawler manually (scrapy crawl spider_name), it works like a charm and the data shows up in BigQuery.
I use a service account (JSON file) with the bigquery.admin role to set up GOOGLE_APPLICATION_CREDENTIALS.
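Since the crawl works interactively but fails under cron, a useful first step is to confirm what the failing process actually sees. A minimal sanity check might look like this (the check_creds function name is my own, not part of any tool):

```shell
# Hypothetical sanity check: is the credentials variable set, and is
# the key file it points to readable from this shell?
check_creds() {
    if [ -r "${1:-}" ]; then
        echo "readable"
    else
        echo "missing or unreadable"
    fi
}

check_creds "${GOOGLE_APPLICATION_CREDENTIALS:-}"
```

Running this both from an interactive docker exec shell and from the cron job itself (redirecting its output to the log) shows whether cron's environment differs from the interactive one.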
# spider file is fine
# pipeline.py
from google.cloud import bigquery
import logging
from scrapy.exceptions import DropItem
...

class SpiderPipeline(object):

    def __init__(self):
        # BIGQUERY
        # GOOGLE_APPLICATION_CREDENTIALS is set in the Dockerfile
        self.client = bigquery.Client()
        table_ref = self.client.dataset('dataset').table('data')
        self.table = self.client.get_table(table_ref)

    def process_item(self, item, spider):
        if item['key']:
            # BIGQUERY
            '''Order: key, source, lang, created, previous_price, lastest_price, rating, review_no, booking_no'''
            rows_to_insert = [(item['key'], item['source'], item['lang'])]
            error = self.client.insert_rows(self.table, rows_to_insert)
            if error == []:
                logging.debug('...Save data to bigquery {}...'.format(item['key']))
                # raise DropItem("Missing %s!" % item)
            else:
                logging.debug('[Error upload to Bigquery]: {}'.format(error))
            return item
        raise DropItem("Missing %s!" % item)
In the Dockerfile:
FROM python:3.5-stretch
WORKDIR /app
COPY requirements.txt ./
RUN pip install --trusted-host pypi.python.org -r requirements.txt
COPY . /app
# For Bigquery
# key.json is already in right location
ENV GOOGLE_APPLICATION_CREDENTIALS='/app/key.json'
# Scheduler cron
RUN apt-get update && apt-get -y install cron
# Add crontab file in the cron directory
ADD crontab /etc/cron.d/s-cron
# Give execution rights on the cron job
RUN chmod 0644 /etc/cron.d/s-cron
# Apply cron job
RUN crontab /etc/cron.d/s-cron
# Create the log file to be able to run tail
RUN touch /var/log/cron.log
# Run the command on container startup
CMD cron && tail -f /var/log/cron.log
In crontab:
# Run once every day at midnight. Need empty line at the end to run.
0 0 * * * cd /app && /usr/local/bin/scrapy crawl spider >> /var/log/cron.log 2>&1
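Note that cron runs jobs with a stripped-down environment, so a variable set with ENV in the Dockerfile is not necessarily visible to the job, even though it is visible in an interactive docker exec shell. If that turns out to be the cause, one possible workaround is to set the variable inside the crontab file itself (the path here is an assumption matching the Dockerfile above):

```
GOOGLE_APPLICATION_CREDENTIALS=/app/key.json
0 0 * * * cd /app && /usr/local/bin/scrapy crawl spider >> /var/log/cron.log 2>&1
```

Variable assignments like this are supported in user crontabs installed via the crontab command, which is how s-cron is applied here.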
In conclusion: how can I get crontab to run the crawler without the 403 error? Thanks so much to anyone who can help.
Recommended answer
I suggest you load the service account directly in your code instead of from the environment, like this:
from google.cloud import bigquery

service_account_file_path = "/app/key.json"  # your service account auth file
client = bigquery.Client.from_service_account_json(service_account_file_path)
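When loading the key file in code like this, a misconfigured path or a malformed key fails with less obvious errors than a plain 403. A small sanity check before constructing the client can make failures easier to diagnose; this is an illustrative sketch (validate_service_account_key is a hypothetical helper, not part of the BigQuery API):

```python
import json

def validate_service_account_key(path):
    """Hypothetical helper: check that a JSON file looks like a Google
    service-account key before handing it to the BigQuery client."""
    with open(path) as f:
        key = json.load(f)
    # These fields are present in every service-account key file.
    required = {"type", "project_id", "private_key", "client_email"}
    missing = required - key.keys()
    if key.get("type") != "service_account":
        raise ValueError("not a service-account key: type=%r" % key.get("type"))
    if missing:
        raise ValueError("key file missing fields: %s" % sorted(missing))
    return key["client_email"]
```

For example, calling validate_service_account_key("/app/key.json") in the pipeline's __init__, before bigquery.Client.from_service_account_json, turns a bad key file into an immediate, readable error in the cron log.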
The rest of the code can stay the same, since you have already verified that it works when run manually.