在Python数据流/Apache Beam上启动CloudSQL代理 [英] Start CloudSQL Proxy on Python Dataflow / Apache Beam

查看:79
本文介绍了在Python数据流/Apache Beam上启动CloudSQL代理的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在从事ETL Dataflow作业(使用Apache Beam Python SDK),该作业从CloudSQL查询数据(使用psycopg2和自定义ParDo)并将其写入BigQuery.我的目标是创建一个数据流模板,该模板可以使用Cron作业从AppEngine开始.

I am currently working on a ETL Dataflow job (using the Apache Beam Python SDK) which queries data from CloudSQL (with psycopg2 and a custom ParDo) and writes it to BigQuery. My goal is to create a Dataflow template which I can start from a AppEngine using a Cron job.

我有一个使用DirectRunner在本地工作的版本.为此,我使用CloudSQL(Postgres)代理客户端,以便可以连接到127.0.0.1上的数据库.

I have a version which works locally using the DirectRunner. For that I use the CloudSQL (Postgres) proxy client so that I can connect to the database on 127.0.0.1 .

将DataflowRunner与自定义命令一起使用以在setup.py脚本中启动代理时,该作业将不会执行. 坚持重复此日志消息:

When using the DataflowRunner with custom commands to start the proxy within a setup.py script, the job won't execute. It stucks with repeating this log-message:

Setting node annotation to enable volume controller attach/detach

setup.py的一部分看起来如下:

A part of my setup.py looks the following:

CUSTOM_COMMANDS = [
['echo', 'Custom command worked!'],
['wget', 'https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64', '-O', 'cloud_sql_proxy'],
['echo', 'Proxy downloaded'],
['chmod', '+x', 'cloud_sql_proxy']]

class CustomCommands(setuptools.Command):
  """A setuptools Command class able to run arbitrary commands."""

  def initialize_options(self):
    pass

  def finalize_options(self):
    pass

  def RunCustomCommand(self, command_list):
    print('Running command: %s' % command_list)
    logging.info("Running custom commands")
    p = subprocess.Popen(
        command_list,
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # Can use communicate(input='y\n'.encode()) if the command run requires
    # some confirmation.
    stdout_data, _ = p.communicate()
    print('Command output: %s' % stdout_data)
    if p.returncode != 0:
      raise RuntimeError(
          'Command %s failed: exit code: %s' % (command_list, p.returncode))

  def run(self):
    for command in CUSTOM_COMMANDS:
      self.RunCustomCommand(command)
    subprocess.Popen(['./cloud_sql_proxy', '-instances=bi-test-1:europe-west1:test-animal=tcp:5432'])

在阅读此run()中将最后一行添加为单独的subprocess.Popen() a>从 sthomp > Github上发出的问题关于Stackoverflo的讨论.我还尝试使用subprocess.Popen的一些参数.

I added the last line as separate subprocess.Popen() within run() after reading this issue on Github from sthomp and this discussion on Stackoverflo. I also tried to play around with some parameters of subprocess.Popen.


brodin 中另一个提到的解决方案是允许从每个IP地址进行访问,并通过用户名和密码进行连接.据我了解,他并不认为这是最佳做法.

Another mentioned solution from brodin was to allow access from every IP address and to connect via username and password. In my understanding he does not claim this as best practice.

在此先感谢您的帮助.

!!!这篇文章底部的解决方法!!!

这些是作业期间发生的错误级别的日志:

These are the logs on error level which occur during a job:

E  EXT4-fs (dm-0): couldn't mount as ext3 due to feature incompatibilities 
E  Image garbage collection failed once. Stats initialization may not have completed yet: unable to find data for container / 
E  Failed to check if disk space is available for the runtime: failed to get fs info for "runtime": unable to find data for container / 
E  Failed to check if disk space is available on the root partition: failed to get fs info for "root": unable to find data for container / 
E  [ContainerManager]: Fail to get rootfs information unable to find data for container / 
E  Could not find capacity information for resource storage.kubernetes.io/scratch 
E  debconf: delaying package configuration, since apt-utils is not installed 
E    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current 
E                                   Dload  Upload   Total   Spent    Left  Speed 
E  
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  3698  100  3698    0     0  25674      0 --:--:-- --:--:-- --:--:-- 25860 



#-- HERE IS WHEN setup.py FOR MY JOB IS EXECUTED ---

E  debconf: delaying package configuration, since apt-utils is not installed 
E  insserv: warning: current start runlevel(s) (empty) of script `stackdriver-extractor' overrides LSB defaults (2 3 4 5). 
E  insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script `stackdriver-extractor' overrides LSB defaults (0 1 6). 
E  option = Interval; value = 60.000000; 
E  option = FQDNLookup; value = false; 
E  Created new plugin context. 
E  option = PIDFile; value = /var/run/stackdriver-agent.pid; 
E  option = Interval; value = 60.000000; 
E  option = FQDNLookup; value = false; 
E  Created new plugin context. 


在这里,您可以找到自定义setup.py开始后的所有日志(日志级别:任何;所有日志):


Here you can find are all logs after the start of my custom setup.py (log-level: any; all logs):

https://jpst.it/1gk2Z

作业日志(一段时间没有卡住后,我手动取消了该作业):

Job logs (I manually canceled the job after not stucking for a while):

 2018-06-08 (08:02:20) Autoscaling is enabled for job 2018-06-07_23_02_20-5917188751755240698. The number of workers will b...
 2018-06-08 (08:02:20) Autoscaling was automatically enabled for job 2018-06-07_23_02_20-5917188751755240698.
 2018-06-08 (08:02:24) Checking required Cloud APIs are enabled.
 2018-06-08 (08:02:24) Checking permissions granted to controller Service Account.
 2018-06-08 (08:02:25) Worker configuration: n1-standard-1 in europe-west1-b.
 2018-06-08 (08:02:25) Expanding CoGroupByKey operations into optimizable parts.
 2018-06-08 (08:02:25) Combiner lifting skipped for step Save new watermarks/Write/WriteImpl/GroupByKey: GroupByKey not fol...
 2018-06-08 (08:02:25) Combiner lifting skipped for step Group watermarks: GroupByKey not followed by a combiner.
 2018-06-08 (08:02:25) Expanding GroupByKey operations into optimizable parts.
 2018-06-08 (08:02:26) Lifting ValueCombiningMappingFns into MergeBucketsMappingFns
 2018-06-08 (08:02:26) Annotating graph with Autotuner information.
 2018-06-08 (08:02:26) Fusing adjacent ParDo, Read, Write, and Flatten operations
 2018-06-08 (08:02:26) Fusing consumer Get rows from CloudSQL tables into Begin pipeline with watermarks/Read
 2018-06-08 (08:02:26) Fusing consumer Group watermarks/Write into Group watermarks/Reify
 2018-06-08 (08:02:26) Fusing consumer Group watermarks/GroupByWindow into Group watermarks/Read
 2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/WriteBundles/WriteBundles into Save new watermar...
 2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/GroupByKey/GroupByWindow into Save new watermark...
 2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/GroupByKey/Reify into Save new watermarks/Write/...
 2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/GroupByKey/Write into Save new watermarks/Write/...
 2018-06-08 (08:02:26) Fusing consumer Write to BQ into Get rows from CloudSQL tables
 2018-06-08 (08:02:26) Fusing consumer Group watermarks/Reify into Write to BQ
 2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/Map(<lambda at iobase.py:926>) into Convert dict...
 2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/WindowInto(WindowIntoFn) into Save new watermark...
 2018-06-08 (08:02:26) Fusing consumer Convert dictionary list to single dictionary and json into Remove "watermark" label
 2018-06-08 (08:02:26) Fusing consumer Remove "watermark" label into Group watermarks/GroupByWindow
 2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/InitializeWrite into Save new watermarks/Write/W...
 2018-06-08 (08:02:26) Workflow config is missing a default resource spec.
 2018-06-08 (08:02:26) Adding StepResource setup and teardown to workflow graph.
 2018-06-08 (08:02:26) Adding workflow start and stop steps.
 2018-06-08 (08:02:26) Assigning stage ids.
 2018-06-08 (08:02:26) Executing wait step start25
 2018-06-08 (08:02:26) Executing operation Save new watermarks/Write/WriteImpl/DoOnce/Read+Save new watermarks/Write/WriteI...
 2018-06-08 (08:02:26) Executing operation Save new watermarks/Write/WriteImpl/GroupByKey/Create
 2018-06-08 (08:02:26) Starting worker pool setup.
 2018-06-08 (08:02:26) Executing operation Group watermarks/Create
 2018-06-08 (08:02:26) Starting 1 workers in europe-west1-b...
 2018-06-08 (08:02:27) Value "Group watermarks/Session" materialized.
 2018-06-08 (08:02:27) Value "Save new watermarks/Write/WriteImpl/GroupByKey/Session" materialized.
 2018-06-08 (08:02:27) Executing operation Begin pipeline with watermarks/Read+Get rows from CloudSQL tables+Write to BQ+Gr...
 2018-06-08 (08:02:36) Autoscaling: Raised the number of workers to 0 based on the rate of progress in the currently runnin...
 2018-06-08 (08:02:46) Autoscaling: Raised the number of workers to 1 based on the rate of progress in the currently runnin...
 2018-06-08 (08:03:05) Workers have started successfully.
 2018-06-08 (08:11:37) Cancel request is committed for workflow job: 2018-06-07_23_02_20-5917188751755240698.
 2018-06-08 (08:11:38) Cleaning up.
 2018-06-08 (08:11:38) Starting worker pool teardown.
 2018-06-08 (08:11:38) Stopping worker pool...
 2018-06-08 (08:12:30) Autoscaling: Reduced the number of workers to 0 based on the rate of progress in the currently runni...

堆栈跟踪:

No errors have been received in this time period.


更新:可以在下面的答案中找到解决方法

推荐答案

解决方法:

我终于找到了解决方法.我想到了通过CloudSQL实例的公共IP连接的想法.为此,您需要允许从每个IP连接到CloudSQL实例:

Workaround Solution:

I finally found a workaround. I took the idea to connect via the public IP of the CloudSQL instance. For that you needed to allow connections to your CloudSQL instance from every IP:

  1. 转到GCP中的CloudSQL实例的概述页面
  2. 点击Authorization标签
  3. 单击Add network并添加0.0.0.0/0( !!这将允许每个IP地址连接到您的实例!! )
  1. Go to the overview page of your CloudSQL instance in GCP
  2. Click on the Authorization tab
  3. Click on Add network and add 0.0.0.0/0 (!! this will allow every IP address to connect to your instance !!)

为了增加安全性,我使用了SSL密钥,并且只允许与实例的SSL连接:

To add security to the process, I used SSL keys and only allowed SSL connections to the instance:

  1. 点击SSL标签
  2. 点击Create a new certificate为您的服务器创建SSL证书
  3. 单击Create a client certificate为您的客户端创建SSL证书
  4. 点击Allow only SSL connections拒绝所有无SSL连接尝试
  1. Click on SSL tab
  2. Click on Create a new certificate to create a SSL certificate for your server
  3. Click on Create a client certificate to create a SSL certificate for you client
  4. Click on Allow only SSL connections to reject all none SSL connection attempts

此后,我将证书存储在Google Cloud Storage存储桶中并加载 在连接Dataflow作业之前将它们连接起来,即:

After that I stored the certificates in a Google Cloud Storage bucket and load them before connecting within the Dataflow job, i.e.:

import psycopg2
import psycopg2.extensions
import os
import stat
from google.cloud import storage

# Function to wait for open connection when processing parallel
def wait(conn):
    while 1:
        state = conn.poll()
        if state == psycopg2.extensions.POLL_OK:
            break
        elif state == psycopg2.extensions.POLL_WRITE:
            pass
            select.select([], [conn.fileno()], [])
        elif state == psycopg2.extensions.POLL_READ:
            pass
            select.select([conn.fileno()], [], [])
        else:
            raise psycopg2.OperationalError("poll() returned %s" % state)

# Function which returns a connection which can be used for queries
def connect_to_db(host, hostaddr, dbname, user, password, sslmode = 'verify-full'):

    # Get keys from GCS
    client = storage.Client()

    bucket = client.get_bucket(<YOUR_BUCKET_NAME>)

    bucket.get_blob('PATH_TO/server-ca.pem').download_to_filename('server-ca.pem')
    bucket.get_blob('PATH_TO/client-key.pem').download_to_filename('client-key.pem')
    os.chmod("client-key.pem", stat.S_IRWXU)
    bucket.get_blob('PATH_TO/client-cert.pem').download_to_filename('client-cert.pem')

    sslrootcert = 'server-ca.pem'
    sslkey = 'client-key.pem'
    sslcert = 'client-cert.pem'

    con = psycopg2.connect(
        host = host,
        hostaddr = hostaddr,
        dbname = dbname,
        user = user,
        password = password,
        sslmode=sslmode,
        sslrootcert = sslrootcert,
        sslcert = sslcert,
        sslkey = sslkey)
    return con

然后我在自定义的ParDo中使用这些功能来执行查询.
最小的例子:

I then use these functions in a custom ParDo to perform queries.
Minimal example:

import apache_beam as beam

class ReadSQLTableNames(beam.DoFn):
    '''
    parDo class to get all table names of a given cloudSQL database.
    It will return each table name.
    '''
    def __init__(self, host, hostaddr, dbname, username, password):
        super(ReadSQLTableNames, self).__init__()
        self.host = host
        self.hostaddr = hostaddr
        self.dbname = dbname
        self.username = username
        self.password = password

    def process(self, element):

        # Connect do database
        con = connect_to_db(host = self.host,
            hostaddr = self.hostaddr,
            dbname = self.dbname,
            user = self.username,
            password = self.password)
        # Wait for free connection
        wait_select(con)
        # Create cursor to query data
        cur = con.cursor(cursor_factory=RealDictCursor)

        # Get all table names
        cur.execute(
        """
        SELECT
        tablename as table
        FROM pg_tables
        WHERE schemaname = 'public'
        """
        )
        table_names = cur.fetchall()

        cur.close()
        con.close()
        for table_name in table_names:
            yield table_name["table"]

然后管道的一部分可能看起来像这样:

A part of the pipeline then could look like this:

# Current workaround to query all tables: 
# Create a dummy initiator PCollection with one element
init = p        |'Begin pipeline with initiator' >> beam.Create(['All tables initializer'])

tables = init   |'Get table names' >> beam.ParDo(ReadSQLTableNames(
                                                host = known_args.host,
                                                hostaddr = known_args.hostaddr,
                                                dbname = known_args.db_name,
                                                username = known_args.user,
                                                password = known_args.password))

我希望该解决方案可以帮助其他有类似问题的人

I hope this solution helps others with similar problems

这篇关于在Python数据流/Apache Beam上启动CloudSQL代理的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆