Convert XML data from one table in Google BigQuery to JSON data in another column in the same table

Question

I have the following table in Google BigQuery (only a few rows are shown here):

id     loaded_date     data
1      2019-10-25      <collection><row><field name="Item Key" type="text" value="Haircolour - Avstemming kunder - OMT" /><field name="Created" type="datetime" value="2019-10-25 17:35:17Z" /><field name="Type" type="text" value="Session Provisioning Failure" /></row></collection>
2      2019-10-25      <collection><row><field name="Item Key" type="text" value="Haircolour - Avstemming kunder - OMT" /><field name="Created" type="datetime" value="2019-10-25 17:51:32Z" /><field name="Type" type="text" value="Session Provisioning Failure" /></row></collection>
3      2019-02-23      <collection><row><field name="Item Key" type="text" value="Haircolour - Hent klienter til kø" /><field name="Last Generation Time" type="datetime" value="2019-02-23 11:00:36Z" /><field name="Priority" type="number" value="-3" /></row></collection>

My data column is in XML format. I would like to add a fourth column to this table, for example called data_json, containing the same data as in the data column but in JSON format.

This means that I would like to end up with the following results:

id     loaded_date     data                    data_json
1      2019-10-25      Same data as before     {"collection": {"row": {"field": [{"-name": "Item Key","-type": "text","-value": "Haircolour - Avstemming kunder - OMT"},{"-name": "Created","-type": "datetime","-value": "2019-10-25 17:35:17Z"},{"-name": "Type","-type": "text","-value": "Session Provisioning Failure"}]}}}
2      2019-10-25      Same data as before     {"collection": {"row": {"field": [{"-name": "Item Key","-type": "text","-value": "Haircolour - Avstemming kunder - OMT"},{"-name": "Created","-type": "datetime","-value": "2019-10-25 17:51:32Z"},{"-name": "Type","-type": "text","-value": "Session Provisioning Failure"}]}}}
3      2019-02-23      Same data as before     {"collection": {"row": {"field": [{"-name": "Item Key","-type": "text","-value": "Haircolour - Hent klienter til kø"},{"-name": "Last Generation Time","-type": "datetime","-value": "2019-02-23 11:00:36Z"},{"-name": "Priority","-type": "number","-value": "-3"}]}}}

Is there a way to do that using SQL directly in BigQuery, or using Python?

Thanks

Answer

In order to update data in BigQuery you can take a look at Data Manipulation Language (DML), but take into account that it has its own quotas. In your case, I would consider creating a new table from the existing one, and processing the XML field in Python in order to parse it into JSON format.
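
For reference, a DML update issued from Python looks roughly like the sketch below; the table path here is a hypothetical placeholder, and since BigQuery SQL has no built-in XML parsing function, the XML-to-JSON conversion itself still has to happen outside of SQL:

from google.cloud import bigquery

client = bigquery.Client()

#Hypothetical table path; every UPDATE statement counts against the DML quotas
query = """
    UPDATE `my_project.my_dataset.my_table`
    SET data_json = 'placeholder'
    WHERE id = 1
"""
client.query(query).result()  #Waits for the DML job to complete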

I have reproduced the workflow on my end, using the Google Cloud client libraries for Python, and it works properly with the code attached below. This code works as follows:

  • Export the table as a CSV file to a GCS bucket
  • Download the CSV file from the GCS bucket to your machine
  • Append a column named "JSON_data" to the input DataFrame
  • Parse the XML from the "data" column into JSON format in the "JSON_data" column (the core conversion is sketched right after this list)
  • Create a new BigQuery table with the new data
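
A quick note on the conversion step: it relies on the xmltodict package, which by default prefixes XML attribute keys with "@" rather than the "-" shown in your expected output. xmltodict.parse accepts an attr_prefix argument, so a minimal sketch of the conversion for your first sample row could look like this:

import xmltodict, json

#Sample XML taken from the row with id 1 in the question
xml_value = ('<collection><row>'
             '<field name="Item Key" type="text" value="Haircolour - Avstemming kunder - OMT" />'
             '<field name="Created" type="datetime" value="2019-10-25 17:35:17Z" />'
             '<field name="Type" type="text" value="Session Provisioning Failure" />'
             '</row></collection>')

#attr_prefix='-' reproduces the "-name"/"-type"/"-value" keys from the
#expected output; the xmltodict default prefix is '@'
parsed = xmltodict.parse(xml_value, attr_prefix='-')
print(json.dumps(parsed))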

In order to create the BigQuery table I followed this StackOverflow thread (the link is preserved in a comment in the code below).

You will have to set your own variables (bucket_name, project, dataset_id, table_id, location). Remember to have your GCS bucket in the same region as your BigQuery dataset.

import xmltodict, json
from google.cloud import bigquery
from google.cloud import storage
import pandas as pd


#Define bigquery Client
client = bigquery.Client()

#Extract job
bucket_name = "<YOUR_BUCKET_NAME>"
project = "<YOUR_PROJECT_ID>"
dataset_id = "<YOUR_DATASET_ID>"
table_id = "<YOUR_TABLE_ID>"
location = "<YOUR_TABLE_LOCATION>"  #Must match the region of the dataset


def export_dataset(bucket_name, dataset_id, project, table_id):

    destination_uri = "gs://{}/{}".format(bucket_name, "bq_table.csv")
    dataset_ref = client.dataset(dataset_id, project=project)
    table_ref = dataset_ref.table(table_id)

    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        # Location must match that of the source table.
        location=location,
    )  # API request
    extract_job.result()  # Waits for job to complete.

    print(
        "Exported {}:{}.{} to {}".format(project, dataset_id, table_id, destination_uri)
    )


#Execute export job    
export_dataset(bucket_name, dataset_id, project, table_id)


#--------------------------------------------

#Retrieve CSV file from GCS bucket
source_blob_name = "bq_table.csv"
destination_file_name = "bq_table.csv"

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(source_blob_name)

    blob.download_to_filename(destination_file_name)

    print('Blob {} downloaded to {}.'.format(
        source_blob_name,
        destination_file_name))

#Download CSV from bucket
download_blob(bucket_name, source_blob_name, destination_file_name)

#--------------------------------------------

#Declare XML column name
XML_col = 'data' 

#Read the CSV as a Pandas DataFrame
df = pd.read_csv('bq_table.csv')
#Transform each XML value and collect the JSON strings in a list
#(pass attr_prefix='-' to xmltodict.parse if you want the "-name" key style
#shown in the question; the default prefix is '@')
JSON_arr = [json.dumps(xmltodict.parse(x)) for x in df[XML_col].values]
#Set the transformed data as the JSON_data column
df['JSON_data'] = JSON_arr
#Write the DataFrame to CSV - generate the output file
df.to_csv('new_data.csv', index=False, sep=',')

#----------------------------------------------


#Now we will create the new table from the new CSV
new_table = 'new_table'


#Define the schema for the new table (the exported CSV keeps the original
#data column, so it is included here as well)
schema = [
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("loaded_date", "DATE"),
        bigquery.SchemaField("data", "STRING"),
        bigquery.SchemaField("JSON_data", "STRING"),
    ]

#https://stackoverflow.com/questions/44947369/load-the-csv-file-into-big-query-auto-detect-schema-using-python-api
def insertTable(datasetName, tableName, csvFilePath, schema=None):
    """
    This function creates a table in given dataset in our default project
    and inserts the data given via a csv file.

    :param datasetName: The name of the dataset in which the table will be created
    :param tableName: The name of the table to be created
    :param csvFilePath: The path of the CSV file to be loaded
    :param schema: The schema of the table to be created
    :return: returns nothing
    """

    csv_file = open(csvFilePath, 'rb')

    dataset_ref = client.dataset(datasetName)
    from google.cloud.bigquery import Dataset
    dataset = Dataset(dataset_ref)

    table_ref = dataset.table(tableName)
    if schema is not None:
        table = bigquery.Table(table_ref,schema)
    else:
        table = bigquery.Table(table_ref)

    try:
        #Delete the table if it already exists so that it can be recreated
        client.delete_table(table)
    except Exception:
        pass

    table = client.create_table(table)

    from google.cloud.bigquery import LoadJobConfig        
    job_config = LoadJobConfig()
    table_ref = dataset.table(tableName)
    job_config.source_format = 'CSV'
    job_config.skip_leading_rows = 1
    job_config.autodetect = True
    job = client.load_table_from_file(
        csv_file, table_ref, job_config=job_config)
    job.result()

insertTable(dataset_id, new_table, 'new_data.csv', schema)
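
Once the script finishes, a quick sanity check is to query a few rows from the new table and make sure the JSON_data column holds valid JSON; this is just a verification sketch that reuses the client, project, dataset_id and new_table variables from the script above:

#Query a few rows from the new table to verify the JSON_data column
query = "SELECT id, JSON_data FROM `{}.{}.{}` LIMIT 3".format(project, dataset_id, new_table)
for row in client.query(query).result():
    #json.loads raises an error here if JSON_data is not valid JSON
    print(row.id, json.loads(row.JSON_data))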

Please let me know if this worked for you.
