BigQuery: Some rows belong to different partitions rather than destination partition


Problem Description

I am running an Airflow DAG that moves data from GCS to BQ using the GoogleCloudStorageToBigQueryOperator; I am on Airflow version 1.10.2.

The task moves data from MySQL to BQ (a partitioned table). Until now the table was partitioned by ingestion time, and the incremental load of the past three days worked fine when the data was loaded via the Airflow DAG.

We have now changed the partitioning type to date/timestamp on a DATE column of the table, and since then we have started getting this error. Because the incremental load pulls the last three days of data from the MySQL table, I expected the BQ job to either append the new records, or to recreate the partition with WRITE_TRUNCATE (which I had tested earlier); both fail with the error message below.

Exception: BigQuery job failed. Final error was: {'reason': 'invalid', 'message': 'Some rows belong to different partitions rather than destination partition 20191202'}

I won't be able to post the code, since all modules are called based on JSON parameters, but here is what I am passing to the operator for this table, along with the other regular parameters:

create_disposition='CREATE_IF_NEEDED',
time_partitioning={'field': 'entry_time', 'type': 'DAY'},
write_disposition='WRITE_APPEND',  # tried with 'WRITE_TRUNCATE' as well
schema_update_options=('ALLOW_FIELD_ADDITION',
                       'ALLOW_FIELD_RELAXATION')

I believe these are the fields that might be causing the issue; any help on this is appreciated.
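
Since I can't share the real code, here is a rough sketch of how those parameters sit in the operator call on Airflow 1.10.2; the task_id, bucket, source_objects, and source_format values are placeholders, not the real ones:

from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

load_to_bq = GoogleCloudStorageToBigQueryOperator(
    task_id='gcs_to_bq_incremental',                    # placeholder
    bucket='my-staging-bucket',                         # placeholder
    source_objects=['exports/my_table/*.json'],         # placeholder
    source_format='NEWLINE_DELIMITED_JSON',             # assumed format
    destination_project_dataset_table='dataset.table',  # note: no partition decorator
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',                   # 'WRITE_TRUNCATE' fails the same way
    time_partitioning={'field': 'entry_time', 'type': 'DAY'},
    schema_update_options=('ALLOW_FIELD_ADDITION',
                           'ALLOW_FIELD_RELAXATION'),
    dag=dag,
)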

Recommended Answer

When using BigQuery tables partitioned by date or timestamp, you should specify the partition to load the data into, e.g.

table_name$20160501

Also, your column values should match the partition. For example, if you create this table:

$ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"

The column date is the column the partitioning is based on, and if you try to load the following row:

$ echo "1,3.14,2018-11-07" > row.csv
$ bq "tmp_elliottb.PartitionedTable\$20181105" ./row.csv

the load fails, because the row's date (2018-11-07) belongs to partition 20181107 rather than the destination partition 20181105:

Some rows belong to different partitions rather than destination partition 20181105

I suggest using the following destination_project_dataset_table value and verifying that the data matches the partition date:

destination_project_dataset_table='dataset.table$YYYYMMDD',
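
In Airflow terms, a sketch of that could use the execution-date macro {{ ds_nodash }}, which renders as YYYYMMDD at run time (destination_project_dataset_table is a templated field on this operator; 'dataset.table' is a placeholder):

# hypothetical: load each run into the partition for that run's execution date
destination_project_dataset_table='dataset.table${{ ds_nodash }}',

Keep in mind that a partition decorator targets exactly one partition, so if the incremental extract spans three days, you would need either one load per day or rows filtered down to the decorator's date; otherwise this same error is expected.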
