BigQuery: Some rows belong to different partitions rather than destination partition
Problem description
I am running an Airflow DAG which moves data from GCS to BQ using the GoogleCloudStorageToBigQueryOperator; I am on Airflow version 1.10.2.
This task moves data from MySQL to BQ (a partitioned table). All this time we were partitioning by ingestion time, and the incremental load for the past three days was working fine when the data was loaded using the Airflow DAG.
Now we changed the partitioning type to date or timestamp on a DATE column of the table, after which we started getting this error. Since the incremental load pulls the last three days of data from the MySQL table, I was expecting the BQ job to append the new records, or to recreate the partition with 'WRITE_TRUNCATE' (which I have tested earlier); both fail with the error message below.
Exception: BigQuery job failed. Final error was: {'reason': 'invalid', 'message': 'Some rows belong to different partitions rather than destination partition 20191202'}
I won't be able to post the code, since all modules are called based on JSON parameters, but here is what I am passing to the operator for this table, along with the other regular parameters:
create_disposition='CREATE_IF_NEEDED',
time_partitioning={'field': 'entry_time', 'type': 'DAY'},
write_disposition='WRITE_APPEND',  # also tried 'WRITE_TRUNCATE'
schema_update_options=('ALLOW_FIELD_ADDITION', 'ALLOW_FIELD_RELAXATION')
I believe these are the fields which might cause the issue; any help on this is appreciated.
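To see why a multi-day incremental load can trigger this error, here is a minimal sketch (plain Python, not the actual DAG; the `entry_time` values are hypothetical) showing that three days of rows map to three distinct BigQuery partition IDs, so they cannot all land in one destination partition:

```python
from datetime import date

# Hypothetical sample of a three-day incremental load; entry_time is the
# DATE partitioning column from the question.
rows = [
    {"id": 1, "entry_time": date(2019, 11, 30)},
    {"id": 2, "entry_time": date(2019, 12, 1)},
    {"id": 3, "entry_time": date(2019, 12, 2)},
]

# BigQuery derives each row's partition ID (YYYYMMDD) from the partitioning column.
partitions = sorted({row["entry_time"].strftime("%Y%m%d") for row in rows})
print(partitions)  # ['20191130', '20191201', '20191202']
```

Because the batch spans three partitions, a load whose destination names a single partition (e.g. `table$20191202`) rejects the rows from the other two days.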
Recommended answer
When using BigQuery tables partitioned by date or timestamp, you should specify the partition to load the data into, e.g.:
table_name$20160501
Also, your column values should match the partition. For example, if you create this table:
$ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"
The column date is the partitioning column, and if you try to load the following row:
$ echo "1,3.14,2018-11-07" > row.csv
$ bq load "tmp_elliottb.PartitionedTable\$20181105" ./row.csv
you will get the following error, because the value 2018-11-07 belongs to partition 20181107, not the destination partition:

Some rows belong to different partitions rather than destination partition 20181105
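The validation BigQuery applies here can be mimicked in a few lines of plain Python (a sketch for illustration only, not BigQuery's actual implementation): every row's DATE value must fall inside the partition named by the `$YYYYMMDD` decorator.

```python
import csv
import io
from datetime import datetime

def check_partition(csv_text, destination_partition):
    """Mimic BigQuery's check: each row's DATE field (third column here)
    must belong to the destination partition named by the $YYYYMMDD decorator."""
    for x, y, d in csv.reader(io.StringIO(csv_text)):
        row_partition = datetime.strptime(d, "%Y-%m-%d").strftime("%Y%m%d")
        if row_partition != destination_partition:
            raise ValueError(
                "Some rows belong to different partitions rather than "
                f"destination partition {destination_partition}"
            )

check_partition("1,3.14,2018-11-07\n", "20181107")    # matches, no error
# check_partition("1,3.14,2018-11-07\n", "20181105")  # raises ValueError
```

Loading the same row into `PartitionedTable$20181107` instead would succeed, because the row's date matches the decorator.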
I suggest using the following destination_project_dataset_table value and verifying that the data matches the partition date:
destination_project_dataset_table='dataset.table$YYYYMMDD',
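A small helper for building that decorated table name from a date could look like this (a sketch; the function name is made up for illustration):

```python
from datetime import date

def partition_decorator(dataset_table, day):
    """Append the $YYYYMMDD partition decorator to a dataset.table name."""
    return f"{dataset_table}${day.strftime('%Y%m%d')}"

print(partition_decorator("dataset.table", date(2019, 12, 2)))
# dataset.table$20191202
```

In an Airflow DAG, the day portion would typically come from a template variable such as `{{ ds_nodash }}` rather than a hard-coded date.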