Cloud Dataflow to BigQuery - too many sources
Question
I have a job that, among other things, inserts some of the data it reads from files into a BigQuery table for later manual analysis.
It fails with the following error:
job error: Too many sources provided: 10001. Limit is 10000., error: Too many sources provided: 10001. Limit is 10000.
What does it mean by "source"? Is it a file or a pipeline step?
Thanks, G
Answer
I'm guessing the error is coming from BigQuery and means that we are trying to upload too many files when creating your output table.
Could you provide more details on the error and its context (such as a snippet of the command-line output, if you're using the BlockingDataflowPipelineRunner) so I can confirm? A jobId would also be helpful.
Is there something about your pipeline structure that would result in a large number of output files? That could be either a large amount of data or perhaps finely sharded input files without a subsequent GroupByKey operation (which would let us reshard the data into larger pieces).
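To illustrate the resharding idea, here is a minimal plain-Python sketch (not actual Dataflow code; the function name and shard counts are made up for illustration). Grouping records by a small number of keys, the way a GroupByKey does, collapses many tiny shards into fewer, larger ones, keeping the number of files handed to a single BigQuery load under the limit reported in the error:

```python
from collections import defaultdict

BIGQUERY_MAX_SOURCES = 10000  # the per-load limit reported in the error message

def reshard(shards, num_output_shards):
    """Simulate a GroupByKey-style reshard: assign each input shard to one of
    num_output_shards buckets, then merge each bucket into a single larger shard."""
    buckets = defaultdict(list)
    for i, shard in enumerate(shards):
        buckets[i % num_output_shards].append(shard)
    # each bucket becomes one output shard containing all of its records
    return [[record for shard in bucket for record in shard]
            for bucket in buckets.values()]

# 10001 tiny input shards would exceed the limit, as in the question...
inputs = [[n] for n in range(10001)]
# ...but after resharding into 500 buckets, the file count is well under it
outputs = reshard(inputs, num_output_shards=500)
assert len(inputs) > BIGQUERY_MAX_SOURCES
assert len(outputs) <= BIGQUERY_MAX_SOURCES
# no records are lost or duplicated by the reshard
assert sorted(r for shard in outputs for r in shard) == list(range(10001))
```

In a real pipeline the equivalent step would be a GroupByKey (or similar reshuffling transform) inserted before the BigQuery write, so that the runner produces fewer, larger output files.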