Cloud Dataflow to BigQuery - too many sources

Question

I have a job that, among other things, also inserts some of the data it reads from files into a BigQuery table for later manual analysis.

It fails with the following error:

job error: Too many sources provided: 10001. Limit is 10000., error: Too many sources provided: 10001. Limit is 10000.

What does it refer to as "source"? Is it a file or a pipeline step?

Thanks, G

Answer

I'm guessing the error is coming from BigQuery and means that we are trying to upload too many files when we create your output table.

Could you provide some more details on the error / context (like a snippet of the command-line output, if using the BlockingDataflowPipelineRunner) so I can confirm? A jobId would also be helpful.

Is there something about your pipeline structure that is going to result in a large number of output files? That could either be a large amount of data or perhaps finely sharded input files without a subsequent GroupByKey operation (which would let us reshard the data into larger pieces).
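
For illustration, here is a minimal sketch of that resharding idea. The question predates Apache Beam (the BlockingDataflowPipelineRunner belongs to the Dataflow 1.x SDK), so this uses the current Beam Java SDK's Reshuffle.viaRandomKey(), which implements the random-key GroupByKey described above; the bucket path, table name, and row format are all hypothetical:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.TypeDescriptor;

public class ReshardBeforeBigQueryWrite {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadManySmallFiles", TextIO.read().from("gs://my-bucket/input/*"))  // hypothetical path
     .apply("ToTableRow", MapElements
         .into(TypeDescriptor.of(TableRow.class))
         .via((String line) -> new TableRow().set("line", line)))                // hypothetical row format
     .setCoder(TableRowJsonCoder.of())
     // Collapse the finely sharded input into fewer, larger bundles so the
     // BigQuery sink produces fewer temporary files (i.e. fewer "sources").
     .apply("Reshard", Reshuffle.viaRandomKey())
     .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
         .to("my-project:my_dataset.my_table")                                   // hypothetical table
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}

In batch mode the BigQuery sink writes roughly one temporary file per bundle and then issues a load job over all of them, so resharding before the write is what keeps the file count under the 10,000-source limit the error reports.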
