Skipping header rows - is it possible with Cloud DataFlow?

Problem description

I've created a Pipeline, which reads from a file in GCS, transforms it, and finally writes to a BQ table. The file contains a header row (fields).

Is there any way to programmatically set the "number of header rows to skip", like you can do in BQ when loading data in?

Recommended answer

This is not currently possible. It sounds like there are two potential requests here:

  • Specify the presence of header rows, and the skip behavior, for a BigQuery import.
  • Specify that a GCS text source should skip header rows.

Future work on this is tracked at https://issues.apache.org/jira/browse/BEAM-123.

In the meantime, you could add a simple filter to your ParDo code to skip headers. Something like this:

PCollection<X> rows = ...;
PCollection<X> nonHeaders =
   rows.apply(Filter.by(new MatchIfNonHeader()));
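
The snippet above leaves MatchIfNonHeader undefined, so here is one possible shape for it, as a minimal sketch rather than the answerer's actual code. It assumes each element is one line of text and that the exact header line is known up front; the constructor argument and the dropHeaders helper are hypothetical names added for illustration. Because a PCollection has no guaranteed element order, the predicate matches on content instead of counting the first N lines. To keep the example runnable without the SDK on the classpath it implements java.util.function.Predicate; in a real pipeline the same logic would go into a SerializableFunction<String, Boolean> passed to Filter.by.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/**
 * Hypothetical predicate: returns true for data rows and false for the header line.
 * Assumption: the header text is known in advance and passed to the constructor.
 */
final class MatchIfNonHeader implements Predicate<String> {
    private final String headerLine;

    MatchIfNonHeader(String headerLine) {
        this.headerLine = headerLine;
    }

    @Override
    public boolean test(String line) {
        // Compare by content, since pipeline elements arrive in no particular order.
        return !headerLine.equals(line);
    }

    /** Plain-Java stand-in for rows.apply(Filter.by(...)), for local experimentation. */
    static List<String> dropHeaders(List<String> lines, String headerLine) {
        return lines.stream()
                .filter(new MatchIfNonHeader(headerLine))
                .collect(Collectors.toList());
    }
}
```

One caveat with this approach: a data row that happens to be identical to the header line is dropped as well, and every file read by the pipeline has to share the same header text.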
