How to get the filename when using file pattern matching in google-cloud-dataflow


Problem description

Does anyone know how to get the filename when using file pattern matching in google-cloud-dataflow?

I'm a newbie to Dataflow. How do I get the filename when using a file pattern match like this:

p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*.txt"))

I'd like to know how to detect filenames such as kinglear.txt, Hamlet.txt, etc.

Answer

If you would like to simply expand the filepattern and get a list of the filenames matching it, you can use GcsIoChannelFactory.match("gs://dataflow-samples/shakespeare/*.txt") (see the GcsIoChannelFactory javadoc).
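For illustration, here is the same kind of filepattern expansion performed on a local filesystem with plain JDK NIO. This is only an analogue of what the GCS match call returns, not the GcsIoChannelFactory API itself; the class and method names are illustrative:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class MatchFiles {
  // Expand a glob such as "*.txt" against a directory and return the
  // matching filenames, similar in spirit to expanding a GCS filepattern.
  static List<String> match(Path dir, String glob) throws IOException {
    List<String> names = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
      for (Path p : stream) {
        names.add(p.getFileName().toString());
      }
    }
    return names;
  }
}
```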

If you would like to access the "current filename" from inside one of the DoFns downstream in your pipeline, that is currently not supported (though there are some workarounds; see below). It is a common feature request and we are still thinking about how best to fit it into the framework in a natural, generic and performant way.

Some workarounds include:

  • Writing a pipeline like this (the tf-idf example uses this approach):

    DoFn readFile = ...(takes a filename, reads the file and produces records)...
    p.apply(Create.of(filenames))
     .apply(ParDo.of(readFile))
     .apply(the rest of your pipeline)

This has the downside that dynamic work rebalancing features won't work particularly well, because they currently apply only at the level of Read PTransforms, not at the level of ParDos with high fan-out (like the one here, which reads a file and produces all of its records); parallelization will only work at the level of whole files, and files will not be split into sub-ranges. At the scale of reading Shakespeare this is not an issue, but if you are reading a set of files of wildly different sizes, some extremely large, it may become one.
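Concretely, the first workaround might look like the following Java sketch against the Dataflow 1.x SDK. The use of IOChannelUtils to open a gs:// path, and the hard-coded filename list, are illustrative assumptions, not part of the original answer:

```java
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.util.IOChannelUtils;
import com.google.cloud.dataflow.sdk.values.KV;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// A DoFn that takes a filename, reads that file, and emits (filename, line)
// pairs, so every record downstream carries the name of its source file.
class ReadFileFn extends DoFn<String, KV<String, String>> {
  @Override
  public void processElement(ProcessContext c) throws Exception {
    String filename = c.element();
    // Assumption: IOChannelUtils (Dataflow SDK 1.x) resolves a gs:// spec
    // to a readable byte channel.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        Channels.newInputStream(IOChannelUtils.getFactory(filename).open(filename)),
        StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        c.output(KV.of(filename, line));
      }
    }
  }
}

// Wiring, e.g. with a pre-expanded list of filenames:
// List<String> filenames = Arrays.asList(
//     "gs://dataflow-samples/shakespeare/kinglear.txt",
//     "gs://dataflow-samples/shakespeare/hamlet.txt");
// p.apply(Create.of(filenames))
//  .apply(ParDo.of(new ReadFileFn()))
//  // ... the rest of your pipeline ...
```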

  • Implementing your own FileBasedSource (javadoc, general documentation) which would return records of some type like Pair<String, T>, where the String is the filename and T is the record you're reading. In this case the framework handles the filepattern matching for you and dynamic work rebalancing works just fine, but it is up to you to write the reading logic in your FileBasedReader.
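A rough skeleton of that second workaround is sketched below. The method signatures follow the Dataflow SDK 1.x FileBasedSource API as I understand it, and the class names are illustrative; the actual reading logic in the reader is elided and left to the implementer:

```java
import com.google.cloud.dataflow.sdk.coders.Coder;
import com.google.cloud.dataflow.sdk.coders.KvCoder;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.io.FileBasedSource;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.values.KV;

// Sketch only: a source that pairs each record with its filename.
class FileNameTextSource extends FileBasedSource<KV<String, String>> {

  public FileNameTextSource(String fileOrPattern) {
    super(fileOrPattern, 1L /* minBundleSize */);
  }

  private FileNameTextSource(String fileName, long start, long end) {
    super(fileName, 1L, start, end);
  }

  @Override
  protected FileBasedSource<KV<String, String>> createForSubrangeOfFile(
      String fileName, long start, long end) {
    // The framework calls this per matched file / sub-range, so the
    // filename is available here and can be threaded into the reader.
    return new FileNameTextSource(fileName, start, end);
  }

  @Override
  protected FileBasedReader<KV<String, String>> createSingleFileReader(
      PipelineOptions options) {
    // Your FileBasedReader subclass goes here: implement startReading(),
    // readNextRecord() and getCurrent(), emitting
    // KV.of(<this source's filename>, <record>) for each record.
    throw new UnsupportedOperationException("reader not implemented in this sketch");
  }

  @Override
  public boolean producesSortedKeys(PipelineOptions options) {
    return false;
  }

  @Override
  public Coder<KV<String, String>> getDefaultOutputCoder() {
    return KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());
  }
}
```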

Both of these workarounds are non-ideal, but depending on your requirements, one of them may do the trick for you.
