Is there a way to read an Excel file using Dataflow?


Question

Is there a way to read an Excel file stored in a GCS bucket using Dataflow?

I would also like to know whether we can access the metadata of an object in GCS using Dataflow. If so, how?

Recommended answer

CSV files are often used to get data out of Excel. They can be split and read line by line, which makes them a good fit for Dataflow. You can use TextIO.Read to pull in each line of the file, then parse each line as a CSV record.
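The per-line parsing step could look roughly like the sketch below. It uses only the Java standard library; in a pipeline, a method like this would be called from a ParDo on each line produced by TextIO.Read. The class name CsvLineParser is hypothetical, and the sketch assumes comma separators and double-quote quoting with no line breaks inside quoted fields (line-by-line reading would split those anyway).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a single CSV line into its fields. */
final class CsvLineParser {
    private CsvLineParser() {}

    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    current.append(c); // commas inside quotes are field content
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote
            } else if (c == ',') {
                fields.add(current.toString()); // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // Sample line with a quoted comma and an escaped quote.
        System.out.println(parseLine("id,\"Smith, Jane\",\"said \"\"hi\"\"\""));
    }
}
```

For real workloads a maintained CSV library is probably a safer choice than hand-rolled parsing; the sketch is only meant to show what "parse each line" involves.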

If you want to use one of the binary Excel formats instead, then I believe you would need to read in the entire file and use a library to parse it. I recommend using CSV files if you can.
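As a rough illustration of the "read the whole file, then hand it to a library" approach, the sketch below reads a stream fully into memory and checks the leading bytes to see which container format it holds: a ZIP archive (the container used by .xlsx) or an OLE2 compound file (the container used by legacy .xls). The class name WorkbookSniffer is hypothetical, and the actual parsing would still need a spreadsheet library (Apache POI is one option, though the answer does not name one). Note that a ZIP signature only says the file is a ZIP archive, not that it is a valid workbook.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper: loads a whole file and identifies its container format. */
final class WorkbookSniffer {
    enum Format { ZIP_CONTAINER, OLE2_CONTAINER, UNKNOWN }

    // ZIP local file header signature ("PK\3\4"), as used by .xlsx files.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // OLE2 compound file signature, as used by legacy .xls files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    private WorkbookSniffer() {}

    /** Reads the entire stream into memory (requires Java 9+ for readAllBytes). */
    static byte[] readFully(InputStream in) {
        try {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Format detect(byte[] content) {
        if (startsWith(content, ZIP_MAGIC)) return Format.ZIP_CONTAINER;
        if (startsWith(content, OLE2_MAGIC)) return Format.OLE2_CONTAINER;
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for a file fetched from storage: only the ZIP signature bytes.
        byte[] content = readFully(new ByteArrayInputStream(ZIP_MAGIC));
        System.out.println(detect(content));
    }
}
```

Because the whole file has to be held in memory by one worker, this approach does not split the way line-oriented CSV input does, which is part of why CSV is the easier route.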

As for reading the GCS metadata: I don't think you can do this with TextIO, but you could call the GCS API directly to access it. If you only do this for a few files at the start of your program, it will work and not be too expensive. If you need to read many files this way, you'll be adding an extra RPC for each file.

Be careful not to read the same file multiple times. I suggest reading each file's metadata once and then writing the metadata out to a side input. Then, in one of your ParDos, you can access the side input for each file.
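The "fetch once, then look up" idea can be sketched without any Dataflow or GCS dependencies. In the sketch below, MetadataIndex and its fetcher parameter are hypothetical: the fetcher stands in for whatever GCS API call returns the metadata for a path, and the resulting read-only map plays the role of the data you would publish as a side input. The bucket paths and the fake metadata in main are made up for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: fetches metadata once per distinct path. */
final class MetadataIndex {
    private MetadataIndex() {}

    /**
     * Builds a path-to-metadata map, calling the fetcher at most once per
     * distinct path. The fetcher is assumed to return a non-null value.
     */
    static <M> Map<String, M> build(List<String> paths, Function<String, M> fetcher) {
        Map<String, M> index = new LinkedHashMap<>();
        for (String path : paths) {
            // computeIfAbsent skips the fetch when the path is already present.
            index.computeIfAbsent(path, fetcher);
        }
        return Collections.unmodifiableMap(index);
    }

    public static void main(String[] args) {
        int[] calls = {0}; // counts how many simulated RPCs were made
        Map<String, Long> sizes = build(
            List.of("gs://bucket/a.csv", "gs://bucket/b.csv", "gs://bucket/a.csv"),
            path -> {
                calls[0]++;
                return (long) path.length(); // fake metadata for the demo
            });
        System.out.println(sizes + " fetched with " + calls[0] + " calls");
    }
}
```

Inside a ParDo, each element would then look up its file's entry in the map instead of issuing its own RPC.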

Useful links: ETL & Parsing CSV files in Cloud Dataflow

https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/TextIO.Read

https://cloud.google.com/dataflow/model/par-do#side-inputs
