Reading CSV header with Dataflow

Question

I have a CSV file whose column names I don't know ahead of time. I need to output the data as JSON after some transformations in Google Dataflow.

What's the best way to take the header row and propagate its labels to all the rows?

For example:

a,b,c
1,2,3
4,5,6

...becomes (roughly):

{a:1, b:2, c:3}
{a:4, b:5, c:6}

Recommended answer

You should implement a custom FileBasedSource (similar to TextIO.TextSource) whose reader reads the first line and stores the header data:

    @Override
    protected void startReading(final ReadableByteChannel channel)
    throws IOException {
        lineReader = new LineReader(channel);

        if (lineReader.readNextLine()) {
            final String headerLine = lineReader.getCurrent().trim();
            header = headerLine.split(",");
            readingStarted = true;
        }
    }

Later, while reading the other lines, prepend the header labels to the current line's data:

    @Override
    protected boolean readNextRecord() throws IOException {
        if (!lineReader.readNextLine()) {
            return false;
        }

        final String line = lineReader.getCurrent();
        final String[] data = line.split(",");

        // assumes every line has at least header.length fields (no validation)
        final StringBuilder record = new StringBuilder();
        for (int i = 0; i < header.length; i++) {
            record.append(header[i]).append(":").append(data[i]).append(", ");
        }

        currentRecord = record.toString();
        return true;
    }
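The record built above is a labelled string rather than JSON, which is what the question asks for. As a rough sketch of how the same header/field pairing could emit a JSON object instead, here is a plain-Java helper with no Dataflow dependency. The class name `RowToJson` and its methods are hypothetical and are not part of the SDK or of the answer's solution. It assumes fields contain no embedded commas or control characters, and it keeps every value as a string.

```java
// Hypothetical helper for illustration; not part of the Dataflow SDK.
final class RowToJson {

    // Wraps a value in double quotes, escaping backslashes and double quotes.
    // Assumption: the value contains no control characters (newlines, tabs, ...).
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Pairs each header label with the field at the same position in the line.
    // Assumption: fields are separated by plain commas with no CSV quoting.
    static String toJson(String[] header, String line) {
        // Limit -1 keeps trailing empty fields, which a plain split(",") would drop.
        final String[] data = line.split(",", -1);
        if (data.length != header.length) {
            throw new IllegalArgumentException(
                    "expected " + header.length + " fields but got " + data.length + ": " + line);
        }
        final StringBuilder record = new StringBuilder("{");
        for (int i = 0; i < header.length; i++) {
            if (i > 0) {
                record.append(", ");
            }
            record.append(quote(header[i])).append(": ").append(quote(data[i]));
        }
        return record.append("}").toString();
    }

    public static void main(String[] args) {
        final String[] header = {"a", "b", "c"};
        System.out.println(toJson(header, "1,2,3")); // prints {"a": "1", "b": "2", "c": "3"}
    }
}
```

Rejecting rows whose field count differs from the header is one possible choice; a real pipeline might prefer to route such rows to a separate output. For anything beyond simple input, a JSON library and a CSV parser would be safer than hand-rolled escaping and splitting.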

I've implemented a quick (complete) solution, available on GitHub. I also added a Dataflow unit test to demonstrate reading:

@Test
public void test_reading() throws Exception {
    final File file =
            new File(getClass().getResource("/sample.csv").toURI());
    assertThat(file.exists()).isTrue();

    final Pipeline pipeline = TestPipeline.create();

    final PCollection<String> output =
            pipeline.apply(Read.from(CsvWithHeaderFileSource.from(file.getAbsolutePath())));

    DataflowAssert
            .that(output)
            .containsInAnyOrder("a:1, b:2, c:3, ", "a:4, b:5, c:6, ");

    pipeline.run();
}

where sample.csv has the following content:

a,b,c
1,2,3
4,5,6
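To check the expected strings without running a pipeline, the header-first logic can be reproduced in plain Java over these three lines. This is only a sketch: `HeaderLabelSketch` and `labelRows` are hypothetical names, an `Iterator<String>` stands in for the answer's line reader, and no Dataflow classes are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the custom reader; it uses no Dataflow classes.
final class HeaderLabelSketch {

    // Treats the first line as the header and labels every later line with it,
    // using the same "label:value, " layout as the answer's readNextRecord().
    static List<String> labelRows(Iterator<String> lines) {
        final List<String> records = new ArrayList<>();
        if (!lines.hasNext()) {
            return records; // empty input: no header, no records
        }
        final String[] header = lines.next().trim().split(",");
        while (lines.hasNext()) {
            final String[] data = lines.next().split(",");
            // Like the answer's code, this assumes every row has at least header.length fields.
            final StringBuilder record = new StringBuilder();
            for (int i = 0; i < header.length; i++) {
                record.append(header[i]).append(":").append(data[i]).append(", ");
            }
            records.add(record.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        final List<String> sample = Arrays.asList("a,b,c", "1,2,3", "4,5,6");
        for (String record : labelRows(sample.iterator())) {
            System.out.println(record);
        }
    }
}
```

Each printed record keeps the trailing ", " separator, which is why the unit test's expected strings end with it.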
