Reading CSV header with Dataflow

Question

I have a CSV file whose column names I don't know ahead of time. After some transformations in Google Dataflow, I need to output the data as JSON.

What's the best way to take the header row and propagate its labels through all the rows?

For example:

a,b,c
1,2,3
4,5,6

...becomes (approximately):

{a:1, b:2, c:3}
{a:4, b:5, c:6}

Recommended answer

You should implement a custom FileBasedSource (similar to TextIO.TextSource) that reads the first line and stores the header data:

    @Override
    protected void startReading(final ReadableByteChannel channel)
    throws IOException {
        lineReader = new LineReader(channel);

        // Consume the first line of the file and keep it as the header.
        if (lineReader.readNextLine()) {
            final String headerLine = lineReader.getCurrent().trim();
            // Simple split: assumes labels contain no quoted commas.
            header = headerLine.split(",");
            readingStarted = true;
        }
    }
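
The header-reading step can also be tried outside Dataflow. The following is a minimal sketch in plain Java, assuming a simple comma-separated header with no quoted fields. The class name `HeaderReader` and the method `readHeader` are hypothetical and are not part of the Dataflow SDK or of the answer's GitHub code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper for illustration; not part of the Dataflow SDK.
public class HeaderReader {

    // Reads the first line and splits it into trimmed column labels.
    // Returns an empty array when the input has no lines at all.
    public static String[] readHeader(final BufferedReader reader) throws IOException {
        final String headerLine = reader.readLine();
        if (headerLine == null) {
            return new String[0];
        }
        // Assumption: labels never contain quoted commas.
        final String[] labels = headerLine.trim().split(",");
        for (int i = 0; i < labels.length; i++) {
            labels[i] = labels[i].trim();
        }
        return labels;
    }

    public static void main(final String[] args) throws IOException {
        final BufferedReader reader =
                new BufferedReader(new StringReader("a,b,c\n1,2,3\n4,5,6\n"));
        final String[] header = readHeader(reader);
        System.out.println(String.join("|", header)); // prints a|b|c
        System.out.println(reader.readLine());        // prints 1,2,3 (first data row)
    }
}
```

After the header is consumed, the reader is positioned at the first data row, which is the same state `startReading` leaves `lineReader` in.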

Later, while reading the other lines, prepend the matching header label to each value on the current line:

    @Override
    protected boolean readNextRecord() throws IOException {
        if (!lineReader.readNextLine()) {
            return false;
        }

        final String line = lineReader.getCurrent();
        final String[] data = line.split(",");

        // assumes all lines are valid
        final StringBuilder record = new StringBuilder();
        for (int i = 0; i < header.length; i++) {
            record.append(header[i]).append(":").append(data[i]).append(", ");
        }

        currentRecord = record.toString();
        return true;
    }
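
The record built above is a plain `label:value` string, not strict JSON. Since the question asks for JSON output, here is a hedged sketch of how the same header/row pairing could emit a JSON object instead. It is plain Java with no Dataflow dependency. The class `CsvRowToJson` and its methods are hypothetical names, every value is emitted as a JSON string, and only backslashes and double quotes are escaped (control characters and quoted CSV fields are not handled).

```java
import java.util.StringJoiner;

// Hypothetical helper for illustration; not taken from the answer's code.
public class CsvRowToJson {

    // Wraps a value in double quotes, escaping backslashes and double quotes only.
    static String quote(final String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Pairs each header label with the value at the same position in the line.
    public static String toJson(final String[] header, final String line) {
        // The -1 limit keeps trailing empty fields, so "1,2," yields three values.
        final String[] data = line.split(",", -1);
        if (data.length != header.length) {
            // Fail loudly instead of assuming all lines are valid.
            throw new IllegalArgumentException(
                    "Expected " + header.length + " fields but found "
                            + data.length + ": " + line);
        }
        final StringJoiner record = new StringJoiner(", ", "{", "}");
        for (int i = 0; i < header.length; i++) {
            record.add(quote(header[i]) + ": " + quote(data[i]));
        }
        return record.toString();
    }

    public static void main(final String[] args) {
        final String[] header = {"a", "b", "c"};
        System.out.println(toJson(header, "1,2,3")); // prints {"a": "1", "b": "2", "c": "3"}
        System.out.println(toJson(header, "4,5,6")); // prints {"a": "4", "b": "5", "c": "6"}
    }
}
```

For real data with quoted fields or embedded commas, a CSV parsing library and a JSON library would be a safer choice than `String.split` and manual escaping.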

I've implemented a quick (complete) solution, available on GitHub. I also added a Dataflow unit test to demonstrate reading:

@Test
public void test_reading() throws Exception {
    final File file =
            new File(getClass().getResource("/sample.csv").toURI());
    assertThat(file.exists()).isTrue();

    final Pipeline pipeline = TestPipeline.create();

    final PCollection<String> output =
            pipeline.apply(Read.from(CsvWithHeaderFileSource.from(file.getAbsolutePath())));

    DataflowAssert
            .that(output)
            .containsInAnyOrder("a:1, b:2, c:3, ", "a:4, b:5, c:6, ");

    pipeline.run();
}

where sample.csv has the following content:

a,b,c
1,2,3
4,5,6
