Reading CSV header with Dataflow
Question
I have a CSV file, and I don't know the column names ahead of time. I need to output the data as JSON after some transformations in Google Dataflow.
What's the best way to take the header row and propagate its labels to all the rows?
For example:
a,b,c
1,2,3
4,5,6
...becomes (approximately):
{a:1, b:2, c:3}
{a:4, b:5, c:6}
Recommended answer
You should implement a custom FileBasedSource (similar to TextIO.TextSource) that reads the first line and stores the header data:
@Override
protected void startReading(final ReadableByteChannel channel)
    throws IOException {
  lineReader = new LineReader(channel);
  if (lineReader.readNextLine()) {
    final String headerLine = lineReader.getCurrent().trim();
    header = headerLine.split(",");
    readingStarted = true;
  }
}
Later, while reading the remaining lines, prepend the header labels to the current line's data:
@Override
protected boolean readNextRecord() throws IOException {
  if (!lineReader.readNextLine()) {
    return false;
  }
  final String line = lineReader.getCurrent();
  final String[] data = line.split(",");
  // assumes all lines are valid
  final StringBuilder record = new StringBuilder();
  for (int i = 0; i < header.length; i++) {
    record.append(header[i]).append(":").append(data[i]).append(", ");
  }
  currentRecord = record.toString();
  return true;
}
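Note that the records built above look like "a:1, b:2, c:3, ", which is close to the requested output but not valid JSON. The following is a minimal sketch of the same header-plus-row pairing in plain Java, with no Dataflow dependency, that emits a JSON object per row. The class name CsvHeaderSketch and its methods are hypothetical and not part of the answer's code. It assumes a naive comma split (no quoted fields), emits every value as a JSON string, and rejects rows whose field count differs from the header instead of assuming all lines are valid.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Dataflow SDK.
class CsvHeaderSketch {

    // Pairs each header label with the value at the same position and
    // renders the row as a JSON object whose values are all strings.
    public static String toJson(String[] header, String[] data) {
        if (data.length != header.length) {
            throw new IllegalArgumentException(
                "expected " + header.length + " fields but got " + data.length);
        }
        StringBuilder record = new StringBuilder("{");
        for (int i = 0; i < header.length; i++) {
            if (i > 0) {
                record.append(", ");
            }
            record.append('"').append(escape(header[i])).append("\": \"")
                  .append(escape(data[i])).append('"');
        }
        return record.append('}').toString();
    }

    // Escapes only backslashes and double quotes; control characters
    // are not handled in this sketch.
    public static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Reads the first line as the header, then converts every following
    // non-blank line. The split is naive: quoted fields are not supported.
    public static List<String> convert(BufferedReader reader) throws IOException {
        List<String> records = new ArrayList<>();
        String headerLine = reader.readLine();
        if (headerLine == null) {
            return records; // empty input: no header, no records
        }
        String[] header = headerLine.trim().split(",", -1);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue;
            }
            records.add(toJson(header, line.split(",", -1)));
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String csv = "a,b,c\n1,2,3\n4,5,6\n";
        for (String record : convert(new BufferedReader(new StringReader(csv)))) {
            System.out.println(record);
        }
    }
}
```

Inside the custom source, the body of toJson could stand in for the StringBuilder loop in readNextRecord. For real-world CSV with quoted or escaped fields, a dedicated CSV parser and a JSON library would likely be a safer choice than hand-rolled splitting and escaping.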
I've implemented a quick (complete) solution, available on GitHub. I also added a Dataflow unit test to demonstrate reading:
@Test
public void test_reading() throws Exception {
  final File file =
      new File(getClass().getResource("/sample.csv").toURI());
  assertThat(file.exists()).isTrue();
  final Pipeline pipeline = TestPipeline.create();
  final PCollection<String> output =
      pipeline.apply(Read.from(CsvWithHeaderFileSource.from(file.getAbsolutePath())));
  DataflowAssert
      .that(output)
      .containsInAnyOrder("a:1, b:2, c:3, ", "a:4, b:5, c:6, ");
  pipeline.run();
}
where sample.csv has the following content:
a,b,c
1,2,3
4,5,6