从光束管道写入tfrecords? [英] Write tfrecords from beam pipeline?

查看:72
本文介绍了从光束管道写入tfrecords?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一些Map格式的数据,我想使用光束管道将它们转换为tfrecords. 这是我尝试编写代码的尝试.我曾尝试在python中工作,但我需要在java中实现此工作,因为那里存在无法移植到python的业务逻辑.可以在此问题中找到相应的有效python实现.

I have some data in Map format and I want to convert them to tfrecords, using the beam pipeline. Here is my attempt to write the code. I have attempted this in python which works but I need to implement this in java as some business logic is there which I can't port to python. The corresponding working python implementation can be found here in this question.

import com.google.protobuf.ByteString;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.commons.lang3.RandomStringUtils;
import org.tensorflow.example.BytesList;
import org.tensorflow.example.Example;
import org.tensorflow.example.Feature;
import org.tensorflow.example.Features;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class Sample {

    static class Foo extends DoFn<Map<String, String>, Example> {

        public static Feature stringToFeature(String value) {
            ByteString byteString = ByteString.copyFrom(value.getBytes(StandardCharsets.UTF_8));
            BytesList bytesList = BytesList.newBuilder().addValue(byteString).build();
            return Feature.newBuilder().setBytesList(bytesList).build();
        }

        public void processElement(@Element Map<String, String> element, OutputReceiver<Example> receiver) {

            Features features = Features.newBuilder()
                    .putFeature("foo", stringToFeature(element.get("foo")))
                    .putFeature("bar", stringToFeature(element.get("bar")))
                    .build();

            Example example = Example
                    .newBuilder()
                    .setFeatures(features)
                    .build();

            receiver.output(example);
        }

    }

    private static Map<String, String> generateRecord() {
        String[] keys = {"foo", "bar"};
        return IntStream.range(0,keys.length)
                .boxed()
                .collect(Collectors
                        .toMap(i -> keys[i],
                                i -> RandomStringUtils.randomAlphabetic(8)));
    }

    public static void main(String[] args) {

        List<Map<String, String>> records = new ArrayList<>();
        for (int i=0; i<10; i++) {
            records.add(generateRecord());
        }

        System.out.println(records);
        Pipeline p = Pipeline.create();

        p.apply("Input creation", Create.of(records))
                .apply("Encode to Exampple", ParDo.of(new Foo())).setCoder(ProtoCoder.of(Example.class))
                .apply("Write to disk",
                        TFRecordIO.write()
                                .to("output")
                                .withNumShards(2)
                                .withSuffix(".tfrecord"));

        p.run();


    }
}

对于上面的代码,我在编译时遇到以下错误

For the above code I am getting the following error at compile time

Error:(70, 17) java: no suitable method found for apply(java.lang.String,org.apache.beam.sdk.io.TFRecordIO.Write)
    method org.apache.beam.sdk.values.PCollection.<OutputT>apply(org.apache.beam.sdk.transforms.PTransform<? super org.apache.beam.sdk.values.PCollection<org.tensorflow.example.Example>,OutputT>) is not applicable
      (cannot infer type-variable(s) OutputT
        (actual and formal argument lists differ in length))
    method org.apache.beam.sdk.values.PCollection.<OutputT>apply(java.lang.String,org.apache.beam.sdk.transforms.PTransform<? super org.apache.beam.sdk.values.PCollection<org.tensorflow.example.Example>,OutputT>) is not applicable
      (cannot infer type-variable(s) OutputT
        (argument mismatch; org.apache.beam.sdk.io.TFRecordIO.Write cannot be converted to org.apache.beam.sdk.transforms.PTransform<? super org.apache.beam.sdk.values.PCollection<org.tensorflow.example.Example>,OutputT>))

推荐答案

TFRecordIO.write()的输入应为byte[],因此进行以下更改对我有用.

input to TFRecordIO.write() should be byte[] so making following changes worked for me.

static class Foo extends DoFn<Map<String, String>, byte[]> {

    public static Feature stringToFeature(String value) {
        ByteString byteString = ByteString.copyFrom(value.getBytes(StandardCharsets.UTF_8));
        BytesList bytesList = BytesList.newBuilder().addValue(byteString).build();
        return Feature.newBuilder().setBytesList(bytesList).build();
    }

    public void processElement(@Element Map<String, String> element, OutputReceiver<byte[]> receiver) {

        Features features = Features.newBuilder()
                .putFeature("foo", stringToFeature(element.get("foo")))
                .putFeature("bar", stringToFeature(element.get("bar")))
                .build();

        Example example = Example
                .newBuilder()
                .setFeatures(features)
                .build();

        receiver.output(example.toByteArray());
    }

}

这篇关于从光束管道写入tfrecords?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆