How to write a valid decoder based on a given .proto, reading from a .pb

Views: 398 · Published: 2018/12/10 10:33:14 · Tags: java, protocol-buffers


Question

Based on the answer to this question, I think I've provided my .pb file with a "faulty decoder".

Based on the ListPeople.java example provided in the Java tutorial documentation, I tried to write something similar to start picking apart that data. I wrote this:

import cc.refectorie.proj.relation.protobuf.DocumentProtos.Document;
import cc.refectorie.proj.relation.protobuf.DocumentProtos.Document.Sentence;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintStream;


public class ListDocument
{
    // Iterates through all sentences in the Document and prints their tokens.
    static void Print(Document document)
    {
        for ( Sentence sentence: document.getSentencesList() )
        {
            for(int i=0; i < sentence.getTokensCount(); i++)
            {
                System.out.println(" getTokens(" + i + ": " + sentence.getTokens(i) );
            }
        }
    }

    // Main function:  Reads the entire address book from a file and prints all
    //   the information inside.
    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage:  ListPeople ADDRESS_BOOK_FILE");
            System.exit(-1);
        }

        // Read the existing address book.
        Document addressBook =
                Document.parseFrom(new FileInputStream(args[0]));

        Print(addressBook);
    }
}

But when I run that, I get this error:

Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
    at com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:94)
    at com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:174)
    at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:194)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:210)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:215)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
    at cc.refectorie.proj.relation.protobuf.DocumentProtos$Document.parseFrom(DocumentProtos.java:4770)
    at ListDocument.main(ListDocument.java:40)

So, as I said above, I think this has to do with me not properly defining the decoder. Is there some way to look at the .proto file I'm trying to use and figure out a way to just read off all that data?

Is there some way to look at that .proto file and see what I'm doing wrong?

These are the first few lines of the file I want to read:

Ü
&/guid/9202a8c04000641f8000000003221072&/guid/9202a8c04000641f80000000004cfd50NA"Ö

S/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/2007-joint/1850511.xml.pb„€€€øÿÿÿÿƒ€€€øÿÿÿÿ"PERSON->PERSON"'inverse_false|PERSON|on bass and|PERSON"/inverse_false|with|PERSON|on bass and|PERSON|on"7inverse_false|, with|PERSON|on bass and|PERSON|on drums"$inverse_false|PERSON|IN NN CC|PERSON",inverse_false|with|PERSON|IN NN CC|PERSON|on"4inverse_false|, with|PERSON|IN NN CC|PERSON|on drums"`str:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"]str:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Rstr:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON"Adep:[NMOD]->|PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON"dir:->|PERSON|->-><-<-<-|PERSON"Sstr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"Adep:PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Pstr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Adep:PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Estr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON*ŒThe occasion was suitably exceptional : a reunion of the 1970s-era Sam Rivers Trio , with Dave Holland on bass and Barry Altschul on drums ."¬
S/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/2007-joint/1849689.xml.pb†€€€øÿÿÿÿ…€€€øÿÿÿÿ"PERSON->PERSON"'inverse_false|PERSON|on bass and|PERSON"/inverse_false|with|PERSON|on bass and|PERSON|on"7inverse_false|, with|PERSON|on bass and|PERSON|on drums"$inverse_false|PERSON|IN NN CC|PERSON",inverse_false|with|PERSON|IN NN CC|PERSON|on"4inverse_false|, with|PERSON|IN NN CC|PERSON|on drums"cstr:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"`str:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Ustr:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON"Cdep:[NMOD]->|PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON"dir:->|PERSON|->-><-<-<-|PERSON"Vstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"Cdep:PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Sstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Cdep:PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Hstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON*ÊTonight he brings his energies and expertise to the Miller Theater for the festival 's thrilling finale : a reunion of the 1970s Sam Rivers Trio , with Dave Holland on bass and Barry Altschul on drums .â
&/guid/9202a8c04000641f80000000004cfd50&/guid/9202a8c04000641f8000000003221072NA"Ù

Edit

This is a file another researcher used to parse these files, or so I was told. Is it possible that I could use this?

package edu.stanford.nlp.kbp.slotfilling.multir;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.zip.GZIPInputStream;

import edu.stanford.nlp.kbp.slotfilling.classify.MultiLabelDataset;
import edu.stanford.nlp.kbp.slotfilling.common.Log;
import edu.stanford.nlp.kbp.slotfilling.multir.DocumentProtos.Relation;
import edu.stanford.nlp.stats.ClassicCounter;
import edu.stanford.nlp.stats.Counter;
import edu.stanford.nlp.util.ErasureUtils;
import edu.stanford.nlp.util.HashIndex;
import edu.stanford.nlp.util.Index;

/**
 * Converts Hoffmann's data in protobuf format to our MultiLabelDataset
 * @author Mihai
 *
 */
public class ProtobufToMultiLabelDataset {
  static class RelationAndMentions {
    String arg1;
    String arg2;
    Set<String> posLabels;
    Set<String> negLabels;
    List<Mention> mentions;

    public RelationAndMentions(String types, String a1, String a2) {
      arg1 = a1;
      arg2 = a2;
      String [] rels = types.split(",");
      posLabels = new HashSet<String>();
      for(String r: rels){
        if(! r.equals("NA")) posLabels.add(r.trim());
      }
      negLabels = new HashSet<String>(); // will be populated later
      mentions = new ArrayList<Mention>();
    }
  };

  static class Mention {
    List<String> features;
    public Mention(List<String> feats) {
      features = feats;
    }
  }

    public static void main(String[] args) throws Exception {
      String input = args[0];

      InputStream is = new GZIPInputStream(
        new BufferedInputStream
        (new FileInputStream(input)));

      toMultiLabelDataset(is);
      is.close();
    }

    public static MultiLabelDataset<String, String> toMultiLabelDataset(InputStream is) throws IOException {
      List<RelationAndMentions> relations = toRelations(is, true);
      MultiLabelDataset<String, String> dataset = toDataset(relations);
      return dataset;
    }

    public static void toDatums(InputStream is,
        List<List<Collection<String>>> relationFeatures,
        List<Set<String>> labels) throws IOException {
      List<RelationAndMentions> relations = toRelations(is, false);
      toDatums(relations, relationFeatures, labels);
    }

    private static void toDatums(List<RelationAndMentions> relations,
        List<List<Collection<String>>> relationFeatures,
      List<Set<String>> labels) {
    for(RelationAndMentions rel: relations) {
      labels.add(rel.posLabels);
      List<Collection<String>> mentionFeatures = new ArrayList<Collection<String>>();
      for(int i = 0; i < rel.mentions.size(); i ++){
        mentionFeatures.add(rel.mentions.get(i).features);
      }
      relationFeatures.add(mentionFeatures);
    }
    assert(labels.size() == relationFeatures.size());
    }

    public static List<RelationAndMentions> toRelations(InputStream is, boolean generateNegativeLabels) throws IOException {
      //
      // Parse the protobuf
      //
    // all relations are stored here
    List<RelationAndMentions> relations = new ArrayList<RelationAndMentions>();
    // all known relations (without NIL)
    Set<String> relTypes = new HashSet<String>();
    Map<String, Map<String, Set<String>>> knownRelationsPerEntity =
      new HashMap<String, Map<String,Set<String>>>();
    Counter<Integer> labelCountHisto = new ClassicCounter<Integer>();
    Relation r = null;
    while ((r = Relation.parseDelimitedFrom(is)) != null) {
      RelationAndMentions relation = new RelationAndMentions(
          r.getRelType(), r.getSourceGuid(), r.getDestGuid());
      labelCountHisto.incrementCount(relation.posLabels.size());
      relTypes.addAll(relation.posLabels);
      relations.add(relation);

      for(int i = 0; i < r.getMentionCount(); i ++) {
        DocumentProtos.Relation.RelationMentionRef mention = r.getMention(i);
        // String s = mention.getSentence();
        relation.mentions.add(new Mention(mention.getFeatureList()));
      }

      for(String l: relation.posLabels) {
        addKnownRelation(relation.arg1, relation.arg2, l, knownRelationsPerEntity);
      }
    }
    Log.severe("Loaded " + relations.size() + " relations.");
    Log.severe("Found " + relTypes.size() + " relation types: " + relTypes);
    Log.severe("Label count histogram: " + labelCountHisto);

    Counter<Integer> slotCountHisto = new ClassicCounter<Integer>();
    for(String e: knownRelationsPerEntity.keySet()) {
      slotCountHisto.incrementCount(knownRelationsPerEntity.get(e).size());
    }
    Log.severe("Slot count histogram: " + slotCountHisto);
    int negativesWithKnownPositivesCount = 0, totalNegatives = 0;
    for(RelationAndMentions rel: relations) {
      if(rel.posLabels.size() == 0) {
        if(knownRelationsPerEntity.get(rel.arg1) != null &&
           knownRelationsPerEntity.get(rel.arg1).size() > 0) {
          negativesWithKnownPositivesCount ++;
        }
        totalNegatives ++;
      }
    }
    Log.severe("Found " + negativesWithKnownPositivesCount + "/" + totalNegatives +
        " negative examples with at least one known relation for arg1.");

    Counter<Integer> mentionCountHisto = new ClassicCounter<Integer>();
    for(RelationAndMentions rel: relations) {
      mentionCountHisto.incrementCount(rel.mentions.size());
      if(rel.mentions.size() > 100)
        Log.fine("Large relation: " + rel.mentions.size() + "\t" + rel.posLabels);
    }
    Log.severe("Mention count histogram: " + mentionCountHisto);

    //
    // Detect the known negatives for each source entity
    //
    if(generateNegativeLabels) {
      for(RelationAndMentions rel: relations) {
        Set<String> negatives = new HashSet<String>(relTypes);
        negatives.removeAll(rel.posLabels);
        rel.negLabels = negatives;
      }
    }

    return relations;
    }

    private static MultiLabelDataset<String, String> toDataset(List<RelationAndMentions> relations) {
    int [][][] data = new int[relations.size()][][];
    Index<String> featureIndex = new HashIndex<String>();
    Index<String> labelIndex = new HashIndex<String>();
    Set<Integer> [] posLabels = ErasureUtils.<Set<Integer> []>uncheckedCast(new Set[relations.size()]);
    Set<Integer> [] negLabels = ErasureUtils.<Set<Integer> []>uncheckedCast(new Set[relations.size()]);

    int offset = 0, posCount = 0;
    for(RelationAndMentions rel: relations) {
      Set<Integer> pos = new HashSet<Integer>();
      Set<Integer> neg = new HashSet<Integer>();
      for(String l: rel.posLabels) {
        pos.add(labelIndex.indexOf(l, true));
      }
      for(String l: rel.negLabels) {
        neg.add(labelIndex.indexOf(l, true));
      }
      posLabels[offset] = pos;
      negLabels[offset] = neg;
      int [][] group = new int[rel.mentions.size()][];
      for(int i = 0; i < rel.mentions.size(); i ++){
        List<String> sfeats = rel.mentions.get(i).features;
        int [] features = new int[sfeats.size()];
        for(int j = 0; j < sfeats.size(); j ++) {
          features[j] = featureIndex.indexOf(sfeats.get(j), true);
        }
        group[i] = features;
      }
      data[offset] = group;
      posCount += posLabels[offset].size();
      offset ++;
    }

    Log.severe("Creating a dataset with " + data.length + " datums, out of which " + posCount + " are positive.");
    MultiLabelDataset<String, String> dataset = new MultiLabelDataset<String, String>(
        data, featureIndex, labelIndex, posLabels, negLabels);
    return dataset;
    }

    private static void addKnownRelation(String arg1, String arg2, String label,
        Map<String, Map<String, Set<String>>> knownRelationsPerEntity) {
      Map<String, Set<String>> myRels = knownRelationsPerEntity.get(arg1);
      if(myRels == null) {
        myRels = new HashMap<String, Set<String>>();
        knownRelationsPerEntity.put(arg1, myRels);
      }
      Set<String> mySlots = myRels.get(label);
      if(mySlots == null) {
        mySlots = new HashSet<String>();
        myRels.put(label, mySlots);
      }
      mySlots.add(arg2);
    }
}

Recommended answer

Updated; the confusion here comes down to two points:

The root object is Relation, not Document (in fact, only Relation and RelationMentionRef are even used).
The .pb file actually contains multiple objects, each varint-delimited, i.e. prefixed by its length expressed as a varint.
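To make the second point concrete, here is a minimal sketch of the framing only, using nothing but the JDK. It reads a varint length prefix, skips that many payload bytes, and repeats until the stream ends. The class and method names (DelimitedFraming, readLengthPrefix, countRecords) are made up for illustration, and the sample bytes are invented; it does not decode the Relation payloads themselves.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of protobuf: walks varint-delimited records.
final class DelimitedFraming {

    /**
     * Reads one base-128 varint length prefix (at most 5 bytes).
     * Returns -1 if the stream ends cleanly before the first byte.
     */
    static long readLengthPrefix(InputStream in) throws IOException {
        long result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.read();
            if (b == -1) {
                if (shift == 0) {
                    return -1; // clean end of stream between records
                }
                throw new EOFException("stream ended inside a length prefix");
            }
            result |= (long) (b & 0x7F) << shift; // low 7 bits carry data
            if ((b & 0x80) == 0) {
                return result; // high bit clear: last byte of the varint
            }
        }
        throw new IOException("length prefix is longer than 5 bytes");
    }

    /** Counts length-prefixed records without decoding their payloads. */
    static int countRecords(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int count = 0;
        long length;
        while ((length = readLengthPrefix(data)) != -1) {
            if (length > Integer.MAX_VALUE) {
                throw new IOException("record too large: " + length);
            }
            // readFully throws EOFException if the record is truncated.
            data.readFully(new byte[(int) length]);
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Two invented records: a 3-byte payload and a 1-byte payload.
        byte[] sample = {0x03, 'a', 'b', 'c', 0x01, 'z'};
        System.out.println(countRecords(new ByteArrayInputStream(sample)));
    }
}
```

With the generated classes on the classpath, the loop `while ((r = Relation.parseDelimitedFrom(is)) != null)` from the researcher's code in the question does the same walk and parses each payload as a Relation.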

As such, Relation.parseDelimitedFrom should work. Processing it manually, I get:

test-multiple.pb, 96678 Relation objects parsed
testNegative.pb, 94917 Relation objects parsed
testPositive.pb, 1950 Relation objects parsed
trainNegative.pb, 63596 Relation objects parsed
trainPositive.pb, 4700 Relation objects parsed

Old; outdated; exploratory:

I extracted your 4 documents and ran them through a little test rig:

        ProcessFile("testNegative.pb");
        ProcessFile("testPositive.pb");
        ProcessFile("trainNegative.pb");
        ProcessFile("trainPositive.pb");

where ProcessFile first dumps the first 10 bytes as hex, and then tries to process it via a ProtoReader. Here are the results:
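ProcessFile itself is not shown. As a rough idea of its first step, the following sketch dumps the leading bytes of a stream as hex using only the JDK. The names HexPeek and firstBytesAsHex are hypothetical, and the sample bytes are copied from the testNegative.pb dump in this answer rather than read from a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for the hex-dump half of ProcessFile.
final class HexPeek {

    /** Returns up to `count` leading bytes as space-separated lowercase hex. */
    static String firstBytesAsHex(InputStream in, int count) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            int b = in.read();
            if (b == -1) {
                break; // stream is shorter than `count` bytes
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Leading bytes as reported for testNegative.pb.
        byte[] head = {(byte) 0xdc, 0x16, 0x0a, 0x26, 0x2f, 0x67, 0x75, 0x69, 0x64, 0x2f};
        System.out.println(firstBytesAsHex(new ByteArrayInputStream(head), 10));
    }
}
```

For a real file, pass a FileInputStream (wrapped in a GZIPInputStream if the file is gzipped) instead of the in-memory sample.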

Processing: testNegative.pb
dc 16 0a 26 2f 67 75 69 64 2f
> Document
Unexpected end-group in source data; this usually means the source data is corrupt

Yep; agreed; DC 16 is a two-byte varint that decodes to wire-type 4 (end-group), field 363; your document does not define field 363, and even if it did: it is meaningless to start with an end-group.

Processing: testPositive.pb
d5 0f 0a 26 2f 67 75 69 64 2f
> Document
250: Fixed32, Unexpected field
14: Fixed32, Unexpected field
6: String, Unexpected field
6: Variant, Unexpected field
Unexpected end-group in source data; this usually means the source data is corrupt

Here we can't see the offending data in the hex dump, but again: the initial fields look nothing like your data, and the reader readily confirms that the data is corrupt.

Processing: trainNegative.pb
d1 09 0a 26 2f 67 75 69 64 2f
> Document
154: Fixed64, Unexpected field
7: Fixed64, Unexpected field
6: Variant, Unexpected field
6: Variant, Unexpected field
Unexpected end-group in source data; this usually means the source data is corrupt

Same as above.

Processing: trainPositive.pb
cf 75 0a 26 2f 67 75 69 64 2f
> Document
1881: 7, Unexpected field
Invalid wire-type; this usually means you have over-written a file without truncating or setting the length; see http://stackoverflow.com/q/2152978/23354

CF 75 is a two-byte varint with wire-type 7 (which is not defined in the specification).
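The arithmetic behind these readings can be sketched as follows: decode the leading bytes as a varint, then treat the low 3 bits as the wire-type and the remaining bits as the field number. TagPeek and its methods are illustrative names, not a protobuf API. The results for d5 0f, d1 09 and cf 75 agree with the field numbers the reader logged (250, 154, 1881).

```java
// Hypothetical helper: interprets the first bytes of a file as a protobuf tag.
final class TagPeek {

    /** Decodes a base-128 varint from the given byte values (at most 5). */
    static int decodeVarint(int... bytes) {
        int result = 0;
        for (int i = 0; i < bytes.length && i < 5; i++) {
            result |= (bytes[i] & 0x7F) << (7 * i); // low 7 bits carry data
            if ((bytes[i] & 0x80) == 0) {
                return result; // high bit clear: last byte of the varint
            }
        }
        throw new IllegalArgumentException("varint is not terminated within 5 bytes");
    }

    /** A tag stores the field number in everything above the low 3 bits. */
    static int fieldNumber(int tag) {
        return tag >>> 3;
    }

    /** A tag stores the wire-type in its low 3 bits. */
    static int wireType(int tag) {
        return tag & 0x7;
    }

    static String describe(int... bytes) {
        int tag = decodeVarint(bytes);
        return "field " + fieldNumber(tag) + ", wire-type " + wireType(tag);
    }

    public static void main(String[] args) {
        System.out.println(describe(0xdc, 0x16)); // field 363, wire-type 4
        System.out.println(describe(0xd5, 0x0f)); // field 250, wire-type 5
        System.out.println(describe(0xd1, 0x09)); // field 154, wire-type 1
        System.out.println(describe(0xcf, 0x75)); // field 1881, wire-type 7
    }
}
```

Read as a length prefix instead of a tag, the same varints are simply record sizes: decodeVarint(0xdc, 0x16) is 2908, which fits the updated conclusion that each file starts with the length of its first Relation.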

Your data is well and truly garbage. Sorry.

And with the bonus round of test-multiple.pb from the comments (after gz decompression):

Processing: test-multiple.pb
dc 16 0a 26 2f 67 75 69 64 2f
> Document
Unexpected end-group in source data; this usually means the source data is corrupt

This starts identically to testNegative.pb, and hence fails for exactly the same reason.
