Read annotated data from GATE datastore

Question

I use GATE to manually annotate a large number of texts with the emotions they contain. To process these texts further, I would like to export them from the datastore into my own Java application. I haven't found any documentation on how to do that. I have already written a program that imports data into the datastore, but I have no idea how to get the annotations out again. I also tried to open the Lucene-based datastore with Luke (https://code.google.com/p/luke/), a tool that can read a Lucene index, but it could not open the GATE Lucene datastore :( Does anyone know how to read the annotated text from the datastore?

Recommended answer

You can use the GATE API to load the documents from the datastore and then export them as GATE XML in the normal way (imports and exception handling omitted):

Gate.init();
DataStore ds = Factory.openDataStore("gate.creole.annic.SearchableDataStore", "file:/path/to/datastore");
List docIds = ds.getLrIds("gate.corpora.DocumentImpl");
for(Object id : docIds) {
  Document d = (Document)Factory.createResource("gate.corpora.DocumentImpl",
            gate.Utils.featureMap(DataStore.DATASTORE_FEATURE_NAME, ds,
                                  DataStore.LR_ID_FEATURE_NAME, id));
  try {
    File outputFile = new File(...); // based on doc name, sequential number, etc.
    DocumentStaxUtils.writeDocument(d, outputFile);
  } finally {
    Factory.deleteResource(d);
  }
}
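
The snippet above leaves the choice of output file open ("based on doc name, sequential number, etc."). As one possible way to fill that in, here is a minimal sketch that combines a sequential number with a sanitised document name. The class `OutputFileNamer` and the method `outputFileFor` are hypothetical names made up for this illustration and are not part of GATE; in the loop you would presumably pass `d.getName()` and a counter that you increment yourself.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the GATE API.
class OutputFileNamer {

  // Builds a file such as "0007_my_doc.xml" inside outputDir from a
  // sequential index and a document name.
  static File outputFileFor(File outputDir, int index, String docName) {
    // Fall back to a fixed stem if the document has no usable name.
    String stem = (docName == null || docName.isEmpty()) ? "document" : docName;
    // Replace everything outside a conservative character set so the
    // result is a valid file name on common file systems.
    String safe = stem.replaceAll("[^A-Za-z0-9._-]", "_");
    // The zero-padded index keeps names unique even when two documents
    // share the same name, and makes the files sort in export order.
    return new File(outputDir, String.format(Locale.ROOT, "%04d_%s.xml", index, safe));
  }

  public static void main(String[] args) {
    File out = outputFileFor(new File("export"), 7, "my doc/1.txt");
    System.out.println(out.getName());
  }
}
```

Prefixing the index rather than relying on the name alone is a deliberate choice here: document names in a datastore are not guaranteed to be unique, so a name-only scheme could silently overwrite earlier exports.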

If you want to write the annotations as inline XML, replace DocumentStaxUtils.writeDocument with something like:

Set<String> types = new HashSet<String>();
types.add("Person");
types.add("Location"); // and whatever others you're interested in
FileUtils.write(outputFile, d.toXml(d.getAnnotations().get(types), true));

(I'm using FileUtils from Apache commons-io for convenience, but you could equally handle opening and closing the file yourself.)
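
For readers who prefer not to add commons-io, handling the file yourself could look roughly like the sketch below, which uses only the Java standard library. The class `InlineXmlFileWriter` and the method `writeUtf8` are hypothetical names chosen for this example, and UTF-8 is an assumed encoding, not something the answer specifies. In the loop above, the second argument would be the string returned by `d.toXml(...)`.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper for illustration; not part of GATE or commons-io.
class InlineXmlFileWriter {

  // Writes the text to outputFile as UTF-8 (an assumed encoding),
  // creating the file or replacing its existing content.
  static void writeUtf8(File outputFile, String xml) throws IOException {
    // try-with-resources closes the writer even if write() throws.
    try (BufferedWriter out =
             Files.newBufferedWriter(outputFile.toPath(), StandardCharsets.UTF_8)) {
      out.write(xml);
    }
  }

  public static void main(String[] args) throws IOException {
    File tmp = File.createTempFile("gate-inline", ".xml");
    tmp.deleteOnExit();
    // Stand-in text; in the export loop this would be the d.toXml(...) result.
    writeUtf8(tmp, "<Person>Ada</Person> lives in <Location>London</Location>");
    System.out.println(
        new String(Files.readAllBytes(tmp.toPath()), StandardCharsets.UTF_8));
  }
}
```

Passing the charset explicitly avoids depending on the platform default encoding, which matters if the annotated texts contain non-ASCII characters.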
