Store metadata into Jackrabbit repository


Problem description




Can anybody explain to me how to proceed in the following scenario?

  1. receiving documents (MS docs, ODS, PDF)

  2. Dublin Core metadata extraction via Apache Tika + content extraction via jackrabbit-content-extractors

  3. using Jackrabbit to store documents (content) into the repository together with their metadata

  4. retrieving documents + metadata

I'm interested in points 3 and 4 ...

DETAILS: The application processes documents interactively (some analysis such as language detection and word count, plus gathering as many details as possible: Dublin Core metadata, content parsing, event handling). It returns the results of the processing to the user and then stores the extracted content and metadata (both extracted and custom user metadata) in the JCR repository.

I appreciate any help, thank you.

Solution

Uploading files is basically the same for JCR 2.0 as it is for JCR 1.0. However, JCR 2.0 adds a few additional built-in property definitions that are useful.

The "nt:file" node type is intended to represent a file and has two built-in property definitions in JCR 2.0 (both of which are auto-created by the repository when nodes are created):

  • jcr:created (DATE)
  • jcr:createdBy (STRING)

and defines a single child named "jcr:content". This "jcr:content" node can be of any node type, but generally speaking all information pertaining to the content itself is stored on this child node. The de facto standard is to use the "nt:resource" node type, which has these properties defined:

  • jcr:data (BINARY) mandatory
  • jcr:lastModified (DATE) autocreated
  • jcr:lastModifiedBy (STRING) autocreated
  • jcr:mimeType (STRING) protected?
  • jcr:encoding (STRING) protected?

Note that "jcr:mimeType" and "jcr:encoding" were added in JCR 2.0.

In particular, the purpose of the "jcr:mimeType" property was to do exactly what you're asking for - capture the "type" of the content. However, the "jcr:mimeType" and "jcr:encoding" property definitions can be defined (by the JCR implementation) as protected (meaning the JCR implementation automatically sets them) - if this is the case, you would not be allowed to manually set these properties. I believe that Jackrabbit and ModeShape do not treat these as protected.

Here is some code that shows how to upload a file into a JCR 2.0 repository using these built-in node types:

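// Requires java.io.* plus javax.jcr.Binary and javax.jcr.Node
// Get an input stream for the file ...
File file = ...
InputStream stream = new BufferedInputStream(new FileInputStream(file));

// Create the nt:file node and its jcr:content child under an existing folder node
// (named fileNode so it does not clash with the java.io.File variable above)
Node folder = session.getNode("/absolute/path/to/folder/node");
Node fileNode = folder.addNode("Article.pdf","nt:file");
Node content = fileNode.addNode("jcr:content","nt:resource");

// Wrap the stream in a JCR Binary and store it as the file's data
Binary binary = session.getValueFactory().createBinary(stream);
content.setProperty("jcr:data",binary);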

And if the JCR implementation does not treat the "jcr:mimeType" property as protected (i.e., Jackrabbit and ModeShape), you'd have to set this property manually:

content.setProperty("jcr:mimeType","application/pdf");
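When the MIME type has to be set by hand, the application needs some way to choose the value. The following is only a minimal sketch of one option, a lookup by file extension. The class `MimeTypes`, its method `forFileName`, and the small extension table are hypothetical names introduced for illustration; in a setup that already uses Apache Tika, its detection result would more likely be used instead.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of JCR or Jackrabbit): picks a jcr:mimeType value from a file name. */
final class MimeTypes {
    // Illustrative table covering the document kinds mentioned in the question.
    private static final Map<String, String> BY_EXTENSION = Map.of(
        "pdf", "application/pdf",
        "doc", "application/msword",
        "docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "ods", "application/vnd.oasis.opendocument.spreadsheet");

    // Generic fallback for unknown or missing extensions.
    private static final String FALLBACK = "application/octet-stream";

    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK; // no extension at all, or a trailing dot
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, FALLBACK);
    }

    public static void main(String[] args) {
        System.out.println(forFileName("Article.pdf"));
    }
}
```

The returned string could then be passed as the second argument of content.setProperty("jcr:mimeType", ...).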

Metadata can very easily be stored on the "nt:file" and "jcr:content" nodes, but out-of-the-box the "nt:file" and "nt:resource" node types don't allow for extra properties. So before you can add other properties, you first need to add a mixin (or multiple mixins) that have property definitions for the kinds of properties you want to store. You can even define a mixin that would allow any property. Here is a CND file defining such a mixin:

<custom = 'http://example.com/mydomain'>
[custom:extensible] mixin
- * (undefined) multiple 
- * (undefined) 

After registering this node type definition, you can then use this on your nodes:

content.addMixin("custom:extensible");
content.setProperty("anyProp","some value");
content.setProperty("custom:otherProp","some other value");

You could also define and use a mixin that allowed for any Dublin Core element:

<dc = 'http://purl.org/dc/elements/1.1/'>
[dc:metadata] mixin
- dc:contributor (STRING)
- dc:coverage (STRING)
- dc:creator (STRING)
- dc:date (DATE)
- dc:description (STRING)
- dc:format (STRING)
- dc:identifier (STRING)
- dc:language (STRING)
- dc:publisher (STRING)
- dc:relation (STRING)
- dc:rights (STRING)
- dc:source (STRING)
- dc:subject (STRING)
- dc:title (STRING)
- dc:type (STRING)

All of these properties are optional, and, unlike "custom:extensible", this mixin doesn't allow properties of arbitrary name or type. With this 'dc:metadata' mixin I've also not really addressed the fact that some of these elements are already represented by built-in properties (e.g., "jcr:createdBy", "jcr:lastModifiedBy", "jcr:created", "jcr:lastModified", "jcr:mimeType"), or that some of them may relate more to the content while others relate more to the file.
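Because a mixin like this accepts only the properties it declares, extracted metadata has to be filtered and renamed before it is written to the node. Here is a rough sketch of that step, assuming the extractor output has already been flattened into a plain Map keyed by bare element names such as "title" or "creator". The class `DublinCoreProperties` and the method `toJcrProperties` are hypothetical, and the element list must match whatever the registered mixin actually declares.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: keeps only Dublin Core elements and prefixes them with "dc:". */
final class DublinCoreProperties {
    // The fifteen elements of the Dublin Core Metadata Element Set 1.1.
    private static final Set<String> ELEMENTS = Set.of(
        "contributor", "coverage", "creator", "date", "description",
        "format", "identifier", "language", "publisher", "relation",
        "rights", "source", "subject", "title", "type");

    /** Returns dc:-prefixed names for recognised elements; other keys and null values are skipped. */
    static Map<String, String> toJcrProperties(Map<String, String> extracted) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            if (ELEMENTS.contains(entry.getKey()) && entry.getValue() != null) {
                result.put("dc:" + entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("title", "Article");
        extracted.put("wordCount", "120"); // not a Dublin Core element, so it is dropped
        System.out.println(toJcrProperties(extracted));
    }
}
```

Each resulting entry could then be written with content.setProperty(name, value) once the mixin has been added to the node. Since "dc:date" is declared as DATE, its value would presumably need to be a Calendar or a string the repository can convert to a date. Custom, non-Dublin-Core values such as the word count would need a different mixin.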

You could of course define other mixins that better suit your metadata needs, using inheritance where needed. But be careful using inheritance with mixins: since JCR allows a node to have multiple mixins, it's often best to design your mixins to be tightly scoped and facet-oriented (e.g., "ex:taggable", "ex:describable", etc.) and then simply apply the appropriate mixins to a node as needed.
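As an illustration of what facet-oriented mixins might look like, here is a small CND sketch. Only the mixin names "ex:taggable" and "ex:describable" come from the text above; the namespace URI and the property names are assumptions made up for this example.

```
// Hypothetical namespace URI for the "ex" prefix
<ex = 'http://example.com/ex'>

// Facet: the node can carry any number of tags
[ex:taggable] mixin
- ex:tags (STRING) multiple

// Facet: the node can carry a short description and a detected language
[ex:describable] mixin
- ex:summary (STRING)
- ex:language (STRING)
```

A node would then receive only the facets it needs, for example content.addMixin("ex:taggable") for a document the user has tagged.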

(It's even possible, though much more complicated, to define a mixin that allows more children under the "nt:file" nodes, and to store some metadata there.)

Mixins are fantastic and give a tremendous amount of flexibility and power to your JCR content.

Oh, and when you've created all of the nodes you want, be sure to save the session:

session.save();
