Indexing multiple documents and mapping them to a unique Solr ID


Question

My use case is to index two files, a metadata file and a binary PDF file, under a single unique Solr ID. The metadata file is an XML file, and some schema fields are mapped to elements in that XML file.

What I do: extract the content from the PDF files (using pdftotext), process that content, and retrieve specific information (for example, the PDF's first page or first line has information about the medicine and the research stage). The retrieved information (medicine, research stage) needs to be indexed so that one can search, sort, and facet on it.

I can create an XML file with the retrieved information (let's call this the metadata file). Now assume my schema is:

<field name="medicine" type="text" stored="true" indexed="true"/>
<field name="researchStage". ../>

Is there a way to put this metadata file and the PDF file into Solr?

Things I tried:

  1. Based on a suggestion in the archives, I zipped these files and passed the archive to the ExtractingRequestHandler. I was able to put all the content into Solr and make it searchable, but it appears as the content of the zip file. (I had to apply some patches to the Solr code base to make this work.) This is not sufficient, because the content of the metadata file is not mapped to field names. curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@file.zip"

  2. I tried to work with DataImportHandler (BinURLDataSource), but I don't think I understand how it works, so I could not get far.

  3. I thought of adding metadata tags to the PDF itself. For this to work, the ExtractingRequestHandler would have to process that metadata, and I am not sure that it does. I tried "pdftk" to add the metadata, but was not able to add custom tags with it; it only updates or adds title, author, keywords, etc. Does anyone know of a similar Unix tool?

If someone has tips, please share. I want to avoid creating a single file (by merging the PDF text and the metadata file).

Recommended answer

Given a file record1234.pdf and metadata like:

<metadata>
<field1>value1</field1>
<field2>value2</field2>
<field3>value3</field3>
</metadata>

Indexing the two together is programmatically equivalent to:

curl "http://localhost:8983/solr/update/extract?literal.id=record1234.pdf&literal.field1=value1&literal.field2=value2&literal.field3=value3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3" -F "myfile=@record1234.pdf"

Adapted from http://wiki.apache.org/solr/ExtractingRequestHandler#Literals.

This will create a new entry in the index containing the text output from Tika/Solr Cell as well as the fields you specify.

You should be able to perform these operations in your favorite language.
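As one illustration of doing this from a program, here is a minimal Python sketch that reads the metadata XML, turns each element into a literal.* parameter, and sends the PDF to the extract handler. The function names (literal_params, extract_url, index_pdf) are made up for this example. It also assumes that the handler accepts the raw PDF bytes as the request body with a Content-Type header, instead of the multipart upload that curl -F performs, and it adds commit=true as in the question's own curl command.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def literal_params(metadata_xml, doc_id):
    """Turn each child of <metadata> into a (literal.<name>, value) pair."""
    root = ET.fromstring(metadata_xml)
    params = [("literal.id", doc_id)]
    for child in root:
        params.append(("literal." + child.tag, (child.text or "").strip()))
    return params


def extract_url(base_url, params):
    """Append the URL-encoded parameters to the extract handler URL."""
    return base_url + "?" + urllib.parse.urlencode(params)


def index_pdf(base_url, pdf_path, metadata_xml, doc_id):
    """POST the PDF bytes; the metadata travels in the query string.

    Assumption: Solr is reachable at base_url and accepts a raw request body.
    """
    params = literal_params(metadata_xml, doc_id) + [("commit", "true")]
    with open(pdf_path, "rb") as fh:
        request = urllib.request.Request(
            extract_url(base_url, params),
            data=fh.read(),
            headers={"Content-Type": "application/pdf"},
        )
    with urllib.request.urlopen(request) as response:
        return response.status


METADATA = """<metadata>
<field1>value1</field1>
<field2>value2</field2>
<field3>value3</field3>
</metadata>"""

if __name__ == "__main__":
    # Only builds and prints the URL; call index_pdf() against a running Solr.
    print(extract_url("http://localhost:8983/solr/update/extract",
                      literal_params(METADATA, "record1234.pdf")))
```

Because the URL is built with urlencode, metadata values containing spaces or ampersands are escaped, which the hand-written curl line would not do.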

"the content in metadata file is not mapped to field names"

If they don't map to a predefined field, then use dynamic fields. For example, you can declare *_i to be an integer field.
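To show how dynamic fields could be used when building the literal.* parameters, here is a small Python sketch that keeps a metadata name unchanged if the schema declares it and otherwise appends a dynamic-field suffix. The set of schema fields is taken from the question. The helper name solr_field_name is hypothetical, and so is the *_s string pattern, which is assumed to be declared as a dynamicField in the schema alongside *_i.

```python
# Fields declared explicitly in the schema (taken from the question).
SCHEMA_FIELDS = {"id", "medicine", "researchStage"}


def solr_field_name(name, value, schema_fields=SCHEMA_FIELDS):
    """Return the Solr field name to use for one metadata element.

    Declared fields keep their name. Anything else is routed to a dynamic
    field: *_i when the value parses as an integer, *_s (assumed) otherwise.
    """
    if name in schema_fields:
        return name
    try:
        int(value)
    except ValueError:
        return name + "_s"
    return name + "_i"


if __name__ == "__main__":
    # Hypothetical metadata values, for illustration only.
    for name, value in [("medicine", "aspirin"), ("pageCount", "12"), ("sponsor", "Acme")]:
        print("literal." + solr_field_name(name, value) + "=" + value)
```

With this mapping, an element the schema does not know about, such as the hypothetical pageCount, would be sent as literal.pageCount_i.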

"I want to avoid creating 1 file (by merging PDF text + metadata file)."

That looks like programmer fatigue :-) But do you have a good reason?
