Indexing multiple documents and mapping to a unique Solr id

Question

My use case is to index two files, a metadata file and a binary PDF file, under a single unique Solr id. The metadata file is an XML file, and some schema fields are mapped to elements in that XML file.

What I do: extract text from the PDF files (using pdftotext), process that text and retrieve specific information (for example, the PDF's first page/line has information about the medicine and the research stage). The retrieved information (medicine/research stage) needs to be indexed, and one should be able to search/sort/facet on it.
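As a rough illustration of this extraction step, here is a minimal Python sketch. It assumes pdftotext is on the PATH and that the first line of the extracted text holds the medicine name and research stage separated by a pipe; the file name, that layout and the field names are placeholders only:

import subprocess

def extract_first_line_fields(pdf_path):
    # "-" makes pdftotext write the extracted text to stdout instead of a .txt file
    text = subprocess.run(["pdftotext", pdf_path, "-"],
                          capture_output=True, text=True, check=True).stdout
    # Placeholder parsing: assume the first line looks like "medicine | research stage"
    medicine, stage = (part.strip() for part in
                       text.strip().splitlines()[0].split("|", 1))
    return {"medicine": medicine, "researchStage": stage}

print(extract_first_line_fields("input.pdf"))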

I can create an XML file with the retrieved information (let's call this the metadata file). Now assume my schema is:

<field name="medicine" type="text" stored="true" indexed="true"/>
<field name="researchStage". ../>

Is there a way to put this metadata file and the PDF file in Solr?

What I have tried:

  1. Based on a suggestion in the archives, I zipped these files and gave the zip to ExtractingRequestHandler. I was able to put all the content in Solr and make it searchable, but it appears as the content of a zip file. (I had to apply some patches to the Solr code base to make this work.) This is still not sufficient, because the content in the metadata file is not mapped to field names.

     curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@file.zip"

  2. I tried to work with the DataImportHandler (BinURLDataSource), but I don't think I understand how it works, so I could not get far. (A sketch of what I was attempting appears after this list.)

  3. I thought of adding metadata tags to the PDF itself. For this to work, ExtractingRequestHandler would have to pick up that metadata, and I am not sure it does. So I tried pdftk to add the metadata, but was not able to add custom tags with it; it only updates/adds title/author/keywords etc. Does anyone know of a similar Unix tool?
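For reference, the DataImportHandler route I was attempting (point 2 above) looked roughly like the data-config.xml below. This is only a sketch with a placeholder URL, pairing BinURLDataSource with TikaEntityProcessor, and I never got it working:

<dataConfig>
  <dataSource type="BinURLDataSource" name="bin"/>
  <document>
    <entity name="pdf" processor="TikaEntityProcessor"
            url="http://localhost/files/record1234.pdf"
            dataSource="bin" format="text">
      <!-- TikaEntityProcessor exposes the extracted body as the "text" column -->
      <field column="text" name="text"/>
    </entity>
  </document>
</dataConfig>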

If someone has tips, please share. I want to avoid creating a single file (by merging the PDF text and the metadata file).

Answer

Given a file record1234.pdf and metadata like:

<metadata>
<field1>value1</field1>
<field2>value2</field2>
<field3>value3</field3>
</metadata>

do an extraction like this:

curl "http://localhost:8983/solr/update/extract?
literal.id=record1234.pdf
&literal.field1=value1
&literal.field2=value2
&literal.field3=value3
&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3&"  -F "tutorial=@tutorial.pdf"

Adapted from http://wiki.apache.org/solr/ExtractingRequestHandler#Literals

This will create a new entry in the index containing the text output from Tika/Solr Cell as well as the fields you specify.

You should be able to perform these operations in your favorite language.
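For instance, a minimal Python sketch, assuming the requests library, a local Solr at localhost:8983, and a metadata file shaped like the <metadata> example above (the file names here are placeholders):

import requests
import xml.etree.ElementTree as ET

def index_pdf_with_metadata(pdf_path, metadata_path, doc_id,
                            solr_url="http://localhost:8983/solr/update/extract"):
    # Turn each child element of <metadata> into a literal.<tag> request parameter
    params = {"literal.id": doc_id, "commit": "true"}
    for element in ET.parse(metadata_path).getroot():
        params["literal." + element.tag] = element.text
    # Post the PDF itself; Tika extracts its text on the Solr side
    with open(pdf_path, "rb") as pdf:
        response = requests.post(solr_url, params=params, files={"file": pdf})
    response.raise_for_status()

index_pdf_with_metadata("record1234.pdf", "record1234.xml", "record1234.pdf")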

"the content in the metadata file is not mapped to field names"

If they don't map to a predefined field, then use dynamic fields. For example, you can set *_i to be an integer field.
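In schema.xml that is a one-line declaration along these lines (the exact field type name depends on your schema; this mirrors the stock example schema):

<dynamicField name="*_i" type="int" indexed="true" stored="true"/>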

"I want to avoid creating a single file (by merging the PDF text and the metadata file)."

That looks like programmer fatigue :-) But do you have a good reason?
