Best practices for searchable archive of thousands of documents (pdf and/or xml)

Question

Revisiting a stalled project and looking for advice in modernizing thousands of "old" documents and making them available via web.

Documents exist in various formats, some obsolete: (.doc, PageMaker, hardcopy (OCR), PDF, etc.). Funds are available to migrate the documents into a 'modern' format, and many of the hardcopies have already been OCR'd into PDFs - we had originally assumed that PDF would be the final format but we're open to suggestions (XML?).

Once all docs are in a common format we would like to make their contents available and searchable via a web interface. We'd like the flexibility to return only portions (pages?) of the entire document where a search 'hit' is found (I believe Lucene/elasticsearch makes this possible?!?) Might it be more flexible if content was all XML? If so how/where to store the XML? Directly in database, or as discrete files in the filesystem? What about embedded images/graphs in the documents?

Curious how others might approach this. There is no "wrong" answer; I'm just looking for as much input as possible to help us proceed.

Thanks for your advice.

Recommended answer

In summary: I'm going to be recommending ElasticSearch, but let's break the problem down and talk about how to implement it:

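There are several parts to this:

  1. Extracting the text from your docs to make them indexable
  2. Making this text available as full text search
  3. Returning highlighted snippets of the doc
  4. Knowing where in the doc those snippets are found to allow for paging
  5. Returning the full doc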

What can ElasticSearch provide:

  1. ElasticSearch (like Solr) uses Tika to extract text and metadata from a wide variety of doc formats
  2. It, pretty obviously, provides powerful full text search. It can be configured to analyse each doc in the appropriate language, with stemming, boosting the relevance of certain fields (e.g. title more important than content), ngrams etc., i.e. standard Lucene stuff
  3. It can return highlighted snippets for each search result
  4. It DOESN'T know where those snippets occur in your doc
  5. It can store the original doc as an attachment, or it can store and return the extracted text. But it'll return the whole doc, not a page.

You could just send the whole doc to ElasticSearch as an attachment, and you'd get full text search. But the sticking points are (4) and (5) above: knowing where you are in a doc, and returning parts of a doc.

Storing individual pages is probably sufficient for your where-am-I purposes (although you could equally go down to paragraph level), but you want them grouped in a way that a doc would be returned in the search results, even if search keywords appear on different pages.

First the indexing part: storing your docs in ElasticSearch:

  1. Use Tika (or whatever you're comfortable with) to extract the text from each doc. Leave it as plain text, or as HTML to preserve some formatting. (forget about XML, no need for it).
  2. Also extract the metadata for each doc: title, authors, chapters, language, dates etc
  3. Store the original doc in your filesystem, and record the path so that you can serve it later
  4. In ElasticSearch, index a "doc" doc which contains all of the metadata, and possibly the list of chapters
  5. Index each page as a "page" doc, which contains:

  • A parent field which contains the ID of the "doc" doc (see "Parent-child relationship" below)
  • The text
  • The page number
  • Maybe the chapter title or number
  • Any metadata which you want to be searchable
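
The layout described above could be prepared along these lines. This is a minimal sketch rather than the answer's own code: it assumes the page text has already been extracted (for example with Tika), and the field names `doc_id`, `page`, `text`, `_parent`, and `path`, along with the helper `build_documents`, are illustrative choices and not anything ElasticSearch requires.

```python
from typing import Dict, List, Tuple


def build_documents(doc_id: int, metadata: Dict[str, str], path: str,
                    pages: List[str]) -> Tuple[dict, List[dict]]:
    """Return one parent "doc" record and one "page" record per page.

    Hypothetical helper: all field names are assumptions for illustration.
    """
    # Parent record: metadata plus the filesystem path of the original file.
    parent = {"_id": str(doc_id), "path": path,
              "page_count": len(pages), **metadata}

    children = []
    for number, text in enumerate(pages, start=1):
        children.append({
            "_id": f"{doc_id}_{number}",  # e.g. 123_2, allows direct retrieval
            "_parent": str(doc_id),       # ties the page to its parent "doc"
            "doc_id": doc_id,
            "page": number,
            "text": text,
            # Copy any metadata that should be searchable at page level.
            "title": metadata.get("title", ""),
        })
    return parent, children
```

Sending these records to the index is left to whichever client you use; the point is only that every page carries enough information to be grouped back under its parent.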

Now for searching. How you do this depends on how you want to present your results - by page, or grouped by doc.

Results by page are easy. This query returns a list of matching pages (each page is returned in full) plus a list of highlighted snippets from the page:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
{
   "query" : {
      "text" : {
         "text" : "interesting keywords"
      }
   },
   "highlight" : {
      "fields" : {
         "text" : {}
      }
   }
}
'

Displaying results grouped by "doc" with highlights from the text is a bit trickier. It can't be done with a single query, but a little client side grouping will get you there. One approach might be:

Step 1: Do a top-children-query to find the parent ("doc") whose children ("page") best match the query:

curl -XGET 'http://127.0.0.1:9200/my_index/doc/_search?pretty=1'  -d '
{
   "query" : {
      "top_children" : {
         "query" : {
            "text" : {
               "text" : "interesting keywords"
            }
         },
         "score" : "sum",
         "type" : "page",
         "factor" : "5"
      }
   }
}
'

Step 2: Collect the "doc" IDs from the above query and issue a new query to get the snippets from the matching "page" docs:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
{
   "query" : {
      "filtered" : {
         "query" : {
            "text" : {
               "text" : "interesting keywords"
            }
         },
         "filter" : {
            "terms" : {
               "doc_id" : [ 1, 2, 3 ]
            }
         }
      }
   },
   "highlight" : {
      "fields" : {
         "text" : {}
      }
   }
}
'
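
Gluing Step 1 to Step 2 might look like the following sketch. It assumes the Step 1 response has the usual `hits.hits[]._id` shape and reuses the request structure shown above as-is; `page_query_for_docs` is a hypothetical helper name, and actually sending the request is left to your HTTP client.

```python
from typing import List


def page_query_for_docs(doc_response: dict, keywords: str) -> dict:
    """Build the Step 2 request body from a Step 1 search response."""
    # Collect the IDs of the parent "doc" hits returned by Step 1.
    doc_ids: List[str] = [hit["_id"] for hit in doc_response["hits"]["hits"]]
    return {
        "query": {
            "filtered": {
                # Same keyword query as before, so pages are scored and highlighted.
                "query": {"text": {"text": keywords}},
                # Restrict the search to pages belonging to the matched docs.
                "filter": {"terms": {"doc_id": doc_ids}},
            }
        },
        "highlight": {"fields": {"text": {}}},
    }
```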

Step 3: In your app, group the results from the above query by doc and display them.
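
The client-side grouping in Step 3 can be sketched as below. The assumptions here are that each page hit carries `doc_id` and `page` in its `_source`, that snippets arrive under `highlight.text`, and that hits are already in descending score order; `group_hits_by_doc` is an illustrative name.

```python
from typing import Dict, List


def group_hits_by_doc(page_response: dict) -> Dict[int, List[dict]]:
    """Group page hits under their parent doc, keeping the hit order."""
    grouped: Dict[int, List[dict]] = {}
    for hit in page_response["hits"]["hits"]:
        source = hit["_source"]
        # Dicts keep insertion order, so docs appear in order of their best page.
        grouped.setdefault(source["doc_id"], []).append({
            "page": source["page"],
            "score": hit["_score"],
            # A hit may lack highlights; fall back to an empty list.
            "snippets": hit.get("highlight", {}).get("text", []),
        })
    return grouped
```

Each key can then be rendered as one result entry, with its matching pages and snippets listed underneath.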

With the search results from the second query, you already have the full text of the page which you can display. To move to the next page, you can just search for it:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
{
   "query" : {
      "constant_score" : {
         "filter" : {
            "and" : [
               {
                  "term" : {
                     "doc_id" : 1
                  }
               },
               {
                  "term" : {
                     "page" : 2
                  }
               }
            ]
         }
      }
   },
   "size" : 1
}
'

Alternatively, give the "page" docs an ID of the form $doc_id_$page_num (e.g. 123_2); then you can retrieve that page directly:

curl -XGET 'http://127.0.0.1:9200/my_index/page/123_2'
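
With IDs of that form, previous/next navigation reduces to a little string arithmetic. A small sketch, assuming the doc ID and page number are joined by an underscore as in 123_2, and that the app knows the page count (for instance from the parent "doc"); `neighbour_page_ids` is a hypothetical helper.

```python
from typing import Optional, Tuple


def neighbour_page_ids(page_id: str,
                       page_count: int) -> Tuple[Optional[str], Optional[str]]:
    """Return (previous_id, next_id) for a page ID such as "123_2"."""
    # Split on the last underscore so doc IDs containing "_" still work.
    doc_id, _, number = page_id.rpartition("_")
    page = int(number)
    previous_id = f"{doc_id}_{page - 1}" if page > 1 else None
    next_id = f"{doc_id}_{page + 1}" if page < page_count else None
    return previous_id, next_id
```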

Parent-child relationship:

Normally, in ES (and most NoSQL solutions) each doc/object is independent - there are no real relationships. By establishing a parent-child relationship between the "doc" and the "page", ElasticSearch makes sure that the child docs (ie the "page") are stored on the same shard as the parent doc (the "doc").

This enables you to run the top-children-query which will find the best matching "doc" based on the content of the "pages".
