Best practices for searchable archive of thousands of documents (pdf and/or xml)


Problem Description





Revisiting a stalled project and looking for advice on modernizing thousands of "old" documents and making them available via the web.

Documents exist in various formats, some obsolete (.doc, PageMaker, hardcopy (OCR), PDF, etc.). Funds are available to migrate the documents into a 'modern' format, and many of the hardcopies have already been OCR'd into PDFs - we had originally assumed that PDF would be the final format but we're open to suggestions (XML?).

Once all docs are in a common format we would like to make their contents available and searchable via a web interface. We'd like the flexibility to return only portions (pages?) of the entire document where a search 'hit' is found (I believe Lucene/elasticsearch makes this possible?!?) Might it be more flexible if content was all XML? If so how/where to store the XML? Directly in database, or as discrete files in the filesystem? What about embedded images/graphs in the documents?

Curious how others might approach this. There is no "wrong" answer I'm just looking for as many inputs as possible to help us proceed.

Thanks for any advice.

Solution

In summary: I'm going to be recommending ElasticSearch, but let's break the problem down and talk about how to implement it:

There are a few parts to this:

  1. Extracting the text from your docs to make them indexable
  2. Making this text available as full text search
  3. Returning highlighted snippets of the doc
  4. Knowing where in the doc those snippets are found to allow for paging
  5. Returning the full doc

What can ElasticSearch provide:

  1. ElasticSearch (like Solr) uses Tika to extract text and metadata from a wide variety of doc formats
  2. It, pretty obviously, provides powerful full text search. It can be configured to analyse each doc in the appropriate language, with stemming, boosting of the relevance of certain fields (eg title more important than content), ngrams etc, ie standard Lucene stuff (a mapping sketch follows this list)
  3. It can return highlighted snippets for each search result
  4. It DOESN'T know where those snippets occur in your doc
  5. It can store the original doc as an attachment, or it can store and return the extracted text. But it'll return the whole doc, not a page.
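
On point 2, a minimal sketch of what that kind of configuration might look like in a mapping (the english analyzer and the boost value are illustrative choices, not recommendations):

# Analyse the title with the built-in "english" analyzer and give it an
# index-time boost so matches there count for more than matches in the body:
curl -XPUT 'http://127.0.0.1:9200/my_index/doc/_mapping' -d '
{
   "doc" : {
      "properties" : {
         "title" : { "type" : "string", "analyzer" : "english", "boost" : 2.0 }
      }
   }
}
'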

You could just send the whole doc to ElasticSearch as an attachment, and you'd get full text search. But the sticking points are (4) and (5) above: knowing where you are in a doc, and returning parts of a doc.

Storing individual pages is probably sufficient for your where-am-I purposes (although you could equally go down to paragraph level), but you want them grouped in a way that a doc would be returned in the search results, even if search keywords appear on different pages.

First the indexing part: storing your docs in ElasticSearch:

  1. Use Tika (or whatever you're comfortable with) to extract the text from each doc (see the extraction sketch after this list). Leave it as plain text, or as HTML to preserve some formatting. (Forget about XML, no need for it.)
  2. Also extract the metadata for each doc: title, authors, chapters, language, dates etc
  3. Store the original doc in your filesystem, and record the path so that you can serve it later
  4. In ElasticSearch, index a "doc" doc which contains all of the metadata, and possibly the list of chapters
  5. Index each page as a "page" doc (see the indexing sketch after this list), which contains:

    • A parent field which contains the ID of the "doc" doc (see "Parent-child relationship" below)
    • The text
    • The page number
    • Maybe the chapter title or number
    • Any metadata which you want to be searchable
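
For step 1, the standalone tika-app jar can be driven from the command line. A minimal extraction sketch (the jar version and file names are just examples):

# Extract plain text:
java -jar tika-app-1.4.jar --text mydoc.pdf > mydoc.txt

# Or keep some formatting as HTML:
java -jar tika-app-1.4.jar --html mydoc.pdf > mydoc.html

# Extract the metadata (title, authors, dates etc):
java -jar tika-app-1.4.jar --metadata mydoc.pdf

For steps 4 and 5, the indexing calls might look something like the sketch below. The IDs, field values and the 123_2 ID scheme are made up for illustration, and the _parent mapping that makes the parent parameter work is described under "Parent-child relationship" below:

# Index the "doc" doc with its metadata and the path to the original file:
curl -XPUT 'http://127.0.0.1:9200/my_index/doc/123' -d '
{
   "title" : "Annual Report",
   "authors" : ["A. Smith"],
   "language" : "en",
   "path" : "/archive/originals/annual_report.pdf"
}
'

# Index each page as a "page" doc. The parent parameter ties it to doc 123
# (and routes it to the same shard), and doc_id is repeated as an ordinary
# field so that it can be used in the terms filter shown further down:
curl -XPUT 'http://127.0.0.1:9200/my_index/page/123_2?parent=123' -d '
{
   "doc_id" : 123,
   "page" : 2,
   "chapter" : "Introduction",
   "text" : "...the extracted text of page two..."
}
'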

Now for searching. How you do this depends on how you want to present your results - by page, or grouped by doc.

Results by page are easy. This query returns a list of matching pages (each page is returned in full) plus a list of highlighted snippets from the page:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
{
   "query" : {
      "text" : {
         "text" : "interesting keywords"
      }
   },
   "highlight" : {
      "fields" : {
         "text" : {}
      }
   }
}
'

Displaying results grouped by "doc" with highlights from the text is a bit trickier. It can't be done with a single query, but a little client side grouping will get you there. One approach might be:

Step 1: Do a top-children-query to find the parent ("doc") whose children ("page") best match the query:

curl -XGET 'http://127.0.0.1:9200/my_index/doc/_search?pretty=1'  -d '
{
   "query" : {
      "top_children" : {
         "query" : {
            "text" : {
               "text" : "interesting keywords"
            }
         },
         "score" : "sum",
         "type" : "page",
         "factor" : "5"
      }
   }
}
'

Step 2: Collect the "doc" IDs from the above query and issue a new query to get the snippets from the matching "page" docs:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
{
   "query" : {
      "filtered" : {
         "query" : {
            "text" : {
               "text" : "interesting keywords"
            }
         },
         "filter" : {
            "terms" : {
               "doc_id" : [ 1,2,3],
            }
         }
      }
   },
   "highlight" : {
      "fields" : {
         "text" : {}
      }
   }
}
'

Step 3: In your app, group the results from the above query by doc and display them.
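
A sketch of that client-side grouping, done with nothing more than curl and jq (this assumes jq is installed and that doc_id is stored in each page's _source, as in the indexing sketch above):

# Rerun the step 2 search and group the hits by their doc_id:
curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1' -d '
{
   "query" : {
      "filtered" : {
         "query" : { "text" : { "text" : "interesting keywords" } },
         "filter" : { "terms" : { "doc_id" : [1, 2, 3] } }
      }
   }
}
' | jq '.hits.hits | group_by(._source.doc_id)'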

With the search results from the second query, you already have the full text of the page which you can display. To move to the next page, you can just search for it:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
{
   "query" : {
      "constant_score" : {
         "filter" : {
            "and" : [
               {
                  "term" : {
                     "doc_id" : 1
                  }
               },
               {
                  "term" : {
                     "page" : 2
                  }
               }
            ]
         }
      }
   },
   "size" : 1
}
'

Or alternatively, give the "page" docs an ID consisting of ${doc_id}_${page_num} (eg 123_2) then you can just retrieve that page:

curl -XGET 'http://127.0.0.1:9200/my_index/page/123_2'

Parent-child relationship:

Normally, in ES (and most NoSQL solutions) each doc/object is independent - there are no real relationships. By establishing a parent-child relationship between the "doc" and the "page", ElasticSearch makes sure that the child docs (ie the "page") are stored on the same shard as the parent doc (the "doc").

This enables you to run the top-children-query which will find the best matching "doc" based on the content of the "pages".
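
Declaring that relationship is a one-line _parent entry in the "page" mapping, which needs to be in place before any pages are indexed; a minimal sketch:

curl -XPUT 'http://127.0.0.1:9200/my_index/page/_mapping' -d '
{
   "page" : {
      "_parent" : { "type" : "doc" }
   }
}
'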
