Counting all unique words in an unstructured document using index data

Question

I've loaded unstructured HTML documents into Marklogic and, for any given document URI, I need a way to use indexes/lexicons to provide a word count for all unique words.

For example, say I have the file below, saved under the URI "/html/example.html":

<html>
<head><title>EXAMPLE</title></head>
<body>
<h1>This is a header</h1>
<div class="highlight">This word is highlighted</div>
<p> And these words are inside a paragraph tag</p>
</body>
</html>

In XQuery, I'd call my function by passing in the URI and get the following results:

EXAMPLE 1
This 2
is 2
a 2
header 1
word 1
highlighted 1
And 1
these 1
words 1
are 1
inside 1
paragraph 1
tag 1

Note that I only need a word count on words inside of tags, not on the tags themselves.

Is there any way to do this efficiently (using index or lexicon data)?

Thanks

Answer

You're asking for word counts "for any given document URI". But you are assuming that the solution involves indexes or lexicons, and that's not necessarily a good assumption. If you want something document-specific from a document-oriented database, it's often best to work on the document directly.

So let's focus on an efficient word-count solution for a single document, and go from there. OK?

Here's how we could get word counts for a single element, including any children. This could be the root of your document: doc($uri)/*.

declare function local:word-count($root as element())
as map:map
{
  let $m := map:map()
  (: Tokenize every text node, keep only the word tokens,
     and increment each word's count in the map. :)
  let $_ := cts:tokenize(
    $root//text())[. instance of cts:word]
    ! map:put($m, ., 1 + (map:get($m, .), 0)[1])
  return $m
};

This produces a map, which I find more flexible than flat text. Each key is a word, and the value is the count. The variable $doc already contains your sample XML.
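
For completeness, $doc might be bound something like this, assuming the sample file was loaded at the URI from the question:

(: Assumes the example document was stored at "/html/example.html". :)
declare variable $doc := doc("/html/example.html")/*;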

let $m := local:word-count($doc)
for $k in map:keys($m)
return text { $k, map:get($m, $k) }

inside 1
This 2
is 2
paragraph 1
highlighted 1
EXAMPLE 1
header 1
are 1
word 1
words 1
these 1
tag 1
And 1
a 2

Note that the order of the map keys is indeterminate. Add an order by clause if you like.

let $m := local:word-count($doc)
for $k in map:keys($m)
let $v := map:get($m, $k)
order by $v descending
return text { $k, $v }

If you want to query the entire database, Geert's solution using cts:words might look pretty good. It uses a lexicon for the word list, and some index lookups for word matching. But it will end up walking the XML for every matching document for every word-lexicon word: O(nm). To do that properly the code will have to do work similar to what local:word-count does, but for one word at a time. Many words will match the same documents: 'the' might be in A and B, and 'then' might also be in A and B. Despite using lexicons and indexes, usually this approach will be slower than simply applying local:word-count to the whole database.
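
To illustrate, a lexicon-driven approach along those lines might look roughly like this sketch. This is an assumed shape, not Geert's actual code: for each lexicon word that appears in the document, the document's tokens get scanned again to count occurrences, which is where the repeated work comes from.

let $uri := "/html/example.html"
(: Word tokens for the document; conceptually this work is repeated per word. :)
let $tokens := cts:tokenize(doc($uri)//text())[. instance of cts:word]
(: cts:words constrained by a document query returns lexicon words found in that document. :)
for $word in cts:words((), (), cts:document-query($uri))
return text { $word, count($tokens[. eq $word]) }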

If you want to query the entire database and are willing to change the XML, you could wrap every word in a word element (or whatever element name you prefer). Then create an element range index of type string on word. Now you can use cts:values and cts:frequency to pull the answer directly from the range index. This will be O(n) with a much lower cost than the cts:words approach, and probably faster than local:word-count, because it won't visit any documents at all. But the resulting XML is pretty clumsy.
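
With that markup in place, the query might look something like this sketch, assuming the hypothetical word element and a string element range index on it; the "item-frequency" option makes cts:frequency report occurrence counts rather than fragment counts:

let $uri := "/html/example.html"
(: Pull values and their frequencies straight from the range index on <word>,
   restricted to one document. :)
for $w in cts:values(
  cts:element-reference(xs:QName("word")),
  (), "item-frequency", cts:document-query($uri))
return text { $w, cts:frequency($w) }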

Let's go back and apply local:word-count to the whole database. Start by tweaking the code so that the caller supplies the map. That way we can build up a single map that has word counts for the whole database, and we only look at each document once.

declare function local:word-count(
  $m as map:map,
  $root as element())
as map:map
{
  let $_ := cts:tokenize(
    $root//text())[. instance of cts:word]
    ! map:put($m, ., 1 + (map:get($m, .), 0)[1])
  return $m
};

let $m := map:map()
let $_ := local:word-count($m, collection()/*)
for $k in map:keys($m)
let $v := map:get($m, $k)
order by $v descending
return text { $k, $v }

On my laptop this processed 151 documents in less than 100 ms. There were about 8100 words and 925 distinct words. Getting the same results from cts:words and cts:search took just under 1 second. So local:word-count is more efficient, and probably efficient enough for this job.

Now that you can build a word-count map efficiently, what if you could save it? In essence, you'd build your own "index" of word counts. This is easy, because maps have an XML serialization.

(: Construct a map. :)
map:map()
(: The document constructor creates a document-node with XML inside. :)
! document { . }
(: Construct a map from the XML root element. :)
! map:map(*)

So you could call local:word-count on each new XML document as it's inserted or updated. Then store the word-count map in the document's properties. Do this using a CPF pipeline, or using your own code via RecordLoader, or in a REST upload endpoint, etc.
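
As a sketch of that property route (not spelled out above), something like this could run whenever a document is inserted; the property name wc:word-counts and its namespace are made up for the example, and the counting logic is inlined so the snippet stands alone:

declare namespace wc = "http://example.com/word-counts";

let $uri := "/html/example.html"
let $m := map:map()
(: Same counting logic as local:word-count, inlined here. :)
let $_ := cts:tokenize(doc($uri)//text())[. instance of cts:word]
  ! map:put($m, ., 1 + (map:get($m, .), 0)[1])
(: Serialize the map to XML and attach it to the document's properties. :)
return xdmp:document-set-property(
  $uri,
  element wc:word-counts { document { $m }/* })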

When you want word counts for a single document, that's just a call to xdmp:document-properties or xdmp:document-get-properties, then call the map:map constructor on the right XML. If you want word counts for multiple documents, you can easily write XQuery to merge those maps into a single result.
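
For instance, a merge over several documents might look like this sketch, again assuming the wc:word-counts property from the previous snippet and a couple of made-up URIs:

declare namespace wc = "http://example.com/word-counts";

let $total := map:map()
let $_ :=
  for $uri in ("/html/example.html", "/html/example2.html")
  (: Rebuild each stored map from its XML serialization, if the property exists. :)
  let $stored := xdmp:document-properties($uri)//wc:word-counts/map:map
  let $m := if (exists($stored)) then map:map($stored) else map:map()
  for $k in map:keys($m)
  return map:put($total, $k,
    map:get($m, $k) + (map:get($total, $k), 0)[1])
for $k in map:keys($total)
order by map:get($total, $k) descending
return text { $k, map:get($total, $k) }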
