在(解析)树的集合中查找最频繁的子树 [英] Finding the most frequent subtrees in a collection of (parse) trees

查看：16 发布时间：2022/1/5 18:54:59 algorithm tree subtree

本文介绍了在(解析)树的集合中查找最频繁的子树的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个树的集合，它们的节点被标记(但不是唯一的).具体来说，这些树来自一组已解析的句子(参见 http://en.wikipedia.org/wiki/树库).我希望从集合中提取最常见的子树 - 性能(还)不是问题.我会很感激算法(最好是 Java)或指向为树库执行此操作的工具的指针.请注意，子节点的顺序很重要.

I have a collection of trees whose nodes are labelled (but not uniquely). Specifically the trees are from a collection of parsed sentences (see http://en.wikipedia.org/wiki/Treebank). I wish to extract the most common subtrees from the collection - performance is not (yet) an issue. I'd be grateful for algorithms (ideally Java) or pointers to tools which do this for treebanks. Note that order of child nodes is important.

编辑 @mjv.我们在一个有限的领域(化学)工作，它有一种风格化的语言，所以树的种类并不多 - 可能类似于儿童的读者.猫坐在垫子上"的简单树.

EDIT @mjv. We are working in a limited domain (chemistry) which has a stylised language so the varirty of the trees is not huge - probably similar to children's readers. Simple tree for "the cat sat on the mat".

<sentence>
  <nounPhrase>
    <article/>
    <noun/>
  </nounPhrase>
  <verbPhrase>
    <verb/>
    <prepositionPhrase>
      <preposition/>
      <nounPhrase>
        <article/>
        <noun/>
      </nounPhrase>
    </prepositionPhrase>
  </verbPhrase>
</sentence>

这里的句子包含两个相同的词性子树(实际标记cat".mat"在匹配中并不重要).所以算法需要检测到这一点.请注意，并非所有名词短语都是相同的——大黑猫"可能是:

Here the sentence contains two identical part-of-speech subtrees (the actual tokens "cat". "mat" are not important in matching). So the algorithm would need to detect this. Note that not all nounPhrases are identical - "the big black cat" could be:

      <nounPhrase>
        <article/>
        <adjective/>
        <adjective/>
        <noun/>
      </nounPhrase>

句子的长度会更长——在 15 到 30 个节点之间.我希望从 1000 棵树中得到有用的结果.如果这不超过一天左右，那是可以接受的.

The length of sentences will be longer - between 15 to 30 nodes. I would expect to get useful results from 1000 trees. If this does not take more than a day or so that's acceptable.

显然树越短越频繁，所以 nounPhrase 会很常见.

Obviously the shorter the tree the more frequent, so nounPhrase will be very common.

EDIT 如果要通过展平树来解决这个问题，那么我认为它与最长公共子串有关，而不是与最长公共序列有关.但请注意，我不一定只想要最长的 - 我想要一个足够长的列表以有趣"(标准尚未确定).

EDIT If this is to be solved by flattening the tree then I think it would be related to Longest Common Substring, not Longest Common Sequence. But note that I don't necessarily just want the longest - I want a list of all those long enough to be "interesting" (criterion yet to be decided).

在(解析)树的集合中查找最频繁的子树 [英] Finding the most frequent subtrees in a collection of (parse) trees

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

在(解析)树的集合中查找最频繁的子树 [英] Finding the most frequent subtrees in a collection of (parse) trees

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭