易DAWG创建的算法? [英] Easy DAWG creation algorithm?

查看:147
本文介绍了易DAWG创建的算法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要创建一个DAWG(http://en.wikipedia.org/wiki/Directed_acyclic_word_graph)结构为我的拼字游戏玩家给出一个文件中的单词列表。我使用的Java。我需要做的只有一次,然后将其存储在一个文件或文件。我到目前为止看到的方法2:1)建立一个特里和其降低到一个DAWG或2)建立DAWG的时候了。因为我需要做的只有一次我想我只是想最简单的算法来实现该做的。速度和内存的要求并不重要。

I need to create a DAWG (http://en.wikipedia.org/wiki/Directed_acyclic_word_graph) structure for my Scrabble player given the word list in a file. I'm using Java. I need to do it only once and then store it in a file or files. I've seen so far 2 approaches: 1) build a Trie and reduce it to a DAWG or 2) build a DAWG right away. Since I need to do it only once I guess I just want the easiest algorithm to implement that does it. Speed and memory requirements don't matter.

此外,我想知道我应该如何在内存中存储的结构在运行时,我应该如何保存在一个文件?该DAWG基本上是一个使用我写的一些非常简单的类的一些节点和边缘/指针建议的图表,但我看到了使用数组和偏移量(此数组)实现这看似复杂和难以辨认。这一次装载DAWG /使用DAWG我关心两个有关内存的大小(在运行时和保存的文件)和速度。

Also I want to know how should I store the structure in memory at runtime and how should I save it in a file? The DAWG is basically a graph which suggests using some nodes and edges/pointers of some very simple classes written by me but I saw implementations using array and offsets (in this array) which seems complicated and illegible. This time I care both about memory size (at runtime and of the saved file) and speed of loading the DAWG/using the DAWG.

推荐答案

我不得不实施用C这样的结构,我的一个客户的一次。最终的结构是从描述的字符集和耶一个XML文件中加载,另一个进程从一个单词列表创建的XML文件。

I had to implement such a structure in C for one of my client once. The final structure is loaded from an xml file describing the character set and the dawg, another process created the xml file from a word list.

我们使用的:

typedef struct _s_build_node build_node_t;
struct _s_build_node {
  char_t letter;
  build_node_t* first_child;
  build_node_t* right_sibling;

  hash_t hash;
  size_t depth;
  size_t ID;
};
typedef struct _s_build_dawg {
  charset_t charset;
  node_t* all_nodes; // an array of all the created nodes
  node_t* root;
} build_dawg_t;

siblibgs是有序的上升,结束单词的特殊字符小于任何其他字符。 该算法很简单:

siblibgs are ordered ascending, end-of-word special character is less than any other character. The algorithm is quite simple :

// create the build dawg
foreach word in wordlist
  insert(dawg, word)
// compact the dawg
compact(dawg)
// generate the xml file
xml_dump(dawg)

为了来压缩耶,我们计算每个节点的散列值。具有相同散列两个节点可以是因式分解的。这部分可能会非常棘手。只有具有最低深度的节点被保留,其余被删除和他们的父母现在都指向一个不停。
一旦压实我们分配一个唯一的ID到每个节点(通过BFS,身份证介于0和N-1,N是节点的压实耶数)。 XML文件简单地描述线索:

In order to compact the dawg, we computed a hash value for each node. Two nodes with the same hash can be factorized. This part can be tricky. Only the node with the lowest depth is kept, the others are deleted and their parents now point to the one kept.
Once compacted we assign a unique ID to each node (via bfs, ID are between 0 and N-1, N is the number of nodes in the compacted dawg). The xml file simply described the trie :

<dawg>
  <charset ....>
    ...
  </charset>

  <node ID="node_id" letter="letter" fist_child="first_child_ID" next_sibling="next_sibling_id" />
  <node .... />
  <node .... />
  <node .... />
</dawg>

步骤2:最终dagw

结构是一点点简单

step 2 : The final dagw

The structure is a little bit simpler

typedef struct {
  char_t letter;
  size_t first_child;
  size_t next_sibling;
} node_t;

typedef struct {
  node_t nodes[];
  ... whatever you need ...
} dawg_t;

下面根是dawg.nodes [0],和FIRST_CHILD / NEXT_SIBLING是在节点数组的索引。创建这样一个结构是从XML文件轻松。的主要缺点是任何单词一览修改触发新的xml文件的生成

Here root is dawg.nodes[0], and first_child/next_sibling is an index in the nodes array. Creating such a struct is easy from the xml file. The main drawback is that any wordlist modification triggers the generation of a new xml file.

这篇关于易DAWG创建的算法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆