马尔可夫链文本生成 [英] Markov Chain Text Generation

查看:171
本文介绍了马尔可夫链文本生成的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们只是分配一个新的项目,我的数据结构类 - 生成文本马尔可夫链。

概述

由于输入的文本文件,我们创建了一个长度为n个字符的初始种子。我们添加到我们的输出字符串,然后选择基于频率的分析。

我们的下一个字符
  

这是猫,有两条狗。

 初始种子:TH
可能的下一个字母 -  I,E,E
因此,选择我的概率是1/3,e是2/3。

现在,假设我们选择我。我们增加了我的输出字符串。然后,我们的种子变成
喜并且处理继续。
 

我的解决方案

我有3个班,节点,ConcreteTrie和驱动程序

当然,ConcreteTrie类不是传统意义上的特里。下面是它如何工作的:

判刑,其中k = 2:

  

这是猫,有两条狗。

我产生结点TH,喜的是,... + ...,GS,S。 每个节点有孩子是它们后面的字母。例如,结点TH也有小孩我和电子邮件。我保持计数在每一个这些节点,这样我可以在以后产生的概率选择下一个字母。

我的问题:

所有的

一,什么是最有效的方式来完成这个项目?我的解决方案似乎是非常快的,但我真的想敲我的教授的袜子。 (在我的上一个项目的编辑距离问题的变种,我做了一个A *,遗传算法,BFS和模拟退火 - 我知道,这个问题是一个NP难)

二,什么是这项任务的意义呢?它并没有真正似乎涉及到很多东西我们已经介绍了在课堂上。什么是我们应该学习?

解决方案

在与你在课堂上覆盖(你的第二个问题)的这个任务的相关性。的想法'的数据结构的'类是让学生接触到非常多的结构CS经常遇到:列表,栈,队列,哈希,不同种类的树木,图表在逃,各矩阵信条和贪婪等,并提供一些洞察他们的共同实现,他们的长处和弱点,通常在不同的应用领域。
因为几乎所有的游戏/益智/问题可以被映射到某个集合这些结构中,也不乏主题后才能进行讲课和作业。 您的类看起来有趣,因为同时保留了一些专注于这些结构,你也有机会为发现实际应用
例如,在一个变相的时尚猫和两只狗的事情是介绍应用语言学统计模型。你的好奇心和动力促使你做出与马尔科夫模型之间的关系,这是一件好事,因为很可能你会遇到马氏毕业前几次;-),当然在职业生涯中的CS或相关领域。因此,是!可能的似乎的,你butterflying各地的许多应用程序等,但只要你得到的感觉是什么结构和算法在特定情况下选择,你不会浪费你的时间!

现在,一些提示就可能的方法来分配
线索好像对这类问题的一个自然的支持。也许你可以问问自己,但是这种方法将如何扩展,如果你有索引说一整本书,而不是这短短的一句话。似乎大多线性,虽然这取决于如何对三个跳在线索每个选择(对于该第二阶Markov链):作为选择的数量增加,选路径可能变得不那么有效
该指数的构建一套可行的替代存储是一个 stochatisc矩阵(实际上是一个'纯'如果只有稀疏矩阵,在收集过程中的统计数据,随机翻在结束的时候,你归各行-or column-根据您设置)来总结 - 达人(100%)。这种矩阵将大致729×28,并允许索引,在一个单人操作,一两个字母的元组及其相关下面这封信的。 (我得到了28,包括开始和停止信号,细节...)
这种更有效的索引的成本是使用额外的空间。 空格明智线索是非常有效的,只有存储信三胞胎切实存在的组合,矩阵但浪费了一些空间(你到底会很人烟稀少打赌,即使索引多后,更多的文本狗/猫之句。)
尺寸与CPU妥协是很常见的,但也有一些算法/结构somtimes比别人强在这两方面......此外,该矩阵的方法将无法扩展很好,大小wize,如果问题改为基地从preceding字母的选择说,三个大字。
尽管如此,也许考虑矩阵作为备用实现。这是非常这类精神为尝试不同的结构,看看为什么/他们在哪里比别人好(在特定任务的上下文中)。
一个小的一面之旅,你可以是创建一个 标签云基于概率的字母对(或三重):两个线索和基质包含所有必需的数据;基质与所有其有趣的性质,可能更适合于这一点。
玩得开心!

We were just assigned a new project in my data structures class -- Generating text with markov chains.

Overview

Given an input text file, we create an initial seed of length n characters. We add that to our output string and choose our next character based on frequency analysis..

This is the cat and there are two dogs.

Initial seed: "Th"
Possible next letters -- i, e, e
Therefore, probability of choosing i is 1/3, e is 2/3.

Now, say we choose i. We add "i" to the output string. Then our seed becomes
hi and the process continues.

My solution

I have 3 classes, Node, ConcreteTrie, and Driver

Of course, the ConcreteTrie class isn't a Trie of the traditional sense. Here is how it works:

Given the sentence with k=2:

This is the cat and there are two dogs.

I generate Nodes Th, hi, is, ... + ... , gs, s. Each of these nodes have children that are the letter that follows them. For example, Node Th would have children i and e. I maintain counts in each of those nodes so that I can later generate the probabilities for choosing the next letter.

My question:

First of all, what is the most efficient way to complete this project? My solution seems to be very fast, but I really want to knock my professor's socks off. (On my last project A variation of the Edit distance problem, I did an A*, a genetic algorithm, a BFS, and Simulated Annealing -- and I know that the problem is NP-Hard)

Second, what's the point of this assignment? It doesn't really seem to relate to much of what we've covered in class. What are we supposed to learn?

解决方案

On the relevance of this assignment with what you covered in class (Your second question). The idea of a 'data structures' class is to expose students to the very many structures frequently encountered in CS: lists, stacks, queues, hashes, trees of various types, graphs at large, matrices of various creed and greed, etc. and to provide some insight into their common implementations, their strengths and weaknesses and generally their various fields of application.
Since most any game / puzzle / problem can be mapped to some set of these structures, there is no lack of subjects upon which to base lectures and assignments. Your class seems interesting because while keeping some focus on these structures, you are also given a chance to discover real applications.
For example in a thinly disguised fashion the "cat and two dogs" thing is an introduction to statistical models applied to linguistics. Your curiosity and motivation prompted you to make the relation with markov models and it's a good thing, because chances are you'll meet "Markov" a few more times before graduation ;-) and certainly in a professional life in CS or related domain. So, yes! it may seem that you're butterflying around many applications etc. but so long as you get a feel for what structures and algorithms to select in particular situations, you're not wasting your time!

Now, a few hints on possible approaches to the assignment
The trie seems like a natural support for this type of problem. Maybe you can ask yourself however how this approach would scale, if you had to index say a whole book rather than this short sentence. It seems mostly linearly, although this depends on how each choice on the three hops in the trie (for this 2nd order Markov chain) : as the number of choices increase, picking a path may become less efficient.
A possible alternative storage for the building of the index is a stochatisc matrix (actually a 'plain' if only sparse matrix, during the statistics gathering process, turned stochastic at the end when you normalize each row -or column- depending on you set it up) to sum-up to one (100%). Such a matrix would be roughly 729 x 28, and would allow the indexing, in one single operation, of a two-letter tuple and its associated following letter. (I got 28 for including the "start" and "stop" signals, details...)
The cost of this more efficient indexing is the use of extra space. Space-wise the trie is very efficient, only storing the combinations of letter triplets effectively in existence, the matrix however wastes some space (you bet in the end it will be very sparsely populated, even after indexing much more text that the "dog/cat" sentence.)
This size vs. CPU compromise is very common, although some algorithms/structures are somtimes better than others on both counts... Furthermore the matrix approach wouldn't scale nicely, size-wize, if the problem was changed to base the choice of letters from the preceding say, three characters.
None the less, maybe look into the matrix as an alternate implementation. It is very much in spirit of this class to try various structures and see why/where they are better than others (in the context of a specific task).
A small side trip you can take is to create a tag cloud based on the probabilities of the letters pairs (or triplets): both the trie and the matrix contain all the data necessary for that; the matrix with all its interesting properties, may be more suited for this.
Have fun!

这篇关于马尔可夫链文本生成的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆