文本打包算法 [英] Text packing algorithm

查看:13
本文介绍了文本打包算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我敢打赌之前有人解决了这个问题,但我的搜索结果是空的.

I bet somebody has solved this before, but my searches have come up empty.

我想将一个单词列表打包到缓冲区中,跟踪每个单词的起始位置和长度.诀窍是我想通过消除冗余来有效地打包缓冲区.

I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.

示例:娃娃屋

这些可以像dollhouse一样简单地打包到缓冲区中,记住doll是从位置0开始的四个字母,dollhouse是九个字母在 0 处,house 在 3 处是五个字母.

These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 3.

到目前为止我想出的是:

What I've come up with so far is:

  1. 从最长到最短的单词排序:(娃娃屋、房子、娃娃)
  2. 扫描缓冲区以查看该字符串是否已作为子字符串存在,如果存在,请注意位置.
  3. 如果它不存在,请将其添加到缓冲区的末尾.

由于长词通常包含较短的词,因此效果很好,但应该可以做得更好.例如,如果我扩展单词列表以包含 ragdoll,那么我的算法会提出 dollhouseragdoll,它的效率低于 ragdollhouse.

Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.

这是一个预处理步骤,所以我不太担心速度.O(n^2) 没问题.另一方面,我的实际列表有数万个单词,所以 O(n!) 可能是不可能的.

This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.

作为旁注,此存储方案用于 TrueType 字体的 `name' 表中的数据,参见.http://www.microsoft.com/typography/otspec/name.htm

As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm

推荐答案

这是最短超字符串问题:找到包含一组给定字符串作为子字符串的最短字符串.根据这篇 IEEE 论文(不幸的是,您可能无法访问),解决这个问题正是NP-complete.但是,启发式解决方案是可用的.

This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.

作为第一步,您应该找到作为其他字符串的子字符串的所有字符串并删除它们(当然您仍然需要以某种方式记录它们相对于包含字符串的位置).使用广义后缀树可以有效地找到这些完全包含的字符串.

As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.

然后,通过重复合并具有最长重叠的两个字符串,您可以保证生成一个长度不小于最小可能长度的 4 倍的解决方案.正如 Zifre 在 上的评论所建议的那样,应该可以通过使用两个基数树来快速找到重叠大小康拉德·鲁道夫的回答.或者,您也许可以以某种方式使用广义后缀树.

Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.

很抱歉,我无法为您找到合适的链接——似乎没有维基百科页面,或关于此特定问题的任何可公开访问的信息.在此处简要提及,尽管没有建议提供解决方案.

I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.

这篇关于文本打包算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆