How to split text without spaces into list of words


Problem description

输入:"tableapplechairtablecupboard ..." 许多单词

将此类文本拆分为单词列表并获得以下内容的有效算法是什么?

输出: [桌子",苹果",椅子",桌子",[碗柜",[杯",板"]],...]

首先想到的是遍历所有可能的单词(从第一个字母开始)并找到可能的最长单词,然后从 position = word_position + len(word)

继续

P.S.
我们列出了所有可能的单词.
单词"cupboard"可以是"cup"和"board",请选择最长.
语言:python,但主要是算法本身.

解决方案

将朴素的算法应用于现实世界的数据时,效果不佳.这是一条20行算法,该算法利用相对词频来为实词文本提供准确的结果.

(如果您想回答不使用单词频率的原始问题的答案,则需要完善最长单词"的确切含义:最好有20个字母的单词和10个3个字母的单词个字母的单词,或者最好有五个10个字母的单词?确定精确的定义后,只需更改定义 wordcost 的行以反映预期的含义.)

想法

最好的处理方法是建模输出的分布.一个很好的第一近似是假设所有单词都是独立分布的.然后,您只需要知道所有单词的相对频率即可.可以合理地假设它们遵循Zipf定律,即单词列表中排名为 n 的单词的概率大约为1/( n log N ),其中 N 是词典中的单词数.

一旦固定了模型,就可以使用动态编程来推断空间的位置.最有可能的句子是使每个单词的概率乘积最大化的句子,并且通过动态编程可以很容易地计算出该句子.代替直接使用概率,我们使用定义为概率倒数的对数的代价来避免溢出.

代码

来自数学导入日志的

 #假设Zipf定律和cost = -math.log(probability),构建成本字典.words = open("words-by-frequency.txt").read().split()wordcost = dict((k,log((i + 1)* log(len(words))))for i,k inumerate(words))maxword = max(len(x)for x in words)def infer_spaces(s):"使用动态编程来推断字符串中空格的位置没有空格."#假设费用为,找到第i个前一个字符的最佳匹配#是为i-1的第一个字符构建的.#返回一对(match_cost,match_length).def best_match(i):候选人=枚举(reversed(cost [max(0,i-maxword):i]))为候选人中的k,c返回min((c + wordcost.get(s [i-k-1:i],9e999),k + 1))#构建成本数组.费用= [0]对于范围(1,len(s)+1)中的i:c,k = best_match(i)cost.append(c)#回溯以恢复最小开销的字符串.出= []i = len而我> 0:c,k = best_match(i)断言c == cost [i]out.append(s [i-k:i])i-= k返回" .join(reversed(out)) 

可以与之配合使用

  s ='thumbgreenappleactiveassignmentweeklymetaphor'打印(infer_spaces(s)) 


结果

我正在使用这本速写的125k单词词典,它是从一小部分Wikipedia中拼凑而成的.

之前:拇指绿苹果活动分配每周简报.
之后:拇指绿苹果活跃作业每周的隐喻.

之前:存在于html上的人们评价的软扩展信息ode字符限制,例如拇指绿色苹果活动分配每周metapho明显地,在这些字符串中,以英语为母语的绿色苹果等查询这个词是否是最快的提取方法.

之后:有大量从html解析的人民评论文本信息,但是其中没有定界字符,例如,拇指绿苹果活动任务每周隐喻显然有拇指绿苹果等在字符串中,我还有一个大词典来查询单词是否合理,所以提取很多最快的方法是什么.

之前:在暴风雨之夜,暴雨暴雨暴雨,除了偶然的间隔,当被狂风吹过的一阵阵风席卷了整个伦敦的炎炎季节之后,我们的目光就散布在房顶上,并费劲地挑起了灯火似的似的.

此后:那是一个黑暗而暴风雨的夜晚,雨水倾盆而下,除了偶尔被狂风吹过而横扫街道的时候,因为伦敦是伦敦场景躺在房顶上嘎嘎作响,剧烈地搅动着与黑暗搏斗的灯火.

如您所见,它本质上是完美的.最重要的部分是确保您的单词列表经过训练后的语料库与实际遇到的语料库相似,否则结果将非常糟糕.


优化

该实现消耗线性时间和内存,因此它是相当有效的.如果需要进一步的提速,可以从单词列表中构建后缀树,以减少候选集的大小.

如果需要处理非常大的连续字符串,则合理地分割字符串以避免过多的内存使用.例如,您可以按10000个字符的块处理文本,并在每侧加1000个字符的边距,以避免边界效应.这样可以将内存使用率降到最低,并且几乎可以肯定不会对质量产生任何影响.

Input: "tableapplechairtablecupboard..." many words

What would be an efficient algorithm to split such text into a list of words and get:

Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]

The first thing that comes to mind is to go through all possible words (starting with the first letter), find the longest possible word, and then continue from position = word_position + len(word).
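For reference, that greedy longest-match idea might look like the sketch below (a hypothetical illustration, not from the original post; word_set stands for the list of all possible words mentioned in the P.S. that follows):

def greedy_split(s, word_set, maxword):
    # Hypothetical sketch of the greedy approach from the question:
    # at each position take the longest dictionary word that matches.
    out = []
    position = 0
    while position < len(s):
        match = None
        for length in range(min(maxword, len(s) - position), 0, -1):
            candidate = s[position:position + length]
            if candidate in word_set:
                match = candidate
                break
        if match is None:
            return None  # no dictionary word matches here; the greedy approach is stuck
        out.append(match)
        position += len(match)
    return out

The solution below explains why this greedy choice often commits to the wrong word on real text.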

P.S.
We have a list of all possible words.
The word "cupboard" can be "cup" and "board"; select the longest.
Language: Python, but the main thing is the algorithm itself.

Solution

A naive algorithm won't give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-world text.

(If you want an answer to your original question which does not use word frequency, you need to refine what exactly is meant by "longest word": is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.)
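For instance, one hypothetical frequency-free choice (an assumption, not part of the original answer) that prefers five 10-letter words over one 20-letter word plus ten 3-letter words is to make a word's cost shrink with its length:

# Hypothetical frequency-free alternative: reward long words by making the
# cost of a word the reciprocal of its length.  The dynamic programme below
# then favours splits that use fewer, longer words.
wordcost = dict((k, 1.0 / len(k)) for k in words)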

The idea

The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.
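Concretely, with cost = -log(probability), Zipf's law pins down each word's cost. The small sanity-check below (illustrative only) matches the wordcost line in the code further down, where the 0-based index i corresponds to rank n = i + 1:

from math import log

# Under Zipf's law the word of rank n has probability about 1/(n*log(N)),
# so its cost is -log(1/(n*log(N))) = log(n*log(N)).
def zipf_cost(rank, dictionary_size):
    return log(rank * log(dictionary_size))

# This is exactly what the line
#     wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
# computes, with rank = i + 1 and dictionary_size = len(words).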

Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probabilities of the individual words, and it is easy to compute with dynamic programming. Instead of using the probabilities directly, we use a cost defined as the logarithm of the inverse of the probability, so that products become sums and the numbers stay in a manageable range.
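Written out, if cost[i] denotes the minimal total cost of splitting the first i characters, the recurrence that the code below implements is

cost[0] = 0
cost[i] = min over 1 <= k <= maxword of ( cost[i-k] + wordcost(s[i-k:i]) )

where the cost of a substring that is not in the dictionary is effectively infinite (9e999 in the code), and the chosen word lengths are recovered afterwards by backtracking.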

The code

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

which you can use with

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))


The results

I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.
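If you want to build a similar file yourself, a minimal sketch (assuming a plain-text corpus in a file called corpus.txt, which is not part of the original answer) is to write the distinct words one per line, most frequent first, since that is the order the wordcost dictionary above relies on:

import re
from collections import Counter

# Count lowercase words in an arbitrary plain-text corpus (file name is an
# assumption) and write them one per line, most frequent first, i.e. the
# format expected for words-by-frequency.txt above.
text = open("corpus.txt", encoding="utf-8").read().lower()
counts = Counter(re.findall(r"[a-z]+", text))
with open("words-by-frequency.txt", "w", encoding="utf-8") as f:
    for word, _ in counts.most_common():
        f.write(word + "\n")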

Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.

Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.

After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.

Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.

After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.

As you can see, it is essentially flawless. The most important part is to make sure your word list was trained on a corpus similar to what you will actually encounter; otherwise the results will be very bad.


Optimization

The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.
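A minimal sketch of that idea, using a plain character trie over reversed words rather than a full suffix tree (the names build_reverse_trie and best_match_pruned are illustrative, the globals words, wordcost and maxword come from the code above, and the unknown-character fallback is simplified compared with the original best_match):

def build_reverse_trie(words):
    # Trie over reversed words: walking characters backwards from position i
    # tells us quickly whether any dictionary word can still end at i.
    root = {}
    for w in words:
        node = root
        for ch in reversed(w):
            node = node.setdefault(ch, {})
        node["$"] = True  # a complete word ends on this node
    return root

trie = build_reverse_trie(words)

def best_match_pruned(i, s, cost):
    # Fallback candidate: a single unknown character with a huge cost, so the
    # dynamic programme always has at least one option.
    best = (cost[i - 1] + 9e999, 1)
    node = trie
    for k in range(1, min(i, maxword) + 1):
        ch = s[i - k]
        if ch not in node:
            break  # no dictionary word has this suffix; stop scanning early
        node = node[ch]
        if "$" in node:  # s[i-k:i] is a dictionary word
            best = min(best, (cost[i - k] + wordcost[s[i - k:i]], k))
    return best

When only a handful of dictionary words end at a given position, the backward scan stops early instead of always examining maxword candidates.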

If you need to process a very large consecutive string, it would be reasonable to split the string to avoid excessive memory usage. For example, you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will almost certainly have no effect on the quality.
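A rough sketch of that chunking strategy (the helper name infer_spaces_chunked is an assumption; instead of a two-sided margin it keeps a lookahead margin and restarts each block exactly on a word boundary, which serves the same purpose):

def infer_spaces_chunked(s, block=10000, margin=1000):
    # Process s in overlapping blocks so memory stays bounded.
    out = []
    start = 0
    while start < len(s):
        end = min(len(s), start + block + margin)
        words_in_block = infer_spaces(s[start:end]).split()
        if end == len(s):
            out.extend(words_in_block)
            break
        # Words reaching into the margin may be split incorrectly, so keep
        # only the words that fit in the core block and redo the rest in the
        # next iteration.
        consumed = 0
        for w in words_in_block:
            if consumed + len(w) > block:
                break
            out.append(w)
            consumed += len(w)
        start += consumed
    return " ".join(out)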
