How to extract literal words from a consecutive string efficiently?

Problem description

Possible Duplicate:
How to split text without spaces into list of words?

There are masses of text information in people's comments which are parsed from html, but there are no delimiting characters in them. For example: thumbgreenappleactiveassignmentweeklymetaphor. Apparently, there are 'thumb', 'green', 'apple', etc. in the string. I also have a large dictionary to query whether the word is reasonable. So, what's the fastest way to extract these words?

Solution

I'm not really sure a naive algorithm would serve your purpose well, as pointed out by eumiro, so I'll describe a slightly more complex one.

The idea

The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.
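
For intuition, here is a tiny numeric illustration of that assumption (a sketch only: zipf_probability and the 125000-word vocabulary size are made up for illustration and are not part of the original answer). The cost used in the code below is simply the negative logarithm of this probability.

import math

def zipf_probability(rank, vocab_size):
    """Probability of the word at 1-based `rank` under the Zipf assumption."""
    return 1.0 / (rank * math.log(vocab_size))

N = 125000                                   # e.g. a 125k-word dictionary
print(zipf_probability(1, N))                # ~0.085   (most frequent word)
print(zipf_probability(1000, N))             # ~8.5e-05 (1000th word)
print(-math.log(zipf_probability(1000, N)))  # ~9.37, the corresponding cost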

Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probabilities of the individual words, and it is easy to compute with dynamic programming. Instead of using the probabilities directly, we use a cost defined as the logarithm of the inverse of the probability, in order to avoid overflows.
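
Why a cost rather than the raw probability: a minimal numeric sketch (the numbers are purely illustrative) of how the product of many small probabilities underflows to zero, while the equivalent sum of log-inverse costs stays in a safe range.

import math

probs = [1e-5] * 100                        # 100 words, each with probability 1e-5
product = 1.0
for p in probs:
    product *= p
print(product)                              # 0.0 -- the product underflows

cost = sum(math.log(1.0 / p) for p in probs)
print(cost)                                 # ~1151.3 -- minimizing this sum is equivalent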

The code

import math

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k,math.log((i+1)*math.log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        # Substrings not in the dictionary get cost 9e999 (effectively infinity).
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)  # only the cost is stored; k is recomputed during backtracking
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

which you can use with

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))
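
The code only assumes that words-by-frequency.txt contains whitespace-separated words sorted from most to least frequent, so that a word's position in the file gives its Zipf rank. A tiny, made-up stand-in for experimenting (not the actual 125k-word list used below):

# Hypothetical miniature word list, ordered by assumed descending frequency.
sample_words = [
    "the", "of", "and", "a", "weekly",
    "thumb", "green", "apple", "active", "assignment", "metaphor",
]
with open("words-by-frequency.txt", "w") as f:
    f.write("\n".join(sample_words))

With a list this small, most real text will segment badly; it only demonstrates the expected file layout.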


Examples

I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.

Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.

Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.

After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.

Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.

After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.

As you can see, it is essentially flawless. The most important part is to make sure your word list was trained on a corpus similar to what you will actually encounter, otherwise the results will be very bad.


Optimization

The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.
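
One possible shape for that optimization (a sketch under my own naming, not the original answer's code): store the dictionary in a trie keyed on reversed words, so that best_match(i) only has to consider lengths k for which s[i-k:i] is actually a dictionary word.

def build_reversed_trie(wordcost):
    """Trie over reversed dictionary words; '$' marks a complete word and stores its cost."""
    trie = {}
    for word, cost in wordcost.items():
        node = trie
        for ch in reversed(word):
            node = node.setdefault(ch, {})
        node["$"] = cost
    return trie

def candidates_ending_at(trie, s, i):
    """Yield (cost, length) for every dictionary word that ends at position i of s."""
    node = trie
    k = 0
    while i - k > 0 and s[i - k - 1] in node:
        node = node[s[i - k - 1]]
        k += 1
        if "$" in node:
            yield node["$"], k

best_match(i) would then take the minimum over these pairs plus cost[i-k], falling back to a large penalty (say, one unknown character) when no dictionary word ends at position i.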

If you need to process a very large consecutive string, it would be reasonable to split the string to avoid excessive memory usage. For example, you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will almost certainly have no effect on the quality.
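
A rough sketch of that chunking strategy (infer_spaces_chunked is a hypothetical helper, not part of the original answer; block and margin sizes follow the suggestion above). Each block is segmented together with its margins, and only words that start inside the block proper are kept:

def infer_spaces_chunked(s, block=10000, margin=1000):
    """Segment a very long string block by block to bound memory usage."""
    out = []
    start = 0
    while start < len(s):
        end = min(len(s), start + block)
        lo = max(0, start - margin)
        hi = min(len(s), end + margin)
        words = infer_spaces(s[lo:hi]).split()
        # Keep only words whose first character lies inside [start, end);
        # the margins exist solely to stabilise segmentation near the edges.
        pos = lo
        for w in words:
            if start <= pos < end:
                out.append(w)
            pos += len(w)
        start = end
    return " ".join(out)

Apart from rare disagreements right at block boundaries, the output should match running infer_spaces on the whole string at once.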
