Fast way to break a joined string of words into individual words

Question

Say I have this string:

hellohowareyou

Is there a fast way to separate this into individual words, so that the end result is "hello how are you"? I can think of several ways, but they would be extremely slow: first I would need to check each run of letters against a dictionary to see which ones form words, there would probably be multiple valid combinations, and then I would need to decide which combination is most likely.

Recommended answer

Here's some code that does a recursive brute-force search. It puts the word list into a set, so the lookups are quite fast: the examples below run in less than 1 second on my old 2 GHz machine with 2 GB of RAM. However, splitting sequences longer than the examples I've used will certainly take longer, mostly because there are so many possible combinations. To weed out the meaningless results you will need to either do that manually or use software that can do natural language processing.

#!/usr/bin/env python3

''' Separate words

    Use dictionary lookups to recursively split a string into separate words

    See http://stackoverflow.com/q/41241216/4014959

    Written by PM 2Ring 2016.12.21
'''

# Sowpods wordlist from http://www.3zsoftware.com/download/

fname = 'scrabble_wordlist_sowpods.txt'
# Seed the set with the one-letter words 'A' and 'I': set('AI') == {'A', 'I'}
allwords = set('AI')
with open(fname) as f:
    for w in f:
        allwords.add(w.strip())

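def parse(data, result=None):
    # result holds the words found so far, in right-to-left order
    if result is None:
        result = []
    if data in allwords:
        # The remaining prefix is itself a word: emit one complete split
        # (a prefix that is a word is not broken down any further)
        result.append(data)
        yield result[::-1]
    else:
        # Peel a known word off the right end and recurse on what is left
        for i in range(1, len(data)):
            first, last = data[:i], data[i:]
            if last in allwords:
                yield from parse(first, result + [last])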

# Test

data = (
    'HELLOHOWAREYOU',
    'THISEXAMPLEWORKSWELL',
    'ISTHEREAFASTWAY',
    'ONE',
    'TWOWORDS',
)

for s in data:
    print(s)
    for u in parse(s):
        print(u)
    print('')

Output

HELLOHOWAREYOU
['HELL', 'OHO', 'WARE', 'YOU']
['HELLO', 'HO', 'WARE', 'YOU']
['HELLO', 'HOW', 'ARE', 'YOU']
['HELL', 'OH', 'OW', 'ARE', 'YOU']
['HELLO', 'HOW', 'A', 'RE', 'YOU']
['HELL', 'OH', 'OW', 'A', 'RE', 'YOU']

THISEXAMPLEWORKSWELL
['THIS', 'EXAMPLE', 'WORK', 'SWELL']
['THIS', 'EX', 'AMPLE', 'WORK', 'SWELL']
['THIS', 'EXAMPLE', 'WORKS', 'WELL']
['THIS', 'EX', 'AMPLE', 'WORKS', 'WELL']

ISTHEREAFASTWAY
['I', 'ST', 'HER', 'EA', 'FAS', 'TWAY']
['IS', 'THERE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HERE', 'A', 'FAS', 'TWAY']
['IS', 'THE', 'RE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HE', 'RE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HER', 'EA', 'FAST', 'WAY']
['IS', 'THERE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HERE', 'A', 'FAST', 'WAY']
['IS', 'THE', 'RE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HE', 'RE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HER', 'EA', 'FA', 'ST', 'WAY']
['IS', 'THERE', 'A', 'FA', 'ST', 'WAY']
['I', 'ST', 'HERE', 'A', 'FA', 'ST', 'WAY']
['IS', 'THE', 'RE', 'A', 'FA', 'ST', 'WAY']
['I', 'ST', 'HE', 'RE', 'A', 'FA', 'ST', 'WAY']

ONE
['ONE']

TWOWORDS
['TWO', 'WORDS']


This code was written for Python 3, but you can make it run on Python 2 by changing

yield from parse(first, result + [last])

to

for seq in parse(first, result + [last]):
    yield seq


BTW, we can sort the output lists by length, i.e., the number of words in each list. This tends to put the more sensible results near the top.

for s in data:
    print(s)
    for u in sorted(parse(s), key=len):
        print(u)
    print('')
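
If only the fewest-words split is wanted, it may not be necessary to enumerate every combination first. The following is a minimal sketch of a different technique from the brute-force search above: a memoized search that keeps, for each position in the string, just one shortest split of the remainder. The names WORDS, MAX_LEN and fewest_words are hypothetical, and WORDS is a tiny stand-in for the SOWPODS set, so treat this as an illustration under those assumptions rather than a drop-in replacement.

```python
from functools import lru_cache

# Hypothetical stand-in for the SOWPODS set; a real run would load the full list.
WORDS = {'A', 'I', 'HELL', 'HELLO', 'HO', 'HOW', 'OH', 'OHO',
         'OW', 'ARE', 'RE', 'WARE', 'YOU'}
MAX_LEN = max(map(len, WORDS))  # no candidate word can be longer than this


def fewest_words(text):
    """Return one split of text into the fewest words from WORDS, or None."""

    @lru_cache(maxsize=None)
    def best(start):
        # Shortest split of text[start:] as a tuple of words, or None if
        # that suffix cannot be split into known words at all.
        if start == len(text):
            return ()
        candidates = []
        for end in range(start + 1, min(len(text), start + MAX_LEN) + 1):
            word = text[start:end]
            if word in WORDS:
                rest = best(end)
                if rest is not None:
                    candidates.append((word,) + rest)
        # Ties between equally short splits are broken arbitrarily.
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return None if result is None else list(result)


print(fewest_words('HELLOHOWAREYOU'))
```

Because each start position is solved once and cached, the work grows roughly with the string length times MAX_LEN instead of with the number of possible splits. Word count alone cannot tell 'HELLO HOW ARE YOU' apart from other four-word splits, so the result may still need the manual or natural language filtering mentioned above. The recursion depth grows with the input length, so very long strings might need an iterative rewrite.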
