Is there an easy way to generate a probable list of words from an unspaced sentence in Python?


Question

I have some text:

 s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"

I'd like to parse this into its individual words. I quickly looked into enchant and nltk, but didn't see anything that looked immediately useful. If I had time to invest in this, I'd look into writing a dynamic program that uses enchant's ability to check whether a word is English. I would have thought there'd be something online that does this already. Am I wrong?

Recommended answer

Greedy approach using a trie

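Try this using Biopython (pip install biopython). The code below targets Python 2 and the Bio.trie module, which newer Biopython releases no longer include: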

from Bio import trie
import string


def get_trie(dictfile='/usr/share/dict/american-english'):
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                word = word.encode(encoding='ascii', errors='ignore')
                tr[word] = len(word)
                assert tr.has_key(word), "Missing %s" % word
            except UnicodeDecodeError:
                pass
    return tr


def get_trie_word(tr, s):
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word): 
            return word, s[end + 1: ]
    return None, s

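def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:
            # No dictionary word starts here; stop instead of looping forever.
            print("unmatched: " + s)
            break
        print(word)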

if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)

Result

>>> if __name__ == '__main__':
...     s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
...     s = s.strip(string.punctuation)
...     s = s.replace(" ", '')
...     s = s.lower()
...     main(s)
... 
image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches
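
For readers without access to Bio.trie, the same greedy longest-match idea can be sketched with a plain Python set. This is only an illustration, not the answer's implementation: the function name `greedy_split` and the tiny `WORDS` vocabulary are assumptions made up for the example, and a real run would load a full dictionary file instead.

```python
def greedy_split(text, words):
    """Repeatedly peel off the longest prefix of `text` that is in `words`.

    Returns the list of words, or None if the greedy pass gets stuck.
    """
    result = []
    while text:
        # Try the longest candidate prefix first, then shorter ones.
        for end in range(len(text), 0, -1):
            if text[:end] in words:
                result.append(text[:end])
                text = text[end:]
                break
        else:
            # No known word starts at this position.
            return None
    return result


# Hypothetical toy vocabulary, standing in for a real dictionary file.
WORDS = {"image", "classification", "methods", "the", "then", "now"}

print(greedy_split("imageclassificationmethods", WORDS))
print(greedy_split("thenow", WORDS))
```

With this vocabulary the first call splits cleanly, while the second returns None: the greedy pass commits to "then" and is left with "ow", even though "the" + "now" would have worked.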

Caveats

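There are degenerate cases in English that this will not work for. Because the greedy search always commits to the longest matching prefix and never reconsiders, it can pick a word that leaves an unsplittable remainder, or pick one of several valid readings without reporting the others. You need to use backtracking to deal with those, but this should get you started.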

>>> main("expertsexchange")
experts
exchange
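
One way the backtracking mentioned above might look is sketched here. It is not part of the original answer: `all_splits` and the small `WORDS` set are hypothetical names chosen for the example, and memoization via `functools.lru_cache` is just one reasonable way to avoid re-solving the same suffix.

```python
from functools import lru_cache


def all_splits(text, words):
    """Return every way to split `text` into words drawn from `words`."""
    words = frozenset(words)

    @lru_cache(maxsize=None)
    def splits_from(start):
        # Reaching the end of the text counts as one (empty) split.
        if start == len(text):
            return [[]]
        found = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in words:
                # Extend every split of the remainder with this piece.
                for rest in splits_from(end):
                    found.append([piece] + rest)
        return found

    return splits_from(0)


# Hypothetical toy vocabulary, standing in for a real dictionary file.
WORDS = {"expert", "experts", "sex", "exchange", "change"}

for split in all_splits("expertsexchange", WORDS):
    print(" ".join(split))
```

Unlike the greedy pass, this returns both readings of the string, so a caller can rank them (for example by word frequency) rather than accept whichever one the longest prefix happens to produce. The number of splits can grow quickly on long inputs with a large dictionary.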
