Is there an easy way to generate a probable list of words from an unspaced sentence in Python?
Question
I have some text:
s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
I'd like to parse this into its individual words. I quickly looked into enchant and nltk, but didn't see anything that looked immediately useful. If I had time to invest in this, I'd look into writing a dynamic program that uses enchant's ability to check whether a word is English. I would have thought there'd be something online that already does this; am I wrong?
Recommended answer
Greedy approach using a trie
Try this using Biopython (pip install biopython):
from Bio import trie
import string


def get_trie(dictfile='/usr/share/dict/american-english'):
    # Build a trie keyed by every word in the dictionary file.
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                # Python 2: words with non-ASCII bytes raise
                # UnicodeDecodeError here and are skipped.
                word = word.encode(encoding='ascii', errors='ignore')
                tr[word] = len(word)
                assert tr.has_key(word), "Missing %s" % word
            except UnicodeDecodeError:
                pass
    return tr


def get_trie_word(tr, s):
    # Return the longest prefix of s found in the trie, plus the remainder.
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s


def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:
            # No dictionary word starts here; stop instead of looping forever.
            break
        print(word)


if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)
Result
>>> if __name__ == '__main__':
...     s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
...     s = s.strip(string.punctuation)
...     s = s.replace(" ", '')
...     s = s.lower()
...     main(s)
...
image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches
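The code above is Python 2 and depends on Bio.trie plus a system dictionary file; as far as I can tell, recent Biopython releases no longer ship Bio.trie. As a rough sketch of the same greedy longest-prefix idea, here is a Python 3 version that uses a plain set instead of a trie. The WORDS vocabulary and the names longest_prefix and greedy_split are made up for illustration; a real run would load a full word list.

```python
# Hypothetical toy vocabulary; a real run would load a full word list instead.
WORDS = {"image", "classification", "methods", "can", "be", "roughly",
         "divided", "into", "two", "broad", "families", "of", "approaches",
         "a", "in", "to", "class", "method", "me"}


def longest_prefix(words, s):
    """Return (word, rest) for the longest prefix of s found in words,
    or (None, s) when no prefix matches."""
    for end in range(len(s), 0, -1):
        if s[:end] in words:
            return s[:end], s[end:]
    return None, s


def greedy_split(words, s):
    """Split s greedily; raise ValueError if the greedy choice gets stuck."""
    result = []
    while s:
        word, s = longest_prefix(words, s)
        if word is None:
            raise ValueError("no dictionary word starts %r" % s)
        result.append(word)
    return result


text = "imageclassificationmethodscanberoughlydividedintotwobroadfamiliesofapproaches"
print(greedy_split(WORDS, text))
```

With this toy vocabulary the split matches the result listed above, even though shorter candidates such as "class", "method" and "in" are also present, because the longest prefix is always tried first. A set lookup per candidate prefix is slower than walking a trie, but it keeps the sketch dependency-free.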
Caveats
There are degenerate cases in English that this will not work for: because the matcher always commits to the longest dictionary prefix, it can pick a word that leaves a remainder no dictionary word matches. You need to use backtracking to deal with those, but this should get you started.
>>> main("expertsexchange")
experts
exchange
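To make the backtracking mentioned in the caveats concrete, here is a minimal Python 3 sketch, not part of the original answer. It still tries the longest candidate first, but falls back to shorter prefixes when the remainder cannot be split, and memoizes results per start position, which is close to the dynamic program the question had in mind. The tiny WORDS set and the function name segment are assumptions chosen so that the greedy choice fails.

```python
from functools import lru_cache

# Hypothetical toy vocabulary chosen so that greedy longest-match gets stuck:
# taking "there" from "therein" leaves "in", which is not in the set.
WORDS = frozenset({"the", "there", "rein"})


def segment(words, s):
    """Return one list of words whose concatenation is s, or None."""
    max_len = max(map(len, words)) if words else 0

    @lru_cache(maxsize=None)
    def solve(start):
        # Everything up to the end of s has been consumed successfully.
        if start == len(s):
            return ()
        # Try longer candidates first, then back off to shorter ones.
        for end in range(min(len(s), start + max_len), start, -1):
            candidate = s[start:end]
            if candidate in words:
                rest = solve(end)
                if rest is not None:
                    return (candidate,) + rest
        return None

    result = solve(0)
    return None if result is None else list(result)


print(segment(WORDS, "therein"))
```

This prints ['the', 'rein'], whereas the greedy version would take "there" and then stall on "in". The sketch returns only the first segmentation it finds; ranking alternatives (the "expertsexchange" kind of ambiguity) would need something like word frequencies, which is outside what the answer covers.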