Split a sentence without spaces in Python (nltk?)
Question
I have a set of concatenated words and I want to split them into arrays.
For example:
split_word("acquirecustomerdata")
=> ['acquire', 'customer', 'data']
I found pyenchant, but it's not available for 64-bit Windows.
Then I tried to split each string into substrings and compare them against WordNet to find an equivalent word. For example:
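from nltk.corpus import wordnet as wn
from nltk.metrics import edit_distance

def split_word(self, word):
    result = list()
    while(len(word) > 2):
        i = 1
        found = True
        # Grow the prefix word[:i] until it exactly matches a WordNet lemma.
        # Note: if no prefix ever matches, this inner loop does not terminate.
        while(found):
            i = i + 1
            synsets = wn.synsets(word[:i])
            for s in synsets:
                if edit_distance(s.name().split('.')[0], word[:i]) == 0:
                    found = False
                    break
        result.append(word[:i])
        word = word[i:]
    print(result)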
But this solution is unreliable and takes too long. So I'm looking for your help.
Thanks
Recommended answer
Check the Word Segmentation Task from Norvig's work.
from __future__ import division  # true division on Python 2; no effect on Python 3
from collections import Counter
import nltk

# Requires the Brown corpus; run nltk.download('brown') once beforehand.
WORDS = nltk.corpus.brown.words()
COUNTS = Counter(WORDS)

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x]/N

P = pdist(COUNTS)

def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)

def product(nums):
    "Multiply the numbers together. (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:])
            for i in range(start, min(len(text), L)+1)]

def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text:
        return []
    else:
        candidates = ([first] + segment(rest)
                      for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)

print(segment('acquirecustomerdata'))
#['acquire', 'customer', 'data']
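The recursive segment above re-solves the same suffix of the text many times, so it slows down quickly as the input gets longer. Below is a minimal sketch of the same unigram idea with memoization. It assumes a tiny hand-made count table: TOY_COUNTS, segment_memo and the other names are made up for illustration, and the table only stands in for the Brown corpus counts so that the example runs without downloading anything.

```python
from collections import Counter
from functools import lru_cache

# Hypothetical toy counts, standing in for Counter(nltk.corpus.brown.words()).
TOY_COUNTS = Counter({
    "acquire": 5, "customer": 8, "data": 20,
    "a": 50, "custom": 6, "at": 30, "dat": 1,
})
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = 20  # same cap as L=20 in splits()


def word_prob(word):
    """Unigram probability; unseen words get probability 0."""
    return TOY_COUNTS[word] / TOTAL


def seq_prob(words):
    """Probability of a word sequence, assuming independent words."""
    prob = 1.0
    for w in words:
        prob *= word_prob(w)
    return prob


@lru_cache(maxsize=None)
def segment_memo(text):
    """Most probable segmentation of text, returned as a tuple of words.

    Each distinct suffix is solved once and then served from the cache.
    """
    if not text:
        return ()
    candidates = (
        (text[:i],) + segment_memo(text[i:])
        for i in range(1, min(len(text), MAX_WORD_LEN) + 1)
    )
    return max(candidates, key=seq_prob)


print(list(segment_memo("acquirecustomerdata")))
```

Tuples are used instead of lists because cached results are shared between calls and should not be mutated by the caller.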
For a better solution than this, you can use bigrams/trigrams.
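As a rough illustration of the bigram idea, the sketch below scores each word given the word before it and falls back to a discounted unigram score when the pair was never seen. Everything here is an assumption made for the example: the UNIGRAMS and BIGRAMS tables are toy data, the 0.5 backoff factor and the length-based penalty for unseen words are arbitrary choices, and segment_bigram is a hypothetical name rather than a function from nltk or from the answer above.

```python
from collections import Counter
from functools import lru_cache
from math import log10

# Hypothetical toy counts; real ones would come from a corpus.
UNIGRAMS = Counter({"acquire": 5, "customer": 8, "data": 20,
                    "custom": 6, "a": 50, "at": 30})
BIGRAMS = Counter({("acquire", "customer"): 3, ("customer", "data"): 4})
TOTAL = sum(UNIGRAMS.values())
START = "<S>"  # marker for "no previous word"


def unigram_prob(word):
    """Unigram probability, with a length-based penalty for unseen words."""
    if word in UNIGRAMS:
        return UNIGRAMS[word] / TOTAL
    return 1.0 / (TOTAL * 10 ** len(word))


def cond_prob(word, prev):
    """P(word | prev), backing off to a discounted unigram probability."""
    pair = (prev, word)
    if pair in BIGRAMS and prev in UNIGRAMS:
        return BIGRAMS[pair] / UNIGRAMS[prev]
    return 0.5 * unigram_prob(word)  # arbitrary backoff factor


@lru_cache(maxsize=None)
def segment_bigram(text, prev=START):
    """Return (log10 probability, tuple of words) for the best segmentation."""
    if not text:
        return 0.0, ()
    best = None
    for i in range(1, min(len(text), 20) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment_bigram(rest, first)
        # Log probabilities are summed to avoid underflow on long inputs.
        score = log10(cond_prob(first, prev)) + rest_score
        if best is None or score > best[0]:
            best = (score, (first,) + rest_words)
    return best


score, words = segment_bigram("acquirecustomerdata")
print(list(words))
```

With a real corpus, the pair counts could be built with something like Counter(nltk.bigrams(nltk.corpus.brown.words())), which yields the same (previous, word) tuple keys used here.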
For more examples, see: Word Segmentation Task