split sentence without space in python (nltk?)


Question

I have a set of concatenated words and I want to split them into arrays.

For example:

split_word("acquirecustomerdata")
=> ['acquire', 'customer', 'data']

I found pyenchant, but it's not available for 64-bit Windows.

Then I tried to split each string into substrings and compare them against WordNet to find an equivalent word. For example:

from nltk.corpus import wordnet as wn
from nltk.metrics import edit_distance

def split_word(word):
    result = list()
    while len(word) > 2:
        i = 1
        found = True
        # Grow the prefix until it exactly matches a WordNet lemma
        # (bounded by len(word) so the loop cannot run forever).
        while found and i < len(word):
            i = i + 1
            synsets = wn.synsets(word[:i])
            for s in synsets:
                if edit_distance(s.name().split('.')[0], word[:i]) == 0:
                    found = False
                    break
        result.append(word[:i])
        word = word[i:]
    print(result)

But this solution is unreliable and far too slow. So I'm looking for your help.

Thanks

Answer

Check the Word Segmentation Task from Norvig's work:

from __future__ import division
from collections import Counter
import re, nltk

WORDS = nltk.corpus.brown.words()
COUNTS = Counter(WORDS)

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x]/N

P = pdist(COUNTS)

def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)

def product(nums):
    "Multiply the numbers together.  (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:]) 
            for i in range(start, min(len(text), L)+1)]

def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text: 
        return []
    else:
        candidates = ([first] + segment(rest) 
                      for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)

print(segment('acquirecustomerdata'))
#['acquire', 'customer', 'data']
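Note that `segment` as written re-solves the same suffixes over and over, which gets exponentially slow on longer strings; Norvig's notebook memoizes it. A minimal sketch of the memoized version (toy unigram counts stand in for the Brown corpus here, purely so it runs without NLTK data; with NLTK installed you would keep `COUNTS = Counter(nltk.corpus.brown.words())` as above):

```python
from collections import Counter
from functools import lru_cache

# Toy counts standing in for Counter(nltk.corpus.brown.words()) -- an
# assumption for illustration only.
COUNTS = Counter({'acquire': 50, 'customer': 80, 'data': 120,
                  'a': 500, 'at': 200})
N = sum(COUNTS.values())

def P(word):
    "Unigram probability of a single word."
    return COUNTS[word] / N

def Pwords(words):
    "Probability of a sequence, assuming independent words."
    result = 1.0
    for w in words:
        result *= P(w)
    return result

@lru_cache(maxsize=None)
def segment(text):
    "Most probable segmentation; tuples so results are hashable/cacheable."
    if not text:
        return ()
    candidates = [(text[:i],) + segment(text[i:])
                  for i in range(1, len(text) + 1)]
    return max(candidates, key=Pwords)

print(segment('acquirecustomerdata'))  # ('acquire', 'customer', 'data')
```

Memoization turns the recursion into dynamic programming: each suffix of the input is segmented once and cached, so the work is polynomial instead of exponential.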

For a better solution than this you can use bigrams/trigrams.
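One way to fold in bigrams is to score each word conditioned on its predecessor, backing off to the unigram model when a bigram is unseen. A sketch with toy counts (the names `COUNTS2`, `cPword`, and `Pwords2` and all the numbers are illustrative assumptions, not corpus statistics):

```python
from collections import Counter

# Toy unigram and bigram counts -- hypothetical data for illustration.
COUNTS1 = Counter({'acquire': 50, 'customer': 80, 'data': 120})
COUNTS2 = Counter({('acquire', 'customer'): 30, ('customer', 'data'): 40})
N = sum(COUNTS1.values())

def Pword(word):
    "Unigram probability."
    return COUNTS1[word] / N

def cPword(word, prev):
    "P(word | prev), backing off to unigrams when the bigram is unseen."
    if COUNTS2[(prev, word)] > 0 and COUNTS1[prev] > 0:
        return COUNTS2[(prev, word)] / COUNTS1[prev]
    return Pword(word)

def Pwords2(words, prev='<S>'):
    "Probability of a word sequence under the backoff bigram model."
    result = 1.0
    for w in words:
        result *= cPword(w, prev)
        prev = w
    return result

print(Pwords2(['acquire', 'customer', 'data']))  # ~0.06 with these toy counts
```

Swapping `Pwords` for a bigram scorer like this in `segment` rewards segmentations whose adjacent words actually co-occur in the corpus, rather than treating every word as independent.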

For more examples, see: Word Segmentation Task
