How to iterate each word through nltk synsets and store misspelled words in separate list?


Question

I am trying to take a text file with messages and iterate each word through the NLTK WordNet synsets function. I want to do this because I want to create a list of misspelled words. For example, if I do:

wn.synsets('dog')

I get the output:

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

Now, if the word is misspelled, like so:

wn.synsets('doeg')

I get the output:

[]

If I am returned an empty list, I want to save the misspelled word in another list like so, while continuing to iterate through the rest of the file:

misspelled_words = ['doeg']

I am at a loss how to do this. Here is my code below; I would need to do the iterating after the variable "chat_messages_tokenized". The names path contains words I want to drop:

import nltk
import csv
from nltk.tag import pos_tag
from nltk.corpus import wordnet as wn
from nltk.stem.snowball import SnowballStemmer


def text_function():
    #nltk.download('punkt')
    #nltk.download('averaged_perceptron_tagger')

    # Read in chat messages and names files
    chat_path = 'filepath.csv'
    try:
        with open(chat_path) as infile:
            chat_messages = infile.read()
    except Exception as error:
        print(error)
        return

    names_path = 'filepath.txt'
    try:
        with open(names_path) as infile:
            names = infile.read()
    except Exception as error:
        print(error)
        return

    chat_messages = chat_messages.split('Chats:')[1].strip()
    names = names.split('Name:')[1].strip().lower()

    chat_messages_tokenized = nltk.word_tokenize(chat_messages)
    names_tokenized = nltk.word_tokenize(names)

    # adding part of speech(pos) tag and dropping proper nouns
    pos_drop = pos_tag(chat_messages_tokenized)
    chat_messages_tokenized = [SnowballStemmer('english').stem(word.lower()) for word, pos in pos_drop if pos != 'NNP' and word not in names_tokenized]

    # This is where I am stuck: I need to iterate over
    # chat_messages_tokenized and check each word, e.g.:
    #
    # for word in chat_messages_tokenized:
    #     if not wn.synsets(word):
    #         print('empty list')

    # for s in wn.synsets('dog'):
    #     lemmas = s.lemmas()
    # for l in lemmas:
    #     if l.name() == stemmer:
    #         print(l.synset())

    csv_path = 'OutputFilePath.csv'
    try:
        with open(csv_path, 'w') as outfile:
            writer = csv.writer(outfile)
            for word in chat_messages_tokenized:
                writer.writerow([word])
    except Exception as error:
        print(error)
        return


if __name__ == '__main__':
    text_function()

Thanks in advance.

Answer

You already have the pseudocode in your explanation; you can code it just as you have described. Note that the function is wn.synsets (plural) — wn.synset (singular) expects a full synset name like 'dog.n.01' and will raise an error for a bare word:

misspelled_words = []                 # the list to store misspelled words
for word in chat_messages_tokenized:  # loop through each word
    if not wn.synsets(word):          # if there is no synset for this word
        misspelled_words.append(word) # add it to the misspelled word list

print(misspelled_words)
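To see the pattern in isolation, here is a minimal, self-contained sketch. The `synsets` function below is a stand-in for `wn.synsets`, backed by a tiny hypothetical vocabulary, so the snippet runs without downloading the WordNet corpus; in real code you would call `wn.synsets(word)` directly:

```python
# Stand-in for wn.synsets: a tiny hypothetical vocabulary (assumption for this demo)
KNOWN_WORDS = {'dog', 'cat', 'chase'}

def synsets(word):
    # Mimics wn.synsets' behavior: non-empty list for known words, [] otherwise
    return [word] if word in KNOWN_WORDS else []

tokens = ['dog', 'doeg', 'cat', 'caat']  # stand-in for chat_messages_tokenized

# Collect every word for which the synsets lookup comes back empty
misspelled_words = [word for word in tokens if not synsets(word)]
print(misspelled_words)  # ['doeg', 'caat']
```

One caveat: the pipeline in the question stems every token with SnowballStemmer before this check, and stems are often not dictionary words (for example, a word like "messages" stems to "messag"), so correctly spelled words can end up in `misspelled_words`. Running the synsets check on the unstemmed tokens avoids this.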
