使用字典标记标记句子中的单词 [英] Tagging words in sentences using dictionares

查看:188
本文介绍了使用字典标记标记句子中的单词的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的语料库超过1​​0万个句子,而且我有词典.我想匹配语料库中的单词并在句子中标记它们

I have a corpus of more than 100k sentences and i have dictionary. i want to match the words in the corpus and tagged them in the sentences

语料库文件"sentences.txt"

corpus file "sentences.txt"

Hello how are you doing. Headache is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.
she is doing well
he has psychological problems

字典文件"dict.csv"

Dictionary file "dict.csv"

abc, anxiety, disorder
def, Headache, symptom
hij, Malaria, virus
klm, headache, symptom

我的python程序

import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams

import codecs

with open('dictionary.csv','r') as csvFile:
    reader = csv.reader(csvFile)
    myfile = open("sentences.txt", "rt")
    my3file = open("tagged_sentences.txt", "w")
    hay = myfile.read()
    myfile.close()

for row in reader:
    needle = row[1]
    needle_length = len(needle.split())
    max_sim_val = 0.9
    max_sim_string = u""
    for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
        hay_ngram = u" ".join(ngram)

        similarity = SM(None, hay_ngram, needle).ratio()
        if similarity > max_sim_val:
            max_sim_val = similarity
            max_sim_string = hay_ngram
            str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
            str1 = max_sim_string , row[2]
            for line in hay.splitlines():
                if max_sim_string in line:
                    tag_sent = line.replace(max_sim_string, str1.__str__())
                    my3file.writelines(tag_sent + '\n')
                    print(tag_sent)
            break

csvFile.close()

我的当前输出是

 he has ('anxiety', ' disorder') thats why he is behaving like that.
 ('Malaria', ' virus') can be cure
 Hello how are you doing. ('Headache', ' symptom') is dangerous

我希望输出为.我希望它在同一文件"sentences.txt"中标记句子中的单词或将其写入新文件"myfile3.txt."而不会打扰句子的顺序或完全忽略(不添加)

I want my output as. i want it tags the words in the sentences in the same file "sentences.txt" or write it in new file "myfile3.txt. without disturbing the order of sentences or totally ignore (not adding) it

 Hello how are you doing. ('Headache', 'symptom') is dangerous
 ('Malaria', ' virus') can be cure.
 he has ('anxiety', ' disorder') thats why he is behaving like that
 she is doing well
 he has psychological problems

推荐答案

无需在代码中进行太多更改,这应该可以使它工作:

Without changing much in your code this should make it work:

...
phrases = []
for row in reader:
    needle = row[1]
    needle_length = len(needle.split())
    max_sim_val = 0.9
    max_sim_string = u""
    for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
        hay_ngram = u" ".join(ngram)

        similarity = SM(None, hay_ngram, needle).ratio()
        if similarity > max_sim_val:
            max_sim_val = similarity
            max_sim_string = hay_ngram
            str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
            str1 = max_sim_string , row[2]
            phrases.append((max_sim_string, row[2]))

for line in hay.splitlines():
    if any(max_sim_string in line for max_sim_string, _ in phrases):
        for phrase in phrases:
            max_sim_string, _ = phrase
            if max_sim_string in line:
                tag_sent = line.replace(max_sim_string, phrase.__str__())
                my3file.writelines(tag_sent + '\n')
                print(tag_sent)
                break        
    else:
        my3file.writelines(line + '\n')

csvFile.close()

这篇关于使用字典标记标记句子中的单词的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆