Tagging words in sentences using dictionaries
Problem description
I have a corpus of more than 100k sentences and a dictionary. I want to match the dictionary words against the corpus and tag them in the sentences.
Corpus file "sentences.txt":
Hello how are you doing. Headache is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.
she is doing well
he has psychological problems
Dictionary file "dict.csv":
abc, anxiety, disorder
def, Headache, symptom
hij, Malaria, virus
klm, headache, symptom
My Python program:
import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams
import codecs

with open('dictionary.csv','r') as csvFile:
    reader = csv.reader(csvFile)
    myfile = open("sentences.txt", "rt")
    my3file = open("tagged_sentences.txt", "w")
    hay = myfile.read()
    myfile.close()

    for row in reader:
        needle = row[1]
        needle_length = len(needle.split())
        max_sim_val = 0.9
        max_sim_string = u""
        for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
            hay_ngram = u" ".join(ngram)
            similarity = SM(None, hay_ngram, needle).ratio()
            if similarity > max_sim_val:
                max_sim_val = similarity
                max_sim_string = hay_ngram
                str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
                str1 = max_sim_string , row[2]
                for line in hay.splitlines():
                    if max_sim_string in line:
                        tag_sent = line.replace(max_sim_string, str1.__str__())
                        my3file.writelines(tag_sent + '\n')
                        print(tag_sent)
                        break
csvFile.close()
My current output is:
he has ('anxiety', ' disorder') thats why he is behaving like that.
('Malaria', ' virus') can be cure
Hello how are you doing. ('Headache', ' symptom') is dangerous
I want the output shown below. The words should be tagged in the sentences, either in the same file "sentences.txt" or in a new file "myfile3.txt", without disturbing the order of the sentences and without dropping the sentences that contain no match:
Hello how are you doing. ('Headache', 'symptom') is dangerous
('Malaria', ' virus') can be cure.
he has ('anxiety', ' disorder') thats why he is behaving like that
she is doing well
he has psychological problems
Recommended answer
Without changing much in your code, this should make it work:
...
phrases = []
for row in reader:
    needle = row[1]
    needle_length = len(needle.split())
    max_sim_val = 0.9
    max_sim_string = u""
    for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
        hay_ngram = u" ".join(ngram)
        similarity = SM(None, hay_ngram, needle).ratio()
        if similarity > max_sim_val:
            max_sim_val = similarity
            max_sim_string = hay_ngram
            str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
            str1 = max_sim_string , row[2]
            # Remember the matched phrase and its tag instead of writing right away
            phrases.append((max_sim_string, row[2]))

# Second pass: walk the sentences in their original order
for line in hay.splitlines():
    if any(max_sim_string in line for max_sim_string, _ in phrases):
        for phrase in phrases:
            max_sim_string, _ = phrase
            if max_sim_string in line:
                # Tag the first matching phrase only, then move on to the next line
                tag_sent = line.replace(max_sim_string, phrase.__str__())
                my3file.writelines(tag_sent + '\n')
                print(tag_sent)
                break
    else:
        # No phrase found in this line: write it unchanged
        my3file.writelines(line + '\n')
csvFile.close()
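For reference, the same idea can also be written as a single pass over the sentences, which keeps every line in its original position. The following is only a minimal sketch under a few assumptions that are not in the code above: the helper names load_entries and tag_line are hypothetical, the comparison is case-insensitive, each candidate window has exactly as many words as the dictionary term, and csv.reader is given skipinitialspace=True so that the tags come out as 'symptom' rather than ' symptom'. It uses only the standard library (difflib) and does not need nltk.

```python
import csv
import io
from difflib import SequenceMatcher


def load_entries(csv_text):
    """Parse rows of the form 'id, term, tag' into (term, tag) pairs (hypothetical helper)."""
    # skipinitialspace=True drops the blank that follows each comma.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [(row[1], row[2]) for row in reader if len(row) >= 3]


def tag_line(line, entries, threshold=0.9):
    """Return the line with each sufficiently similar phrase replaced by str((phrase, tag))."""
    words = line.split()
    out = []
    i = 0
    while i < len(words):
        best = None  # (similarity, number of words, tag)
        for term, tag in entries:
            n = len(term.split())
            window = words[i:i + n]
            if n == 0 or len(window) < n:
                continue
            candidate = " ".join(window)
            # Assumption: compare case-insensitively.
            sim = SequenceMatcher(None, candidate.lower(), term.lower()).ratio()
            if sim > threshold and (best is None or sim > best[0]):
                best = (sim, n, tag)
        if best is None:
            # No dictionary term is close enough: keep the word as it is.
            out.append(words[i])
            i += 1
        else:
            _, n, tag = best
            out.append(str((" ".join(words[i:i + n]), tag)))
            i += n
    # Rebuilding from words collapses runs of whitespace to single spaces.
    return " ".join(out)


# Sample data copied from the question.
DICT_CSV = (
    "abc, anxiety, disorder\n"
    "def, Headache, symptom\n"
    "hij, Malaria, virus\n"
    "klm, headache, symptom\n"
)
SENTENCES = [
    "Hello how are you doing. Headache is dangerous",
    "Malaria can be cure",
    "he has anxiety thats why he is behaving like that.",
    "she is doing well",
    "he has psychological problems",
]

ENTRIES = load_entries(DICT_CSV)
TAGGED = [tag_line(sentence, ENTRIES) for sentence in SENTENCES]
for tagged_sentence in TAGGED:
    print(tagged_sentence)
```

With the sample data this prints the five sentences in their original order, with Headache, Malaria and anxiety tagged and the last two sentences unchanged. Working line by line also avoids matching n-grams that span two sentences, which the whole-file hay.split() approach can do.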