Find similar sentences between two documents and calculate a similarity score for each section of the documents

I took this example from the web. My first document contains:

Document 1:

Purpose of visit : For physical check up.

History of patient : This is the first admission for this 56 year old woman, who states she was in her usual state of good health until one week prior to admission. At that time she noticed the abrupt onset (over a few seconds to a minute) of chest pain which she describes as dull and aching in character. The pain began in the left para-sternal area and radiated up to her neck.

Medications : 1. Critizin. 2. p.n.b.s

Review of Systems :

HEENT:

1 or 2 beers each weekend; 1 glass of wine once a week with dinner.

Cadiovascular:

See HPI

Document 2 contains:

Purpose of visit : For physical check up.

History of patient : This is the first admission for this 56 year old woman, who states she was in her usual state of good health until one week prior to admission. At that time she noticed the abrupt onset (over a few seconds to a minute) of chest pain which she describes as dull and aching in character. The pain began in the left para-sternal area and radiated up to her neck. She does not smoke nor does she have diabetes. She was diagnosed with hypertension 3 years ago and had a TAH with BSO 6 years ago. She is not on hormone replacement therapy. There is a family history of premature CAD. She does not know her cholesterol level.

Medications : 1. Critizin. 2. Flexon

Review of Systems :

HEENT:

1 or 2 beers each weekend; 1 glass of wine once a week with dinner.

Cadiovascular: See HPI

Genitourinary: No complaints of dysuria, nocturia, polyuria, hematuria, or vaginal bleeding.

I was thinking of splitting each line of the file on (.) and splitting the sections on (:). But sometimes the file also contains numbers such as 3.5, and in the medication section all the medicines are separated by (.), like "medicine 1 hello. 2 hi."

How can I calculate a similarity score between these sections of the two files?

Solution

You can use the difflib module.

This module provides classes and functions for comparing sequences. It can be used, for example, to compare files, and it can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also the filecmp module.

In your case, you need the difflib.SequenceMatcher class, which compares pairs of sequences of any type, so long as the sequence elements are hashable.
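Because the elements only need to be hashable, the same class also accepts lists of words instead of raw strings. The sketch below is an illustration added here, not part of the original answer: it compares the two medication lines from the question word by word, and the variable names are made up for the example.

```python
from difflib import SequenceMatcher

# The two medication lines from the question's sample documents.
meds_1 = "1. Critizin. 2. p.n.b.s"
meds_2 = "1. Critizin. 2. Flexon"

# Splitting on whitespace gives lists of word tokens. Strings are hashable,
# so these lists are valid input for SequenceMatcher.
words_1 = meds_1.split()
words_2 = meds_2.split()

# None means no element is treated as junk.
matcher = SequenceMatcher(None, words_1, words_2)
word_ratio = matcher.ratio()
print(word_ratio)  # 0.75: three of the four tokens on each side match
```

Comparing tokens rather than characters means a changed drug name counts as one changed element, whatever its length.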

Example:

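from difflib import SequenceMatcher
text_1 = "private Thread currentThread;"
text_2 = "private volatile Thread currentThread;"
# The first argument is the "junk" predicate: here, spaces are
# treated as junk when looking for matching blocks.
s = SequenceMatcher(lambda x: x == " ",
                    text_1,
                    text_2)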

To measure the similarity of the sequences, use ratio(), which returns a float in [0, 1]. As a rule of thumb, a ratio() value over 0.6 means the sequences are close matches.

>>> s.ratio()
0.8656716417910447
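To get one score per section, one possible approach is to split each document on its header lines and run SequenceMatcher on the bodies that share a header. The following is a minimal sketch under stated assumptions, not a definitive implementation: it assumes every header is a label of letters and spaces at the start of a line followed by a colon, as in the sample documents, and the helpers split_sections and section_scores are hypothetical names introduced for this illustration. The sample texts are shortened excerpts of the two documents in the question.

```python
import re
from difflib import SequenceMatcher

# Assumed header shape: letters and spaces at the start of a line, then ':'.
# Any text after the colon is the beginning of the section body.
HEADER = re.compile(r"^([A-Za-z][A-Za-z ]*?)\s*:\s*(.*)$")


def split_sections(text):
    """Map each section header to its body text (hypothetical helper)."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            current = match.group(1)
            sections[current] = match.group(2).strip()
        elif current is not None:
            # A line without a header continues the current section.
            sections[current] = (sections[current] + " " + line).strip()
    return sections


def section_scores(text_a, text_b):
    """Return {header: ratio} for every header found in either document."""
    a, b = split_sections(text_a), split_sections(text_b)
    scores = {}
    for header in sorted(set(a) | set(b)):
        # A section missing from one document is compared against "".
        matcher = SequenceMatcher(None, a.get(header, ""), b.get(header, ""))
        scores[header] = matcher.ratio()
    return scores


doc_1 = """Purpose of visit : For physical check up.
Medications : 1. Critizin. 2. p.n.b.s
Cadiovascular:
See HPI
"""

doc_2 = """Purpose of visit : For physical check up.
Medications : 1. Critizin. 2. Flexon
Cadiovascular: See HPI
Genitourinary: No complaints of dysuria, nocturia, polyuria, hematuria, or vaginal bleeding.
"""

scores = section_scores(doc_1, doc_2)
for header, score in scores.items():
    print(f"{header}: {score:.2f}")
```

Since the text is split only on header lines and never on periods, values such as 3.5 and the numbered medication list stay inside their section. Two caveats: a body line that happens to start with words followed by a colon would be mistaken for a header, and ratio() returns 1.0 when both bodies are empty, so header-only sections such as "Review of Systems" score as identical.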
