How to compare sentences pairwise and calculate their similarity?
Problem description
How can I compare the first sentence with the second, the first with the third, and so on, and calculate the similarity of each pair using a shell script or bash?
I have sentences containing duplicate words, for example the input data below in the file my_text.txt. Duplicated words within a sentence, filler words, and non-alphabetical characters should be ignored.
Shell Script
Linux Shell Script
Shell or bash are fun
I used this shell script to find the similarity:
words=$(
< my_text.txt tr 'A-Z' 'a-z' |
grep -Eon '\b[a-z]*\b' |
grep -Fwvf <(printf %s\\n is a to be by the and for) |
sort -u | cut -d: -f2 | sort
)
union=$(uniq <<< "$words" | wc -l)
intersection=$(uniq -d <<< "$words" | wc -l)
echo "similarity is $(bc -l <<< "$intersection/$union")"
The script above calculates a single similarity across all sentences at once, but I want the similarity of every pair (e.g. 1:2, 1:3, 1:4, …, 2:3, 2:4, …, 3:4, ...).
I want to find the similarity as in these two examples:
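- First sentence and second sentence:
- Intersection of the two sentences:
Shell + Script
- Size of the union of the two sentences:
3
- Similarity:
0.66666666
- First sentence and third sentence:
- Intersection of the two sentences:
Shell
- Size of the union of the two sentences:
4
- Similarity:
0.25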
Can anyone help?
Recommended answer
With a small tweak to my answer to your previous question, still using GNU awk for FPAT and arrays of arrays:
$ cat tst.awk
BEGIN {
    split("is a to be by the and for",tmp)
    for (i in tmp) {
        stopwords[tmp[i]]
    }
    FPAT="[[:alnum:]_]+"
}
{
    for (i=1; i<=NF; i++) {
        word = tolower($i)
        if ( !(word in stopwords) ) {
            words[NR>1?2:1][word]
        }
    }
}
NR > 1 {
    numCommon = 0
    for (word in words[1]) {
        if (word in words[2]) {
            numCommon++
        }
    }
    totWords = length(words[1]) + length(words[2]) - numCommon
    print (totWords ? numCommon / totWords : 0)
    delete words[2]
}

$ awk -f tst.awk file
0.666667
0.166667
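
The script above compares only the first sentence with each later one (1:2, 1:3, ...). The third sentence scores 0.166667 rather than the 0.25 from the question because "or" and "are" are not in the filler-word list, so the union has 6 words. If every pair is wanted (2:3 and so on), one possible approach is to store the distinct words of each sentence and loop over all pairs at the end. The following is a minimal sketch, not part of the original answer. It assumes one sentence per line, uses the same filler-word list, and sticks to POSIX awk (composite array keys) instead of GNU awk arrays of arrays. The function name pairwise_similarity and the "a:b value" output format are made up for illustration.

```shell
# Sketch: similarity (intersection size / union size) for every pair of lines.
# pairwise_similarity is a hypothetical helper name, not from the original answer.
pairwise_similarity() {
    awk '
    BEGIN {
        # Same filler-word list as the original script.
        n = split("is a to be by the and for", tmp, " ")
        for (i = 1; i <= n; i++) stop[tmp[i]] = 1
    }
    {
        # Lowercase, turn every run of non-letters into a space, split into words.
        line = tolower($0)
        gsub(/[^a-z]+/, " ", line)
        m = split(line, w, " ")
        for (i = 1; i <= m; i++) {
            if (w[i] in stop) continue
            if (!((NR, w[i]) in seen)) {
                seen[NR, w[i]] = 1            # word occurs in sentence NR
                count[NR]++                   # number of distinct words so far
                list[NR, count[NR]] = w[i]    # remember it for the pair loop
            }
        }
    }
    END {
        for (a = 1; a < NR; a++) {
            for (b = a + 1; b <= NR; b++) {
                common = 0
                for (k = 1; k <= count[a]; k++)
                    if ((b, list[a, k]) in seen) common++
                total = count[a] + count[b] - common   # size of the union
                printf "%d:%d %.6f\n", a, b, (total ? common / total : 0)
            }
        }
    }' "$1"
}
```

After sourcing the function, `pairwise_similarity my_text.txt` on the three sample sentences should print `1:2 0.666667`, `1:3 0.166667` and `2:3 0.142857`. All sentences are kept in memory, which is fine for small files but grows with input size.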