How to return a Ruby array intersection with duplicate elements? (problem with bigrams in Dice Coefficient)
Question
I'm trying to script Dice's Coefficient, but I'm having a bit of a problem with the array intersection.
def bigram(string)
  string.downcase!
  bgarray = []
  bgstring = "%" + string + "#"
  bgslength = bgstring.length
  0.upto(bgslength - 2) do |i|
    bgarray << bgstring[i, 2]
  end
  return bgarray
end
def approx_string_match(teststring, refstring)
  test_bigram = bigram(teststring) #.uniq
  ref_bigram = bigram(refstring) #.uniq
  bigram_overlay = test_bigram & ref_bigram
  result = (2 * bigram_overlay.length.to_f) / (test_bigram.length.to_f + ref_bigram.length.to_f) * 100
  return result
end
The problem is, as & removes duplicates, I get stuff like this:
string1 = "Almirante Almeida Almada"
string2 = "Almirante Almeida Almada"
puts approx_string_match(string1, string2) # => 76.0
It should return 100.
The uniq method nails it, but there is information loss, which may bring unwanted matches in the particular dataset I'm working with.
How can I get an intersection with all duplicates included?
Recommended answer
As Yuval F said, you should use a multiset. However, there is no multiset in the Ruby standard library; take a look here and here.
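One way to get multiset behaviour without an extra library is to count occurrences in a Hash. The following is only a sketch of that idea, not something from the answer itself; the method name multiset_intersection is hypothetical.

```ruby
# Hypothetical helper: intersection that keeps duplicates, using per-element counts.
def multiset_intersection(a, b)
  # Count how many times each element is still available in b.
  remaining = Hash.new(0)
  b.each { |x| remaining[x] += 1 }

  # Keep an element of a only while b still has an unused copy of it.
  a.select do |x|
    if remaining[x] > 0
      remaining[x] -= 1
      true
    else
      false
    end
  end
end

p multiset_intersection(%w[al al lc lc ld], %w[al al lc ef]) # => ["al", "al", "lc"]
```

Each element is looked up in the Hash rather than searched for in an array, so this should scale better on long bigram lists, and neither input array is modified.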
If performance is not that critical for your application, you can still do it using Array with a little bit of code.
def intersect(a, b)
  b = b.dup # work on a copy so the caller's array is not modified
  a.inject([]) do |result, s|
    index = b.index(s)
    unless index.nil?
      result << s
      b.delete_at(index) # consume one matching occurrence from b
    end
    result
  end
end

a = ["al", "al", "lc", "lc", "ld"]
b = ["al", "al", "lc", "ef"]
puts intersect(a, b).inspect # ["al", "al", "lc"]
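To tie this back to the question, here is a rough, self-contained sketch of how a duplicate-preserving intersection could replace & in the Dice calculation. The names bigrams, intersect_with_duplicates and dice_percent are illustrative assumptions, and the "%" and "#" padding follows the question's code.

```ruby
# Bigrams of a padded, lowercased copy of the string (the input is not mutated).
def bigrams(string)
  padded = "%#{string.downcase}#"
  (0..padded.length - 2).map { |i| padded[i, 2] }
end

# Intersection that keeps duplicates: each match consumes one element of b.
def intersect_with_duplicates(a, b)
  pool = b.dup
  a.select do |item|
    index = pool.index(item)
    next false if index.nil?
    pool.delete_at(index)
    true
  end
end

# Dice's coefficient over bigrams, expressed as a percentage.
def dice_percent(test_string, ref_string)
  test = bigrams(test_string)
  ref = bigrams(ref_string)
  overlap = intersect_with_duplicates(test, ref)
  2.0 * overlap.length / (test.length + ref.length) * 100
end

puts dice_percent("Almirante Almeida Almada", "Almirante Almeida Almada") # => 100.0
```

With duplicates preserved, identical strings share all 25 bigrams, so the score comes out at 100 instead of the 76 produced by &.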