How to find which items in a MASSIVE array appear more than once?
Problem description
This is a very simple question; which items appear in the list more than once?
array = ["mike", "mike", "mike", "john", "john", "peter", "clark"]
The correct answer is ["mike", "john"].
It seems we could just do:
array.select{ |e| array.count(e) > 1 }.uniq
Problem solved. But wait! What if the array is REALLY big:
1_000_000.times { array.concat("1234567890abcdefghijklmnopqrstuvwxyz".split('')) }
It just so happens I need to figure out how to do this in a reasonable amount of time. We're talking millions and millions of records.
For what it's worth, this massive array is actually a sum of 10-20 smaller arrays. If it's easier to compare those, let me know - I'm stumped.
We're talking 10,000 to 10,000,000 lines per file, hundreds of files.
Recommended answer
Something like this:
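items = 30_000_000
# Seed a large array of random integers; with 30M draws from 10M values, duplicates are guaranteed.
array = items.times.map do
  rand(10_000_000)
end
puts "Done with seeding"
puts
puts "Checking what items appear more than once. Size: #{array.size}"
puts
t1 = Time.now
# Count every item in a single pass, then keep the keys seen more than once.
def more_than_once(array)
  counts = Hash.new(0) # missing keys default to 0
  array.each do |item|
    counts[item] += 1
  end
  counts.select do |_, count|
    count > 1
  end.keys
end
res = more_than_once(array)
t2 = Time.now
p res.size
puts "Took #{t2 - t1}"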
Does that work for you?
The duration is about 40s on my machine.
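Since the question mentions that the big array is really 10-20 smaller arrays, here is a minimal sketch of the same single-pass idea that walks each source in turn instead of concatenating them first. It swaps the counting hash for two sets, which is a variation and not the method timed above. The helper name duplicates_across and the sample data are made up for illustration.

```ruby
require "set"

# Hypothetical helper (not part of the original answer): returns every item
# that occurs more than once across all of the given enumerables.
def duplicates_across(sources)
  seen = Set.new # items encountered at least once
  dups = Set.new # items encountered at least twice
  sources.each do |source|
    source.each do |item|
      # Set#add? returns nil when the item was already in the set.
      dups << item unless seen.add?(item)
    end
  end
  dups.to_a
end

# Made-up sample data standing in for the smaller arrays.
small_arrays = [
  ["mike", "john", "peter"],
  ["mike", "clark"],
  ["john", "mike"]
]

p duplicates_across(small_arrays) # => ["mike", "john"]
```

Any object that responds to each can serve as a source, so for line-based files an enumerator such as File.foreach(path, chomp: true) could probably be passed in place of an array (path being whatever file name applies). Memory use still grows with the number of distinct items. On Ruby 2.7 or newer, array.tally.select { |_, count| count > 1 }.keys is a shorter way to write the counting-hash version.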