How to find which items in a MASSIVE array appear more than once?

Question

This is a very simple question; which items appear in the list more than once?

array = ["mike", "mike", "mike", "john", "john", "peter", "clark"]

The correct answer is ["mike", "john"].

It seems we could just do:

array.select { |e| array.count(e) > 1 }.uniq

Problem solved. But wait! What if the array is REALLY big:

1_000_000.times { array.concat("1234567890abcdefghijklmnopqrstuvwxyz".split('')) }

It just so happens I need to figure out how to do this in a reasonable amount of time. We're talking millions and millions of records.

For what it's worth, this massive array is actually a sum of 10-20 smaller arrays. If it's easier to compare those, let me know - I'm stumped.

We're talking 10,000 to 10,000,000 lines per file, hundreds of files.

Recommended answer

Something like this:

items = 30_000_000

# Seed the array with random integers; many values will repeat.
array = items.times.map do
  rand(10_000_000)
end

puts "Done with seeding"
puts
puts "Checking what items appear more than once. Size: #{array.size}"
puts

# Count every item in a single pass, then keep the keys seen more than once.
def more_than_once(array)
  counts = Hash.new(0)
  array.each do |item|
    counts[item] += 1
  end

  counts.select do |_, count|
    count > 1
  end.keys
end

t1 = Time.now
res = more_than_once(array)
t2 = Time.now

p res.size
puts "Took #{t2 - t1}"

Does that work for you?

The duration is about 40s on my machine.
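
Since the question mentions that the big array is really a sum of 10-20 smaller arrays, here is a minimal sketch of the same single-pass idea applied to the parts directly, so they never have to be concatenated. The helper names duplicates_across and duplicates_tally are hypothetical and not part of the original answer; the second one assumes Ruby 2.7 or newer, where Enumerable#tally is available. No timing is claimed for either variant.

```ruby
require "set"

# Hypothetical helper (not from the original answer): finds items that occur
# more than once across several arrays without concatenating them first.
def duplicates_across(arrays)
  seen = Set.new        # every item encountered so far
  duplicates = Set.new  # items encountered at least twice

  arrays.each do |array|
    array.each do |item|
      # Set#add? returns nil when the item is already in the set.
      duplicates << item unless seen.add?(item)
    end
  end

  duplicates.to_a
end

# Single-array variant built on Enumerable#tally (assumes Ruby 2.7+).
def duplicates_tally(array)
  array.tally.select { |_, count| count > 1 }.keys
end

parts = [%w[mike mike john], %w[mike john peter], %w[clark]]
p duplicates_across(parts)         # => ["mike", "john"]
p duplicates_tally(parts.flatten)  # => ["mike", "john"]
```

The two-set version keeps only membership information rather than a counter per item, which may help a little with memory when only the "more than once" answer is needed. The same inner loop should also work when the items come from files read line by line (for example with File.foreach) instead of in-memory arrays.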
