Search for a string from 100 million rows of strings


Problem description


I have a text file containing md5 hashes, 100 million rows of them. I have another, smaller file with a few thousand md5 hashes. I want to find the indices in the old, bigger file that correspond to the md5 hashes in this new, smaller file.


What is the most efficient way to do it? Is it possible to do it in 15 minutes or so?


I have tried lots of things, but they do not work. First I tried to import the bigger data into a database file and create an index on the md5 hash column. Creating this index takes forever, and I am not even sure it would speed up the queries much. Suggestions?

Recommended answer

Don't do this in a DB - use a simple program.

  1. Read the md5 hashes from the small file into a hash map in memory, which allows for fast look-ups.
  2. Then read through the md5s in the big file one row at a time, and check whether each row is in the hash map.
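The two steps above can be sketched as follows. This is a minimal sketch, not the answerer's actual code; the function name and the assumption of one lowercase hex md5 per line are hypothetical:

```python
def find_indices(small_path, big_path):
    """Return {hash: [row indices in the big file]} for hashes
    that appear in the small file. Assumes one md5 hash per line."""
    # Step 1: load the small file's hashes into a set for O(1) membership tests.
    with open(small_path) as f:
        wanted = {line.strip() for line in f if line.strip()}

    # Step 2: stream the big file one row at a time, never holding it
    # all in memory, and record the index of every matching row.
    indices = {}
    with open(big_path) as f:
        for i, line in enumerate(f):
            h = line.strip()
            if h in wanted:
                indices.setdefault(h, []).append(i)
    return indices
```

A Python `set` (or `dict`) is itself a hash table, so membership tests average O(1); the streaming read keeps memory proportional to the small file only.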


Average look-up time in the hash map should be close to O(1), so the processing time is basically bounded by how fast you can read through the big file.


The 15-minute target is easily met on today's hardware with this approach.
