Searching for a string in a large text file - profiling various methods in Python

Question

This question has been asked many times. After spending some time reading the answers, I did some quick profiling to try out the various methods mentioned previously...

  • I have a 600 MB file with 6 million lines of strings (Category paths from DMOZ project).
  • The entry on each line is unique.
  • I want to load the file once & keep searching for matches in the data

For each method I tried below, I list the time taken to load the file, the search time for a negative match, and the memory usage shown in Task Manager.

1) set :
    (i)  data   = set(f.read().splitlines())
    (ii) result = search_str in data   

Load time ~ 10s, Search time ~ 0.0s, Memory usage ~ 1.2GB


2) list :
    (i)  data   = f.read().splitlines()
    (ii) result = search_str in data

Load time ~ 6s, Search time ~ 0.36s, Memory usage ~ 1.2GB


3) mmap :
    (i)  data   = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (ii) result = data.find(search_str)

Load time ~ 0s, Search time ~ 5.4s, Memory usage ~ NA


4) Hash lookup (using code from @alienhard below):   

Load time ~ 65s, Search time ~ 0.0s, Memory usage ~ 250MB


5) File search (using code from @EOL below):   
   with open('input.txt') as f:
       print search_str in f #search_str ends with the ('\n' or '\r\n') as in the file

Load time ~ 0s, Search time ~ 3.2s, Memory usage ~ NA


6) sqlite (with primary index on url): 

Load time ~ 0s, Search time ~ 0.0s, Memory usage ~ NA
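
The code behind method 4 is not reproduced in this post, so the following is only a sketch of one way a hash lookup of this kind could work, not @alienhard's actual code. It assumes the idea is to keep a small in-memory map from each line's hash to its byte offset in the file, and to confirm a hit by reading the line back from disk. The helper names `build_offset_index` and `lookup` are made up for illustration.

```python
from collections import defaultdict

def build_offset_index(path):
    """Map hash(line) -> list of byte offsets where such a line starts."""
    index = defaultdict(list)
    with open(path, "rb") as f:
        offset = 0
        for raw in f:  # binary mode: lines are split on b"\n"
            index[hash(raw.rstrip(b"\r\n"))].append(offset)
            offset += len(raw)
    return index

def lookup(path, index, search_str, encoding="utf-8"):
    """Return True if search_str is a complete line in the file."""
    needle = search_str.encode(encoding)
    # .get() avoids creating empty entries in the defaultdict on a miss.
    offsets = index.get(hash(needle), ())
    with open(path, "rb") as f:
        for offset in offsets:
            # Different lines can share a hash, so verify against the file.
            f.seek(offset)
            if f.readline().rstrip(b"\r\n") == needle:
                return True
    return False
```

A negative match usually finds no entry in the index and never touches the disk, which would be consistent with the near-zero search time reported above. The strings themselves stay on disk, so the index holds only hashes and offsets.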


For my use case, it seems like going with the set is the best option as long as I have sufficient memory available. I was hoping to get some comments on these questions :

  1. Is there a better alternative, e.g. sqlite?
  2. Are there ways to improve the search time using mmap? I have a 64-bit setup. [edit] e.g. bloom filters
  3. As the file size grows to a couple of GB, is there any way I can keep using 'set', e.g. by splitting it into batches?

[edit 1] P.S. I need to search frequently, add/remove values and cannot use a hash table alone because I need to retrieve the modified values later.

Any comments/suggestions are welcome !

[edit 2] Update with results from methods suggested in answers [edit 3] Update with sqlite results

Solution: Based on all the profiling & feedback, I think I'll go with sqlite. The second alternative is method 4. One downside of sqlite is that the database is more than double the size of the original csv file with urls. This is due to the primary index on url.

Recommended answer

Variant 1 is great if you need to launch many sequential searches. Since set is internally a hash table, it's rather good at search. It takes time to build, though, and only works well if your data fit into RAM.
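
To see why the hash table matters, here is a small sketch that times a negative lookup against a set and against a list built from the same strings. The data is synthetic (made-up category paths, far fewer than the 6 million lines in the question), so the absolute numbers will not match the figures above; only the gap between the two is of interest.

```python
import timeit

# Synthetic stand-in for the DMOZ category paths (an assumption for illustration).
lines = [f"Top/Category/{i}" for i in range(100_000)]
as_set = set(lines)    # hash table: lookup cost does not grow with size
as_list = list(lines)  # plain sequence: a miss has to scan every element

missing = "Top/Category/does-not-exist"

set_time = timeit.timeit(lambda: missing in as_set, number=20)
list_time = timeit.timeit(lambda: missing in as_list, number=20)

print(f"set:  {set_time:.6f}s for 20 negative lookups")
print(f"list: {list_time:.6f}s for 20 negative lookups")
```

A negative match is the worst case for the list, since every element is compared before the search gives up.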

Variant 3 is good for very big files, because you have plenty of address space to map them and the OS caches enough data. You do a full scan, though; it can become rather slow once your data stop fitting into RAM.
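
A minimal Python 3 sketch of the mmap scan, assuming a file of newline-terminated entries and a made-up helper name `line_in_file`. Two details are easy to miss: `mmap.find` works on bytes and returns -1 on a miss (which is truthy), and it matches substrings, so the sketch also checks that a hit starts and ends on a line boundary.

```python
import mmap

def line_in_file(path, search_str, encoding="utf-8"):
    """Return True if search_str occurs as a complete line in the file."""
    needle = search_str.encode(encoding)
    with open(path, "rb") as f:
        try:
            data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except ValueError:
            return False  # an empty file cannot be memory-mapped
        with data:
            pos = data.find(needle)
            while pos != -1:
                end = pos + len(needle)
                starts_line = pos == 0 or data[pos - 1:pos] == b"\n"
                ends_line = end == len(data) or data[end:end + 1] in (b"\n", b"\r")
                if starts_line and ends_line:
                    return True
                # Substring hit inside a longer line: keep scanning.
                pos = data.find(needle, pos + 1)
    return False
```

This is still a linear scan of the whole file for a negative match; the mapping only saves the up-front load and leaves caching to the OS.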

SQLite is definitely a nice idea if you need several searches in a row and you can't fit the data into RAM. Load your strings into a table, build an index, and SQLite builds a nice b-tree for you. The tree can fit into RAM even if the data don't (it's a bit like what @alienhard proposed), and even if it doesn't, the amount of I/O needed is dramatically lower. Of course, you need to create a disk-based SQLite database. I doubt that memory-based SQLite will beat Variant 1 significantly.
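
The steps above can be sketched with Python's standard `sqlite3` module. The table name `paths` and the helper names are made up for illustration; the `url` column follows the question's "primary index on url". Declaring the column as `TEXT PRIMARY KEY` makes SQLite maintain an index on it automatically, so no separate `CREATE INDEX` is needed in this sketch.

```python
import sqlite3

def build_index(db_path, lines):
    """Create (or reuse) a disk-based database and load one string per row."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS paths (url TEXT PRIMARY KEY)")
    with conn:  # one transaction for the whole bulk insert
        conn.executemany(
            "INSERT OR IGNORE INTO paths (url) VALUES (?)",
            ((line,) for line in lines),
        )
    return conn

def contains(conn, search_str):
    """Exact-match lookup served by the primary-key index."""
    row = conn.execute(
        "SELECT 1 FROM paths WHERE url = ? LIMIT 1", (search_str,)
    ).fetchone()
    return row is not None
```

Loading could then look like `build_index("paths.db", (line.rstrip("\r\n") for line in open("input.txt")))`, after which `contains(conn, search_str)` answers each query. Adding and removing values later are ordinary `INSERT` and `DELETE` statements on the same table.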
