Searching for a string in a large text file - profiling various methods in python


Question


This question has been asked many times. After spending some time reading the answers, I did some quick profiling to try out the various methods mentioned previously...

  • I have a 600 MB file with 6 million lines of strings (Category paths from DMOZ project).
  • The entry on each line is unique.
  • I want to load the file once & keep searching for matches in the data.

For each of the methods I tried below, I list the time taken to load the file, the search time for a negative match & the memory usage reported in the task manager:


1) set :
    (i)  data   = set(f.read().splitlines())
    (ii) result = search_str in data   

Load time ~ 10s, Search time ~ 0.0s, Memory usage ~ 1.2GB


2) list :
    (i)  data   = f.read().splitlines()
    (ii) result = search_str in data

Load time ~ 6s, Search time ~ 0.36s, Memory usage ~ 1.2GB


3) mmap :
    (i)  data   = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (ii) result = data.find(search_str)

Load time ~ 0s, Search time ~ 5.4s, Memory usage ~ NA


4) Hash lookup (using code from @alienhard below):   

Load time ~ 65s, Search time ~ 0.0s, Memory usage ~ 250MB
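
The @alienhard code referenced above isn't reproduced in this post. As a rough illustration of the idea it describes (keep only hashes and byte offsets in memory, leave the strings themselves on disk, and verify candidates by re-reading them), a minimal Python 3 sketch might look like this; the file name matches the snippets above, and the example search string is a made-up DMOZ-style path:

    from collections import defaultdict

    def build_index(path):
        # Map hash(line) -> byte offsets of the lines with that hash.
        # Only integers live in memory; the strings stay in the file.
        index = defaultdict(list)
        offset = 0
        with open(path, 'rb') as f:
            for line in f:
                index[hash(line.rstrip(b'\r\n'))].append(offset)
                offset += len(line)
        return index

    def lookup(path, index, search_str):
        # Hash collisions cannot cause false positives: every candidate
        # offset is re-read and compared against the query.
        key = search_str.encode('utf-8')
        with open(path, 'rb') as f:
            for off in index.get(hash(key), ()):
                f.seek(off)
                if f.readline().rstrip(b'\r\n') == key:
                    return True
        return False

    index = build_index('input.txt')   # one-off cost, analogous to the load time above
    print(lookup('input.txt', index, 'Top/Computers/Programming/Languages/Python'))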


5) File search (using code from @EOL below):
    with open('input.txt') as f:
        print search_str in f  # search_str ends with the ('\n' or '\r\n') as in the file

Load time ~ 0s, Search time ~ 3.2s, Memory usage ~ NA


6) sqlite (with primary index on url): 

Load time ~ 0s, Search time ~ 0.0s, Memory usage ~ NA
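
The exact schema behind these numbers isn't shown in the question. As a minimal sketch of what "sqlite with a primary index on url" could look like (Python 3; the database file and table names are assumptions), something along these lines would do:

    import sqlite3

    conn = sqlite3.connect('urls.db')   # disk-based database, hypothetical name
    conn.execute('CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)')

    # One-off load; the PRIMARY KEY is what builds the b-tree index used for lookups.
    with open('input.txt') as f:
        conn.executemany('INSERT OR IGNORE INTO urls VALUES (?)',
                         ((line.rstrip('\n'),) for line in f))
    conn.commit()

    def exists(search_str):
        # Indexed point query: a handful of page reads instead of a full scan.
        cur = conn.execute('SELECT 1 FROM urls WHERE url = ? LIMIT 1', (search_str,))
        return cur.fetchone() is not None

    print(exists('Top/Computers/Programming/Languages/Python'))

That same primary-key index is also what roughly doubles the size of the database file compared to the raw csv, as noted in the solution below.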


For my use case, it seems like going with the set is the best option as long as I have sufficient memory available. I was hoping to get some comments on these questions:

  1. A better alternative, e.g. sqlite?
  2. Ways to improve the search time using mmap. I have a 64-bit setup. [edit] e.g. bloom filters (see the sketch after this list)
  3. As the file size grows to a couple of GB, is there any way I can keep using 'set', e.g. split it in batches ...
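
On point 2, one way a Bloom filter could help (this is a sketch under assumptions, not a measured variant): build a compact bit array over all lines once, answer definite misses straight from memory, and only fall back to the mmap scan when the filter reports a possible hit. The filter size, hash count and md5-based hashing below are arbitrary choices; the file name matches the snippets above. Python 3:

    import hashlib
    import mmap

    class BloomFilter:
        def __init__(self, m_bits, k_hashes):
            self.m, self.k = m_bits, k_hashes
            self.bits = bytearray(m_bits // 8 + 1)

        def _positions(self, s):
            # Derive k bit positions from md5("i:string") for i = 0..k-1.
            for i in range(self.k):
                h = hashlib.md5(('%d:%s' % (i, s)).encode('utf-8')).hexdigest()
                yield int(h, 16) % self.m

        def add(self, s):
            for p in self._positions(s):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, s):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(s))

    bloom = BloomFilter(m_bits=8 * 64 * 1024 * 1024, k_hashes=5)   # ~64 MB of bits
    with open('input.txt', 'rb') as f:
        for line in f:
            bloom.add(line.rstrip(b'\r\n').decode('utf-8'))

    def search(search_str):
        if not bloom.might_contain(search_str):
            return False                      # definite miss, no scan at all
        with open('input.txt', 'rb') as f:    # possible hit: confirm with mmap
            data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                return data.find(search_str.encode('utf-8')) != -1
            finally:
                data.close()

Negative lookups (the case timed above) would then cost almost nothing, at the price of building the filter once up front.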

[edit 1] P.S. I need to search frequently, add/remove values and cannot use a hash table alone because I need to retrieve the modified values later.

Any comments/suggestions are welcome !

[edit 2] Update with results from methods suggested in answers
[edit 3] Update with sqlite results

Solution : Based on all the profiling & feedback, I think I'll go with sqlite. The second alternative would be method 4. One downside of sqlite is that the database size is more than double the size of the original csv file with urls. This is due to the primary index on url.

Answer

Variant 1 is great if you need to launch many sequential searches. Since a set is internally a hash table, it's rather good at searching. It takes time to build, though, and only works well if your data fit into RAM.

Variant 3 is good for very big files, because you have plenty of address space to map them and the OS caches enough data. You do a full scan, though; it can become rather slow once your data no longer fit into RAM.

SQLite is definitely a nice idea if you need several searches in a row and you can't fit the data into RAM. Load your strings into a table, build an index, and SQLite builds a nice b-tree for you. The tree can fit into RAM even if the data don't (it's a bit like what @alienhard proposed), and even if it doesn't, the amount of I/O needed is dramatically lower. Of course, you need to create a disk-based SQLite database. I doubt that memory-based SQLite will beat Variant 1 significantly.
