Searching for a string in a large text file - profiling various methods in Python

Views: 1045 | Published: 2020/4/29 3:28:01 | Tags: python performance search profiling large-files

Problem description

This question has been asked many times. After spending some time reading the answers, I did some quick profiling to try out the various methods mentioned previously...

I have a 600 MB file with 6 million lines of strings (Category paths from DMOZ project).

The entry on each line is unique.

I want to load the file once & keep searching for matches in the data

For each method I tried, the list below gives the time taken to load the file, the search time for a negative match, and the memory usage shown in the task manager.

1) set :
    (i)  data   = set(f.read().splitlines())
    (ii) result = search_str in data

Load time ~ 10s, Search time ~ 0.0s, Memory usage ~ 1.2GB

2) list :
    (i)  data   = f.read().splitlines()
    (ii) result = search_str in data

Load time ~ 6s, Search time ~ 0.36s, Memory usage ~ 1.2GB

3) mmap :
    (i)  data   = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (ii) result = data.find(search_str)

Load time ~ 0s, Search time ~ 5.4s, Memory usage ~ NA

4) Hash lookup (using code from @alienhard below):

Load time ~ 65s, Search time ~ 0.0s, Memory usage ~ 250MB

5) File search (using code from @EOL below):   
   with open('input.txt') as f:
       print(search_str in f)  # search_str must include the line ending ('\n' or '\r\n') as read from the file

Load time ~ 0s, Search time ~ 3.2s, Memory usage ~ NA

6) sqlite (with primary index on url):

Load time ~ 0s, Search time ~ 0.0s, Memory usage ~ NA
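
For anyone wanting to reproduce the load and search timings for methods 1 to 3, a small harness along the following lines may help. It is only a sketch: the sample file, the `Top/Category/...` paths, and the helper names (`timed`, `load_set`, `load_list`, `mmap_contains_line`) are made up for illustration, and it assumes Python 3 with `\n` line endings. Note that `mmap.find` returns -1 when nothing is found and matches any substring, so the sketch compares against -1 and wraps the needle in newlines to match whole lines only.

```python
import mmap
import os
import tempfile
import time


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def load_set(path):
    # Method 1: every line becomes a member of a set.
    with open(path, encoding="utf-8") as f:
        return set(f.read().splitlines())


def load_list(path):
    # Method 2: every line becomes an element of a list.
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


def mmap_contains_line(path, line):
    # Method 3, restricted to whole-line matches (assumes '\n' endings).
    needle = line.encode("utf-8")
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
            # The first line has no preceding newline, so check it separately.
            if data[: len(needle) + 1] == needle + b"\n":
                return True
            # find() returns -1 when absent, so compare explicitly.
            return data.find(b"\n" + needle + b"\n") != -1


# Tiny stand-in for the 600 MB file (hypothetical sample paths).
lines = [f"Top/Category/{i}" for i in range(100_000)]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8", newline="\n") as tmp:
    tmp.write("\n".join(lines) + "\n")
    path = tmp.name

missing = "Top/Category/not-there"  # negative match, as in the question

data_set, t_load_set = timed(lambda: load_set(path))
found_set, t_search_set = timed(lambda: missing in data_set)

data_list, t_load_list = timed(lambda: load_list(path))
found_list, t_search_list = timed(lambda: missing in data_list)

found_mmap, t_search_mmap = timed(lambda: mmap_contains_line(path, missing))
hit_mmap = mmap_contains_line(path, "Top/Category/42")
first_mmap = mmap_contains_line(path, "Top/Category/0")
partial_mmap = mmap_contains_line(path, "Top/Category")  # substring only

os.remove(path)

print(f"set : load {t_load_set:.3f}s, search {t_search_set:.6f}s")
print(f"list: load {t_load_list:.3f}s, search {t_search_list:.6f}s")
print(f"mmap: search {t_search_mmap:.6f}s")
```

On a file this small the absolute numbers mean little; the point is only the shape of the measurement, which can be pointed at the real file instead of the temporary one.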

For my use case, it seems like going with the set is the best option as long as I have sufficient memory available. I was hoping to get some comments on these questions :

Is there a better alternative, e.g. sqlite?
Are there ways to improve the search time using mmap? I have a 64-bit setup. [edit] e.g. bloom filters
As the file size grows to a couple of GB, is there any way I can keep using 'set', e.g. by splitting it in batches?

[edit 1] P.S. I need to search frequently, add/remove values and cannot use a hash table alone because I need to retrieve the modified values later.
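
To illustrate the constraint in the P.S., here is one possible way a hash lookup can still give access to the stored values: keep only a hash of each line in memory, mapped to the byte offset where that line starts in the file, and read the line back on demand. This is not @alienhard's code, which is not reproduced here; the function names `build_index` and `lookup` and the sample data are hypothetical, and the dict-of-lists layout is chosen for clarity, so it will not be as compact as an array-based table.

```python
import os
import tempfile
import zlib
from collections import defaultdict


def build_index(path):
    """Map a 32-bit hash of each line to the byte offsets where such lines start."""
    index = defaultdict(list)
    with open(path, "rb") as f:
        offset = 0
        for raw in f:
            key = zlib.crc32(raw.rstrip(b"\r\n"))
            index[key].append(offset)
            offset += len(raw)
    return index


def lookup(path, index, line):
    """Return the byte offset of `line` in the file, or -1 if it is absent."""
    needle = line.encode("utf-8")
    with open(path, "rb") as f:
        for offset in index.get(zlib.crc32(needle), ()):
            f.seek(offset)
            # Re-read the candidate line to rule out hash collisions.
            if f.readline().rstrip(b"\r\n") == needle:
                return offset
    return -1


# Hypothetical three-line sample file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"Top/Arts\nTop/Arts/Music\nTop/Science\n")
    path = tmp.name

index = build_index(path)
hit = lookup(path, index, "Top/Arts/Music")
miss = lookup(path, index, "Top/Sports")
os.remove(path)

print(hit, miss)
```

Because the index stores offsets rather than just membership, the original line can be re-read from disk when needed. Adding or removing entries would still require extra bookkeeping that this sketch does not attempt.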

Any comments/suggestions are welcome !

[edit 2] Update with results from methods suggested in answers [edit 3] Update with sqlite results

Solution: Based on all the profiling and feedback, I think I'll go with sqlite, with method 4 as the second alternative. One downside of sqlite is that the database is more than double the size of the original csv file with the urls. This is due to the primary index on url.
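
A minimal sketch of what the sqlite approach could look like with Python's built-in `sqlite3` module is shown below. The table name `paths`, the sample rows, and the `contains` helper are assumptions for illustration; only the `url` column with a primary index comes from the description above. An in-memory database is used here so the example is self-contained; a real setup would pass a file path to `sqlite3.connect`.

```python
import sqlite3

# In-memory database for the example; use a file path for persistent data.
conn = sqlite3.connect(":memory:")

# The PRIMARY KEY gives sqlite an index on url, which makes lookups fast.
conn.execute("CREATE TABLE paths (url TEXT PRIMARY KEY)")

# Hypothetical sample rows standing in for the 6 million lines.
sample = [("Top/Arts",), ("Top/Arts/Music",), ("Top/Science",)]
with conn:  # commits on success, rolls back on error
    conn.executemany("INSERT INTO paths (url) VALUES (?)", sample)


def contains(conn, url):
    """Return True if url is present in the table."""
    row = conn.execute(
        "SELECT 1 FROM paths WHERE url = ? LIMIT 1", (url,)
    ).fetchone()
    return row is not None


# Adding and removing values, as required in the question.
with conn:
    conn.execute("INSERT OR IGNORE INTO paths (url) VALUES (?)", ("Top/Sports",))
    conn.execute("DELETE FROM paths WHERE url = ?", ("Top/Science",))

print(contains(conn, "Top/Arts/Music"))
print(contains(conn, "Top/Science"))
```

With a `TEXT PRIMARY KEY` on an ordinary table, sqlite keeps the url both in the table and in the automatically created index, which is consistent with the size increase noted above. Declaring the table `WITHOUT ROWID` might reduce that duplication, though I have not measured it for this data.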
