Efficient way to aggregate and remove duplicates from very large (password) lists

Problem Description

Context:

  • I am attempting to combine a large number of separate password-list text files into a single file for use in dictionary-based password cracking.

Each text file is line-delimited (a single password per line) and there are 82 separate files at the moment. Most (66) of the files are in the 1-100 MB size range, 12 are 100-700 MB, 3 are 2 GB, and 1 (the most problematic) is 11.2 GB.

In total I estimate 1.75 billion non-unique passwords need processing; of these I estimate ~450 million (~25%) will be duplicates and will ultimately need to be discarded.

I am attempting to do this on a device which has a little over 6 GB of RAM free to play with (i.e. 8 GB total with 2 GB already consumed).

Problem:

I need a way to a) aggregate all of these passwords together and b) remove exact duplicates, within my RAM constraints and within a reasonable time window (~7 days, ideally much less, but really I don't care if it takes weeks if it means I never need to run it again).

I am a competent Python programmer and have thus given it a crack several times already. My most successful attempt used sqlite3 to store processed passwords on the hard disk as it progressed. However, this meant that keeping track of which files had already been completed between processing instances (I cancelled and restarted several times to make changes) was tediously achieved by hashing each completed file and maintaining/comparing the hashes each time a new file was opened. For the very large files, however, any progress would be lost.

I was processing the text files in blocks of ~1 billion lines (at most) at a time, both to prevent memory exhaustion and so that I wasn't left without feedback for extended periods of time. I know that, given enough time, I could populate my database fully: I achieved a DB file size of ~4.5 GB in 24 hours of runtime, so I estimate that, left to run, it would take about 4 days at most to get through everything. But I don't know if/how to read from and write to it most efficiently, nor do I have any good ideas on how to tackle removing the duplicates (do it as I am populating the DB, or make additional passes afterwards...? Is there a much faster means of doing uniqueness lookups in a database configuration I don't know about?).

My request here today is for advice on / solutions for a programming and optimisation approach to building my giant, unique password list (ideally with Python). I am totally open to taking a completely different tack if I am off the mark already.

Two nice-to-haves would be:

  • A way to add more passwords in the future without having to rebuild the whole list; and

  • A database < 20 GB at the end of all this, so that it isn't a huge pain to move around.

Solution

Based on CL's solution, which was ultimately a lot more elegant than what I was thinking of, I came up with a slightly modified method.

Following CL's advice I set up an sqlite3 DB and fed the text files into a Python script which consumed them and then output commands to insert them into the DB. Straight off the bat this *did* work, but it was extremely (infeasibly) slow.

I solved this with a few simple DB optimisations, which were much easier to implement (and frankly cleaner) by doing everything from the core Python script included below, which builds upon CL's skeleton code. The fact that the original code was generating so many I/O operations was causing something funny on my (Win7) OS, resulting in BSODs and lost data. I solved this by making the insertion of a whole password file a single SQL transaction, plus a couple of pragma changes. In the end the code runs at about 30,000 insertions/sec, which is not the best but is certainly acceptable for my purposes.

It may be the case that this will still fail on the largest of files, but if/when that happens I will simply chunk the file down into smaller ~1 GB portions and consume them individually (a rough sketch of this is included after the batch loop below).

import sys
import apsw

i = 0
con = apsw.Connection("passwords_test.db")
cur = con.cursor()

# One row per unique password; WITHOUT ROWID stores rows in the primary-key
# B-tree itself, so there is no separate index holding a second copy.
cur.execute("CREATE TABLE IF NOT EXISTS Passwords(password TEXT PRIMARY KEY) WITHOUT ROWID;")
# Trade durability for speed: keep the journal in RAM and skip fsync calls.
cur.execute("PRAGMA journal_mode = MEMORY;")
cur.execute("PRAGMA synchronous = OFF;")

# Wrap the whole input file in one transaction to avoid per-row commit I/O.
cur.execute("BEGIN TRANSACTION")
for line in sys.stdin:
    # Parameter binding quotes the value for us, so no manual escaping is
    # needed (doubling quotes here would actually corrupt stored passwords).
    password = line.rstrip()
    cur.execute("INSERT OR IGNORE INTO Passwords VALUES(?);", (password,))
    i += 1
    if i % 100000 == 0:  # Simple line counter to show how far through a file we are
        print i

cur.execute("COMMIT")
con.close(True)

This code is then run from the command line:

insert_passwords.py < passwordfile1.txt

And automated by:

for %%f in (*.txt) do (
insert_passwords.py < %%f
)
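
For reference, if the 11.2 GB file does end up needing to be split as mentioned above, a minimal chunking sketch is below. This is just a rough helper of my own (the output naming scheme and the 1 GB threshold are arbitrary placeholders); it never cuts a line in half, so no password is split across two chunks.

def chunk_file(path, chunk_bytes=1024 ** 3):
    # Split a line-delimited file into ~1 GB pieces without breaking a line.
    part = 0
    out = None
    with open(path, "rb") as src:
        for line in src:
            if out is None or out.tell() >= chunk_bytes:
                if out:
                    out.close()
                part += 1
                out = open("%s.part%03d" % (path, part), "wb")
            out.write(line)
    if out:
        out.close()

chunk_file("biggest_password_file.txt")  # placeholder filename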

All in all, the DB file itself is not growing too quickly, the insertion rate is sufficient, I can break/resume operations at the drop of a hat, duplicate values are being accurately discarded, and the current limiting factor is the lookup speed of the DB, not the CPU or disk space.
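
As a final step (my own addition, not part of the workflow above), the finished dictionary can be dumped back out of the database into a flat wordlist file with the same sqlite3 command-line shell, for example:

sqlite3 passwords_test.db "SELECT password FROM Passwords;" > wordlist.txt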

Answer

When storing the passwords in an SQL database, being able to detect duplicates requires an index. This implies that the passwords are stored twice, in the table and in the index.

However, SQLite 3.8.2 or later supports WITHOUT ROWID tables (called "clustered index" or "index-organized tables" in other databases), which avoid the separate index for the primary key.

There is no Python version that already has SQLite 3.8.2 included. If you are not using APSW, you can still use Python to create the SQL commands:

  1. Install the newest sqlite3 command-line shell (download page).
  2. Create a database table:

$ sqlite3 passwords.db
SQLite version 3.8.5 2014-06-02 21:00:34
Enter ".help" for usage hints.
sqlite> CREATE TABLE MyTable(password TEXT PRIMARY KEY) WITHOUT ROWID;
sqlite> .exit

  3. Create a Python script to create the INSERT statements:

    import sys
    print "BEGIN;"
    for line in sys.stdin:
        # Double up single quotes so each password is a valid SQL string literal
        escaped = line.rstrip().replace("'", "''")
        print "INSERT OR IGNORE INTO MyTable VALUES('%s');" % escaped
    print "COMMIT;"
    

    (The INSERT OR IGNORE statement will not insert a row if a duplicate would violate the primary key's unique constraint.)

  4. Insert the passwords by piping the commands into the database shell:

    $ python insert_passwords.py < passwords.txt | sqlite3 passwords.db
    

There is no need to split up the input files; fewer transactions have less overhead.
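
One thing worth noting (my addition, not part of CL's answer): this setup also covers the "add more passwords later" nice-to-have from the question. Loading a future list is just another pipe into the same database, and INSERT OR IGNORE silently drops anything already stored, so the list never needs rebuilding (the filename below is just a placeholder):

    $ python insert_passwords.py < future_password_list.txt | sqlite3 passwords.db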
