How to parse files larger than 100GB in Python?

Problem description

I have text files of about 100 GB in size in the format below (with duplicate domain and ip records, i.e. repeated lines):

domain|ip 
yahoo.com|89.45.3.5
bbc.com|45.67.33.2
yahoo.com|89.45.3.5
myname.com|45.67.33.2
etc.

I am trying to parse them using the following Python code, but I still get a MemoryError. Does anybody know a more optimal way of parsing such files? (Time is an important factor for me.)

from glob import glob

d = {}  # maps each ip to the set of domains seen for it
files = glob(path)
for filename in files:
    print(filename)
    with open(filename) as f:
        for line in f:
            try:
                domain = line.split('|')[0]
                ip = line.split('|')[1].strip('\n')
                if ip in d:
                    d[ip].add(domain)
                else:
                    d[ip] = set([domain])
            except IndexError:
                # malformed line without a '|' separator
                print(line)

    print("this file is finished")

for ip, domains in d.items():
    for domain in domains:
        print("%s|%s" % (ip, domain), file=output)

Recommended answer

Python objects take a little more memory than the same values do on disk; there is some overhead for the reference count, and in sets there is also a cached hash value per element to consider.
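
For a rough sense of that overhead (exact sizes vary with the Python version and platform), sys.getsizeof reports the in-memory size of individual objects; a minimal sketch:

import sys

# Rough illustration only: exact sizes depend on the CPython version and build.
line = 'yahoo.com|89.45.3.5'
print(sys.getsizeof(line))            # a ~19-character str typically takes ~70 bytes
print(sys.getsizeof({'yahoo.com'}))   # a one-element set takes a couple hundred bytes
print(sys.getsizeof({}))              # even an empty dict takes tens of bytes
# None of this counts the outer dict's own hash-table slots, so tens of
# gigabytes of text easily outgrow the available RAM.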

Don't read all those objects into (Python) memory; use a database instead. Python comes with a library for the SQLite database; use that to convert your file to a database. You can then build your output file from that:

import csv
import sqlite3
from itertools import islice

conn = sqlite3.connect('/tmp/ipaddresses.db')
conn.execute('CREATE TABLE IF NOT EXISTS ipaddress (domain, ip)')
conn.execute('''\
    CREATE UNIQUE INDEX IF NOT EXISTS domain_ip_idx
    ON ipaddress(domain, ip)''')

for filename in files:
    print(filename)
    # the csv module expects a text-mode file opened with newline=''
    with open(filename, newline='') as f:
        reader = csv.reader(f, delimiter='|')
        cursor = conn.cursor()
        while True:
            with conn:  # commit each batch as a single transaction
                batch = list(islice(reader, 10000))
                if not batch:
                    break
                cursor.executemany(
                    'INSERT OR IGNORE INTO ipaddress VALUES(?, ?)',
                    batch)

conn.execute('CREATE INDEX IF NOT EXISTS ip_idx ON ipaddress(ip)')
with open(outputfile, 'w', newline='') as outfh:
    writer = csv.writer(outfh, delimiter='|')
    cursor = conn.cursor()
    cursor.execute('SELECT ip, domain FROM ipaddress ORDER BY ip')
    writer.writerows(cursor)

This handles your input data in batches of 10000, and produces an index to sort on after inserting. Producing the index is going to take some time, but it'll all fit in your available memory.
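
The batching relies on the standard itertools.islice idiom: slicing the same iterator repeatedly yields consecutive chunks until it is exhausted. A minimal sketch of just that pattern:

from itertools import islice

numbers = iter(range(7))
while True:
    batch = list(islice(numbers, 3))  # take up to 3 items from the shared iterator
    if not batch:                     # an empty batch means the iterator is exhausted
        break
    print(batch)
# prints [0, 1, 2], then [3, 4, 5], then [6]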

The UNIQUE index created at the start ensures that only unique domain - ip address pairs are inserted (so only unique domains per ip address are tracked); the INSERT OR IGNORE statement skips any pair that is already present in the database.

Short demo with just the sample input you gave:

>>> import sqlite3
>>> import csv
>>> import sys
>>> from itertools import islice
>>> conn = sqlite3.connect('/tmp/ipaddresses.db')
>>> conn.execute('CREATE TABLE IF NOT EXISTS ipaddress (domain, ip)')
<sqlite3.Cursor object at 0x106c62730>
>>> conn.execute('''\
...     CREATE UNIQUE INDEX IF NOT EXISTS domain_ip_idx 
...     ON ipaddress(domain, ip)''')
<sqlite3.Cursor object at 0x106c62960>
>>> reader = csv.reader('''\
... yahoo.com|89.45.3.5
... bbc.com|45.67.33.2
... yahoo.com|89.45.3.5
... myname.com|45.67.33.2
... '''.splitlines(), delimiter='|')
>>> cursor = conn.cursor()
>>> while True:
...     with conn:
...         batch = list(islice(reader, 10000))
...         if not batch:
...             break
...         cursor.executemany(
...             'INSERT OR IGNORE INTO ipaddress VALUES(?, ?)',
...             batch)
... 
<sqlite3.Cursor object at 0x106c62810>
>>> conn.execute('CREATE INDEX IF NOT EXISTS ip_idx ON ipaddress(ip)')
<sqlite3.Cursor object at 0x106c62960>
>>> writer = csv.writer(sys.stdout, delimiter='|')
>>> cursor = conn.cursor()
>>> cursor.execute('SELECT ip, domain from ipaddress order by ip')
<sqlite3.Cursor object at 0x106c627a0>
>>> writer.writerows(cursor)
45.67.33.2|bbc.com
45.67.33.2|myname.com
89.45.3.5|yahoo.com
