Parsing a large (9GB) file using Python


Question

I have a large text file that I need to parse into a pipe delimited text file using python. The file looks like this (basically):

product/productId: D7SDF9S9 
review/userId: asdf9uas0d8u9f 
review/score: 5.0 
review/some text here

product/productId: D39F99 
review/userId: fasd9fasd9f9f 
review/score: 4.1 
review/some text here

Each record is separated by two newline characters (\n\n). I have written a parser, shown below.

with open ("largefile.txt", "r") as myfile:
    fullstr = myfile.read()

allsplits = re.split("\n\n",fullstr)

articles = []

for i,s in enumerate(allsplits[0:]):

        splits = re.split("\n.*?: ",s)
        productId = splits[0]
        userId = splits[1]
        profileName = splits[2]
        helpfulness = splits[3]
        rating = splits[4]
        time = splits[5]
        summary = splits[6]
        text = splits[7]

fw = open(outnamename,'w')
fw.write(productId+"|"+userID+"|"+profileName+"|"+helpfulness+"|"+rating+"|"+time+"|"+summary+"|"+text+"\n")

return 

The problem is that the file I am reading in is so large that I run out of memory before it can complete.
I suspect it's bombing out at the allsplits = re.split("\n\n", fullstr) line.
Can someone let me know of a way to read in just one record at a time, parse it, write it to the output file, and then move on to the next record?

Answer

Don't read the whole file into memory in one go; produce records by making use of those newlines. Write the data with the csv module for ease of writing out your pipe-delimited records.

The following code reads the input file a line at a time, and writes out a CSV row per record as it goes. It never holds more than one line in memory, plus the one record being constructed.

import csv

fields = ('productId', 'userId', 'profileName', 'helpfulness', 'rating', 'time', 'summary', 'text')

with open("largefile.txt", "r") as myfile, open(outnamename,'w', newline='') as fw:
    writer = csv.DictWriter(fw, fields, delimiter='|')

    record = {}
    for line in myfile:
        if not line.strip() and record:
            # empty line is the end of a record
            writer.writerow(record)
            record = {}
            continue

        field, value = line.split(': ', 1)
        record[field.partition('/')[-1].strip()] = value.strip()

    if record:
        # handle last record
        writer.writerow(record)

This code does assume that the file contains text before a colon of the form category/key, so product/productId, review/userId, etc. The part after the slash is used for the CSV columns; the fields list at the top reflects these keys.
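As a small illustration of that key extraction (using a line from the sample data above; the variable names here are just for demonstration):

line = "product/productId: D7SDF9S9\n"

field, value = line.split(': ', 1)        # field is 'product/productId'
key = field.partition('/')[-1].strip()    # key is 'productId'

print(key, value.strip())                 # prints: productId D7SDF9S9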

Alternatively, you can drop the fields list and use a csv.writer, gathering the record values in a list instead:

import csv

with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.writer(fw, delimiter='|')

    record = []
    for line in myfile:
        if not line.strip() and record:
            # empty line is the end of a record
            writer.writerow(record)
            record = []
            continue

        field, value = line.split(': ', 1)
        record.append(value.strip())

    if record:
        # handle last record
        writer.writerow(record)

This version requires that record fields are all present and are written to the file in a fixed order.
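If some records might be missing a field, the DictWriter version above is the more forgiving of the two: csv.DictWriter fills in any key that is absent from the row dict with its restval argument (an empty string by default). A minimal sketch of that behaviour, reusing the same field names (the partial record here is made up for illustration):

import csv
import sys

fields = ('productId', 'userId', 'profileName', 'helpfulness', 'rating', 'time', 'summary', 'text')
writer = csv.DictWriter(sys.stdout, fields, delimiter='|', restval='')

# this record is missing most fields; the absent columns come out empty
writer.writerow({'productId': 'D7SDF9S9', 'userId': 'asdf9uas0d8u9f', 'rating': '5.0'})
# output: D7SDF9S9|asdf9uas0d8u9f|||5.0|||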
