Python MemoryError when reading large files, need ideas to apply multiprocessing in the case below?

Problem description

I have a file which stores data in the following format:

TIME[04.26_12:30:30:853664]ID[ROLL:201987623]MARKS[PHY:100|MATH:200|CHEM:400]
TIME[03.27_12:29:30.553669]ID[ROLL:201987623]MARKS[PHY:100|MATH:1200|CHEM:900]
TIME[03.26_12:28:30.753664]ID[ROLL:2341987623]MARKS[PHY:100|MATH:200|CHEM:400]
TIME[03.26_12:29:30.853664]ID[ROLL:201978623]MARKS[PHY:0|MATH:0|CHEM:40]
TIME[04.27_12:29:30.553664]ID[ROLL:2034287623]MARKS[PHY:100|MATH:200|CHEM:400]

I found the method below to fulfill the need given in this question; please refer to this link for clarification:

import re
from itertools import groupby

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

def func1(arg) -> bool:
    # keep only the lines that match the record format
    return regex.match(arg) is not None


def func2(arg) -> str:
    # sort key: the timestamp string
    match = regex.match(arg)
    if match:
        return match.group(1)
    return ""


def func3(arg) -> int:
    # sort/group key: the roll number
    match = regex.match(arg)
    if match:
        return int(match.group(2))
    return 0


with open(your_input_file) as fr:
    collection = filter(func1, fr)
    # two stable sorts: lines end up ordered by roll, and by timestamp within each roll
    collection = sorted(collection, key=func2)
    collection = sorted(collection, key=func3)
    for key, group in groupby(collection, key=func3):
        with open(f"ROLL_{key}", mode="w") as fw:
            fw.writelines(group)

The above function creates the files according to my wish, and it sorts the file contents according to the timestamps, so I get the correct output. But when I tried it on a large file of about 1.7 GB it gives a MemoryError, so I tried the following approach.

Failed attempt:

from functools import partial

with open('my_file.txt') as fr:
    part_read = partial(fr.read, 1024 * 1024)
    iterator = iter(part_read, b'')
    for index, fra in enumerate(iterator, start=1):
        collection = filter(func1, fra)
        collection = sorted(collection, key=func2)
        collection = sorted(collection, key=func3)
        for key, group in groupby(collection, key=func3):
            fw = open(f'ROLL_{key}.txt', 'a')
            fw.writelines(group)

This attempt doesn't give me any results, meaning no files were created at all, and it takes an unexpectedly long time. I found in many answers that I should read the file line by line, but then how will I sort it? Please suggest improvements to this code, or any new idea. Do I need to use multiprocessing here to process it faster, and if so, how do I use it?

And one main condition for me is that I can't store the data in any data structure, since the file can be huge in size.
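
For reference, here is a minimal sketch of what the "read line by line" idea mentioned above could look like: stream the input once, append each record to its own per-roll file, and only then sort each (much smaller) per-roll file by timestamp. The helper names and the 'my_file.txt' input name are placeholders; the small set of roll numbers is the only thing kept in memory.

import re

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

def roll_of(line):
    # roll number of a record, or None if the line is not a record
    m = regex.match(line)
    return m.group(2) if m else None

def time_of(line):
    # timestamp string of a record, used only for the per-roll sort
    m = regex.match(line)
    return m.group(1) if m else ""

# Pass 1: stream the input line by line; only one line is in memory at a time.
rolls = set()                      # holds only the distinct roll numbers, which stays tiny
with open('my_file.txt') as fr:    # placeholder input name
    for line in fr:
        roll = roll_of(line)
        if roll is None:
            continue
        rolls.add(roll)
        with open(f'ROLL_{roll}.txt', 'a') as fw:
            fw.write(line)

# Pass 2: each per-roll file is far smaller than the full input,
# so sorting it by timestamp in memory is usually affordable.
for roll in rolls:
    with open(f'ROLL_{roll}.txt') as fh:
        records = sorted(fh, key=time_of)
    with open(f'ROLL_{roll}.txt', 'w') as fw:
        fw.writelines(records)

Opening the per-roll file once per input line is slow; keeping a small dictionary of open file handles would speed it up, at the cost of a few open descriptors.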

Recommended answer

And if you want to read the file chunk by chunk, use this:

import re
from itertools import groupby
from typing import Tuple

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

def func1(arg) -> bool:
    # keep only the lines that match the record format
    return regex.match(arg) is not None


def func2(arg) -> Tuple[int, str]:
    # sort key: roll first, then timestamp, so that groupby (which only
    # groups consecutive items) sees one contiguous run per roll
    match = regex.match(arg)
    if match:
        return int(match.group(2)), match.group(1)
    return 0, ""


def func3(arg) -> int:
    # group key: the roll number
    match = regex.match(arg)
    if match:
        return int(match.group(2))
    return 0


def read_in_chunks(file_object, chunk_size=1024 * 1024):
    # lazily yield the file in chunks of chunk_size characters
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


with open('b.txt') as fr:
    for chunk in read_in_chunks(fr):
        collection = filter(func1, chunk.splitlines(keepends=True))
        collection = sorted(collection, key=func2)
        for key, group in groupby(collection, key=func3):
            # append, because the same roll can show up in several chunks
            with open(f"ROLL_{key}", mode="a") as fw:
                fw.writelines(group)
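
One caveat with fixed-size chunks, as a note rather than part of the answer above: read() will usually stop in the middle of a line, so the record at each chunk boundary is split in two and silently dropped by the regex filter. A possible sketch of a generator that still reads in large chunks but only yields complete lines, by carrying the partial trailing line over to the next chunk (the generator name and the leftover variable are my own additions):

def read_in_chunks_of_lines(file_object, chunk_size=1024 * 1024):
    # like read_in_chunks, but never splits a line across two chunks
    leftover = ""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        data = leftover + data
        # hold back everything after the last newline for the next round
        cut = data.rfind("\n") + 1
        if cut == 0:           # no newline at all in this chunk
            leftover = data
            continue
        leftover = data[cut:]
        yield data[:cut]
    if leftover:
        yield leftover

Used as a drop-in replacement for read_in_chunks in the loop above, every chunk then contains only whole records, assuming each record occupies exactly one line as in the sample data.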
