Python MemoryError when reading large files, need ideas to apply multiprocessing in the case below?

Problem description

I have a file which stores data in the following format:

TIME[04.26_12:30:30:853664]ID[ROLL:201987623]MARKS[PHY:100|MATH:200|CHEM:400]
TIME[03.27_12:29:30.553669]ID[ROLL:201987623]MARKS[PHY:100|MATH:1200|CHEM:900]
TIME[03.26_12:28:30.753664]ID[ROLL:2341987623]MARKS[PHY:100|MATH:200|CHEM:400]
TIME[03.26_12:29:30.853664]ID[ROLL:201978623]MARKS[PHY:0|MATH:0|CHEM:40]
TIME[04.27_12:29:30.553664]ID[ROLL:2034287623]MARKS[PHY:100|MATH:200|CHEM:400]

I found the method below to fulfill the need given in this question; please refer to this link for clarification:

import re
from itertools import groupby

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

def func1(arg) -> bool:
    # keep only the lines that match the record format
    return regex.match(arg) is not None


def func2(arg) -> str:
    # sort key: the timestamp string
    match = regex.match(arg)
    if match:
        return match.group(1)
    return ""


def func3(arg) -> int:
    # sort/group key: the roll number
    match = regex.match(arg)
    if match:
        return int(match.group(2))
    return 0


with open(your_input_file) as fr:
    collection = filter(func1, fr)
    # two stable sorts: lines end up ordered by roll, and by timestamp within each roll
    collection = sorted(collection, key=func2)
    collection = sorted(collection, key=func3)
    for key, group in groupby(collection, key=func3):
        with open(f"ROLL_{key}", mode="w") as fw:
            fw.writelines(group)

The above function creates the files according to my wish, and it sorts the file contents according to the timestamps, so I get the correct output. But when I tried it on a large file of about 1.7 GB it gives a MemoryError, so I tried the following approach.

Failed attempt:

from functools import partial

with open('my_file.txt') as fr:
    part_read = partial(fr.read, 1024 * 1024)
    iterator = iter(part_read, b'')
    for index, fra in enumerate(iterator, start=1):
        collection = filter(func1, fra)
        collection = sorted(collection, key=func2)
        collection = sorted(collection, key=func3)
        for key, group in groupby(collection, key=func3):
            fw = open(f'ROLL_{key}.txt', 'a')
            fw.writelines(group)

This attempt doesn't give me any results, meaning no files were created at all, and it takes an unexpectedly long time. I found in many answers that I should read the file line by line, but then how will I sort it? Please suggest improvements to this code, or any new idea. Do I need to use multiprocessing here to process it faster, and if so, how do I use it?

And one main condition for me is that I can't store the data in any data structure, since the file can be huge in size.
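
For reference, here is a minimal sketch of what the "read line by line" idea mentioned above could look like: stream the input once, append each record to its own per-roll file, and only then sort each (much smaller) per-roll file by timestamp. The helper names and the 'my_file.txt' input name are placeholders; the small set of roll numbers is the only thing kept in memory.

import re

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

def roll_of(line):
    # roll number of a record, or None if the line is not a record
    m = regex.match(line)
    return m.group(2) if m else None

def time_of(line):
    # timestamp string of a record, used only for the per-roll sort
    m = regex.match(line)
    return m.group(1) if m else ""

# Pass 1: stream the input line by line; only one line is in memory at a time.
rolls = set()                      # holds only the distinct roll numbers, which stays tiny
with open('my_file.txt') as fr:    # placeholder input name
    for line in fr:
        roll = roll_of(line)
        if roll is None:
            continue
        rolls.add(roll)
        with open(f'ROLL_{roll}.txt', 'a') as fw:
            fw.write(line)

# Pass 2: each per-roll file is far smaller than the full input,
# so sorting it by timestamp in memory is usually affordable.
for roll in rolls:
    with open(f'ROLL_{roll}.txt') as fh:
        records = sorted(fh, key=time_of)
    with open(f'ROLL_{roll}.txt', 'w') as fw:
        fw.writelines(records)

Opening the per-roll file once per input line is slow; keeping a small dictionary of open file handles would speed it up, at the cost of a few open descriptors.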

Recommended answer

And if you want to read the file chunk by chunk, use this:

import re
from itertools import groupby
from typing import Tuple

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

def func1(arg) -> bool:
    # keep only the lines that match the record format
    return regex.match(arg) is not None


def func2(arg) -> Tuple[int, str]:
    # sort key: roll first, then timestamp, so that groupby (which only
    # groups consecutive items) sees one contiguous run per roll
    match = regex.match(arg)
    if match:
        return int(match.group(2)), match.group(1)
    return 0, ""


def func3(arg) -> int:
    # group key: the roll number
    match = regex.match(arg)
    if match:
        return int(match.group(2))
    return 0


def read_in_chunks(file_object, chunk_size=1024 * 1024):
    # lazily yield the file in chunks of chunk_size characters
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


with open('b.txt') as fr:
    for chunk in read_in_chunks(fr):
        collection = filter(func1, chunk.splitlines(keepends=True))
        collection = sorted(collection, key=func2)
        for key, group in groupby(collection, key=func3):
            # append, because the same roll can show up in several chunks
            with open(f"ROLL_{key}", mode="a") as fw:
                fw.writelines(group)
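
One caveat with fixed-size chunks, as a note rather than part of the answer above: read() will usually stop in the middle of a line, so the record at each chunk boundary is split in two and silently dropped by the regex filter. A possible sketch of a generator that still reads in large chunks but only yields complete lines, by carrying the partial trailing line over to the next chunk (the generator name and the leftover variable are my own additions):

def read_in_chunks_of_lines(file_object, chunk_size=1024 * 1024):
    # like read_in_chunks, but never splits a line across two chunks
    leftover = ""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        data = leftover + data
        # hold back everything after the last newline for the next round
        cut = data.rfind("\n") + 1
        if cut == 0:           # no newline at all in this chunk
            leftover = data
            continue
        leftover = data[cut:]
        yield data[:cut]
    if leftover:
        yield leftover

Used as a drop-in replacement for read_in_chunks in the loop above, every chunk then contains only whole records, assuming each record occupies exactly one line as in the sample data.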
