Repeated insertions into sqlite database via sqlalchemy causing memory leak?


Question

When inserting a huge pandas dataframe into sqlite via sqlalchemy and pandas to_sql with a specified chunksize, I would get memory errors.

At first I thought it was an issue with to_sql, but I tried a workaround where, instead of using chunksize, I used for i in range(100): df.iloc[i*100000:(i+1)*100000].to_sql(...), and that still resulted in an error.

It seems that, under certain conditions, there is a memory leak with repeated insertions into sqlite via sqlalchemy.

I had a hard time replicating, through a minimal example, the memory leak that occurred when converting my data, but the following gets pretty close.

import string
import numpy as np
import pandas as pd
from random import randint

def make_random_str_array(size=10, num_rows=100, chars=string.ascii_uppercase + string.digits):
    # Draw num_rows*size random characters and reinterpret the buffer as
    # num_rows fixed-width strings of length `size`.
    return (np.random.choice(list(chars), num_rows*size)
            .view('|U{}'.format(size)))

def alt(size, num_rows):
    # Two columns of random strings, each string of length `size`.
    data = make_random_str_array(size, num_rows=2*num_rows).reshape(-1, 2)
    return pd.DataFrame(data)

dfAll = alt(randint(1000, 2000), 10000)

# Grow dfAll to roughly 3.3 million rows of long random strings.
for i in range(330):
    print('step ', i)
    df = alt(randint(1000, 2000), 10000)
    dfAll = pd.concat([df, dfAll])

import sqlalchemy

engine = sqlalchemy.create_engine('sqlite:///testtt.db')

# Repeatedly append 10,000-row slices of dfAll to the sqlite database;
# memory usage creeps up until the instance runs out of RAM.
for i in range(500):
    print('step', i)
    dfAll.iloc[(i%330)*10000:((i%330)+1)*10000].to_sql('test_table22', engine, index=False, if_exists='append')

This was run in a Google Colab CPU environment.

The database itself isn't causing the memory leak: I can restart my environment, the previously inserted data is still there, and connecting to that database doesn't cause an increase in memory. The issue seems to occur, under certain conditions, with repeated insertions via to_sql in a loop or via a single to_sql call with chunksize specified.

Is there a way that this code could be run without causing an eventual increase in memory usage?

To fully reproduce the error, run this notebook

https://drive.google.com/open?id=1ZijvI1jU66xOHkcmERO4wMwe-9HpT5OS

The notebook requires you to import this folder into the main directory of your Google Drive

https://drive.google.com/open?id=1m6JfoIEIcX74CFSIQArZmSd0A8d0IRG8

The notebook will also mount your Google Drive, and you need to give it authorization to access your Google Drive. Since the data is hosted on my Google Drive, importing the data should not take up any of your allocated storage.

Answer

The Google Colab instance starts with about 12.72GB of RAM available. After creating the DataFrame theBigList, about 9.99GB of RAM has been used. This is already a rather uncomfortable situation to be in, since it is not unusual for Pandas operations to require as much additional space as the DataFrame they operate on. So we should strive to avoid using even this much RAM if possible, and fortunately there is an easy way to do this: simply load each .npy file and store its data in the sqlite database one at a time, without ever creating theBigList (see below).

However, if we use the code you posted, we can see that the RAM usage slowly increases as chunks of theBigList are stored in the database iteratively.

theBigList DataFrame stores the strings in a NumPy array. But in the process of transferring the strings to the sqlite database, the NumPy strings are converted into Python strings. This takes additional memory.
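A rough way to see that overhead (a sketch only, not part of the original answer; exact numbers depend on the Python build and the string contents):

import sys
import numpy as np

# 1000 random fixed-width strings of length 10, stored in one NumPy buffer
# (the same construction make_random_str_array in the question uses).
arr = np.random.choice(list('ABC123'), 10 * 1000).view('|U10')

as_numpy = arr.nbytes                                    # one contiguous block
as_python = sum(sys.getsizeof(s) for s in arr.tolist())  # one object per string

print('NumPy buffer      : {} bytes'.format(as_numpy))
print('Python str objects: {} bytes'.format(as_python))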

Per this Theano tutorial, which discusses Python's internal memory management:

To speed-up memory allocation (and reuse) Python uses a number of lists for small objects. Each list will contain objects of similar size: there will be a list for objects 1 to 8 bytes in size, one for 9 to 16, etc. When a small object needs to be created, either we reuse a free block in the list, or we allocate a new one.

... The important point is that those lists never shrink.

Indeed: if an item (of size x) is deallocated (freed by lack of reference) its location is not returned to Python’s global memory pool (and even less to the system), but merely marked as free and added to the free list of items of size x. The dead object’s location will be reused if another object of compatible size is needed. If there are no dead objects available, new ones are created.

If small objects memory is never freed, then the inescapable conclusion is that, like goldfishes, these small object lists only keep growing, never shrinking, and that the memory footprint of your application is dominated by the largest number of small objects allocated at any given point.

I believe this accurately describes the behavior you are seeing as this loop executes:

for i in range(0, 588):
    theBigList.iloc[i*10000:(i+1)*10000].to_sql(
        'CS_table', engine, index=False, if_exists='append')

Even though many dead objects' locations are reused for new strings, with essentially random strings such as those in theBigList it is not implausible that extra space will occasionally be needed, and so the memory footprint keeps growing.

The process eventually hits Google Colab's 12.72GB RAM limit and the kernel is killed with a memory error.
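To watch that growth before the kernel dies, one could instrument the same loop with the report_memory helper used further below (a sketch only; it assumes theBigList and engine are already defined as in the notebook):

import matplotlib.cbook as mc

for i in range(0, 588):
    theBigList.iloc[i*10000:(i+1)*10000].to_sql(
        'CS_table', engine, index=False, if_exists='append')
    if i % 50 == 0:
        # Print the process's resident memory every 50 chunks; the number
        # keeps climbing even though each chunk is the same size.
        print('chunk {}: {}'.format(i, mc.report_memory()))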

In this case, the easiest way to avoid large memory usage is to never instantiate the entire DataFrame -- instead, just load and process small chunks of the DataFrame one at a time:

import numpy as np
import pandas as pd
import matplotlib.cbook as mc
import sqlalchemy as SA

def load_and_store(dbpath):
    engine = SA.create_engine("sqlite:///{}".format(dbpath))
    for i in range(0, 47):
        # Report the process's current memory usage at each step.
        print('step {}: {}'.format(i, mc.report_memory()))
        for letter in list('ABCDEF'):
            # Load one .npy file at a time so the full dataset is never
            # held in memory at once.
            path = '/content/gdrive/My Drive/SummarizationTempData/CS2Part{}{:02}.npy'.format(letter, i)
            comb = np.load(path, allow_pickle=True)
            toPD = pd.DataFrame(comb).drop([0, 2, 3], 1).astype(str)
            toPD.columns = ['title', 'abstract']
            toPD = toPD.loc[toPD['abstract'] != '']
            toPD.to_sql('CS_table', engine, index=False, if_exists='append')

dbpath = '/content/gdrive/My Drive/dbfile/CSSummaries.db'
load_and_store(dbpath)

which prints

step 0: 132545
step 1: 176983
step 2: 178967
step 3: 181527
...         
step 43: 190551
step 44: 190423
step 45: 190103
step 46: 190551

The last number on each line is the amount of memory consumed by the process as reported by matplotlib.cbook.report_memory. There are a number of different measures of memory usage. On Linux, mc.report_memory() is reporting the size of the physical pages of the core image of the process (including text, data, and stack space).
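If your matplotlib version no longer ships cbook.report_memory (it has been deprecated in newer releases), a rough psutil-based stand-in (an assumption on my part, not something used in the original answer) could look like this:

import psutil

def report_memory_mb():
    # Resident set size of the current process, in megabytes; a rough
    # substitute for matplotlib.cbook.report_memory in this script.
    return psutil.Process().memory_info().rss // (1024 * 1024)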

By the way, another basic trick you can use to manage memory is to use functions. Local variables inside a function are deallocated when the function terminates, which relieves you of the burden of manually calling del and gc.collect().
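A minimal illustration of that pattern (hypothetical names, following the same shape as load_and_store above):

import pandas as pd
import sqlalchemy as SA

def store_chunk(df_chunk, engine):
    # df_chunk and any temporaries created here are local to this function,
    # so they lose their last reference as soon as it returns and can be
    # reclaimed without an explicit del or gc.collect() in the caller.
    cleaned = df_chunk.astype(str)
    cleaned.to_sql('CS_table', engine, index=False, if_exists='append')

# Hypothetical usage: call the function once per chunk instead of keeping
# the intermediate DataFrames alive at module level.
engine = SA.create_engine('sqlite:///example.db')
store_chunk(pd.DataFrame({'title': ['t'], 'abstract': ['a']}), engine)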
