Reduce memory usage of this Pandas code reading JSON file and pickling

Problem Description

I can't figure out a way to reduce memory usage for this program further. Basically, I'm reading from JSON log files into a pandas dataframe, but:

  1. the list append function is what is causing the issue. It creates two different objects in memory, causing huge memory usage.
  2. .to_pickle method of pandas is also a huge memory hog, because the biggest spike in memory is when writing to the pickle.

Here is my most efficient implementation to date:

import glob
import json

import pandas as pd

columns = ['eventName', 'sessionId', "eventTime", "items", "currentPage", "browserType"]
df = pd.DataFrame(columns=columns)
l = []

for i, file in enumerate(glob.glob("*.log")):
    print("Going through log file #%s named %s..." % (i+1, file))
    with open(file) as myfile:
        # parse each line of the log file as a JSON object
        l += [json.loads(line) for line in myfile]
        tempdata = pd.DataFrame(l)
        # drop any columns that are not in the whitelist above
        for column in tempdata.columns:
            if column not in columns:
                try:
                    tempdata.drop(column, axis=1, inplace=True)
                except ValueError:
                    print("oh no! We've got a problem with %s column! It don't exist!" % (column))
        l = []
        # DataFrame.append copies the whole frame each time; this is the step the question is about
        df = df.append(tempdata, ignore_index=True)
        # very slow version, but is most memory efficient
        # length = len(df)
        # length_temp = len(tempdata)
        # for i in range(1, length_temp):
        #     update_progress((i*100.0)/length_temp)
        #     for column in columns:
        #         df.at[length+i, column] = tempdata.at[i, column]
        tempdata = 0

print("Data Frame initialized and filled! Now Sorting...")
df.sort_values(by=["sessionId", "eventTime"], inplace=True)
print("Done Sorting... Changing indices...")
df.index = range(1, len(df)+1)
print("Storing in Pickles...")
df.to_pickle('data.pkl')

Is there an easy way to reduce memory? The commented code does the job but takes 100-1000x longer. I'm currently at 45% memory usage at max during the .to_pickle part, 30% during the reading of the logs. But the more logs there are, the higher that number goes.

Recommended Answer

This answer addresses general pandas DataFrame memory usage optimization:

  1. Pandas loads string columns with the object dtype by default. For every column that has the object dtype, try assigning it the category dtype instead, for example by passing a dictionary to the dtype parameter of read_csv. Memory usage decreases dramatically for columns where 50% or fewer of the values are unique. (A short sketch follows after this list.)

  2. Pandas reads numeric columns in as float64 by default. Use pd.to_numeric to downcast the float64 type to 32 or 16 bits where possible. This again saves you memory. (See the second sketch below.)

  3. Load CSV data in chunk by chunk. Process one chunk, then move on to the next. This can be done by passing a value to the chunksize parameter of read_csv. (See the third sketch below.)
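
A minimal sketch of point 1, assuming a hypothetical events.csv file and treating the 50%-unique figure as a rough heuristic (the file name is illustrative; the column names in the commented line are taken from the question's log schema):

import pandas as pd

# Hypothetical file; the same idea applies to a frame built from JSON lines.
df = pd.read_csv("events.csv")

# Convert low-cardinality object (string) columns to the 'category' dtype.
for col in df.select_dtypes(include="object").columns:
    if df[col].nunique() / len(df) <= 0.5:   # rough 50%-unique heuristic
        df[col] = df[col].astype("category")

# Or declare the dtype up front while reading:
# df = pd.read_csv("events.csv", dtype={"browserType": "category", "eventName": "category"})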
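
A minimal sketch of point 2, using a small made-up frame to show pd.to_numeric with the downcast option:

import pandas as pd

# Made-up data just to illustrate downcasting.
df = pd.DataFrame({"price": [19.99, 5.25, 3.10], "quantity": [1, 2, 3]})

for col in df.select_dtypes(include="float").columns:
    df[col] = pd.to_numeric(df[col], downcast="float")     # float64 -> float32 where values fit

for col in df.select_dtypes(include="integer").columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")   # int64 -> int8/int16/int32 where values fit

print(df.dtypes)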
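
A minimal sketch of point 3, assuming a hypothetical big_log.csv; the selected column names are the ones used in the question:

import pandas as pd

pieces = []
# Read the file 100,000 rows at a time instead of all at once.
for chunk in pd.read_csv("big_log.csv", chunksize=100_000):
    # Keep only the columns of interest before holding on to the chunk.
    pieces.append(chunk[["sessionId", "eventTime", "eventName"]])

df = pd.concat(pieces, ignore_index=True)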
