Avoid reloading data each time you want to work on it with Python


Problem description

I have a large dataset that never changes (I never modify it directly). I start by reading it with pandas

import pandas
dataset = pandas.read_csv(filepath)

and then I do some data analysis. The initial file loading takes about 10 seconds and I am currently re-running it every time I change something in the data analysis part. How can I load the data once and for all and run only the analysis part?

Recommended answer

The answer to this depends a bit on details you haven't shared. Probably the best approach is going to involve serializing the final data structure which you are building.

Create a method which reads in the CSV and builds whatever data structure you're interested in. Once constructed, write the structure out using pickle. Then, unpack the data structure from the pickle when your program loads.
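A minimal sketch of that approach, assuming the structure being built is the pandas DataFrame from the question; the names CSV_PATH, CACHE_PATH, and load_dataset are illustrative, not part of the original answer:

import os
import pickle

import pandas

CSV_PATH = "data.csv"        # hypothetical path to the large, unchanging CSV
CACHE_PATH = "data.pickle"   # hypothetical path for the serialized structure

def load_dataset():
    """Load the dataset, parsing the CSV only on the first run."""
    if os.path.exists(CACHE_PATH):
        # Fast path: unpack the structure serialized on an earlier run.
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    # Slow path: parse the CSV once, then cache the result for next time.
    dataset = pandas.read_csv(CSV_PATH)
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(dataset, f)
    return dataset

dataset = load_dataset()
# ... run the analysis on `dataset` here ...

For a DataFrame specifically, pandas.DataFrame.to_pickle and pandas.read_pickle wrap the same mechanism in a single call each.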

I'm assuming the time-consuming part here is that you are bringing the program up over and over again. If the program is staying up, then you should just be saving the data structure in a centralized location in active memory. The naive approach here would be a global, which you should not do; I mention it only for conceptual purposes.
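If the program does stay up, a tidier alternative to a bare global is a module-level cached loader; the following is a sketch using functools.lru_cache, with get_dataset as an illustrative name rather than anything from the original answer:

import functools

import pandas

@functools.lru_cache(maxsize=1)
def get_dataset(filepath="data.csv"):  # "data.csv" is a placeholder path
    """Parse the CSV once per process; later calls return the cached DataFrame."""
    return pandas.read_csv(filepath)

Every part of the analysis can then call get_dataset(), and only the first call in the process pays the roughly 10-second parse.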

