Avoid reloading data each time you want to work on it with Python
Question
I have a large dataset that never changes (I never modify it directly). I start by reading it with pandas
dataset = pandas.read_csv(filepath)
and then I do some data analysis. The initial file loading takes about 10 seconds and I am currently re-running it every time I change something in the data analysis part. How can I load the data once and for all and run only the analysis part?
Answer
The answer to this depends a bit on details you haven't shared. Probably the best approach is going to involve serializing the final data structure which you are building.
Create a method which reads in the csv and builds whatever data structure you're interested in. Once constructed, write the structure out with pickle. Then, when your program starts, unpack the data structure from the pickle instead of re-parsing the csv.
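A minimal sketch of that caching pattern might look like the following. The file names (`dataset.csv`, `dataset.pkl`) and the helper name `load_dataset` are illustrative assumptions, not anything from the question:

```python
import os
import pickle

import pandas as pd

# Hypothetical paths -- substitute your own.
CSV_PATH = "dataset.csv"
CACHE_PATH = "dataset.pkl"

def load_dataset():
    """Return the dataset, parsing the CSV only if no cache exists.

    The first run pays the full read_csv cost and writes a pickle;
    every later run deserializes the pickle, which is much faster.
    """
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    dataset = pd.read_csv(CSV_PATH)
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(dataset, f)
    return dataset
```

If the structure you cache is just the DataFrame itself, pandas' built-in `DataFrame.to_pickle` / `pandas.read_pickle` do the same job in one call each. Remember to delete the cache file whenever the source csv changes.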
I'm assuming the time-consuming part here is that you are starting the program over and over again. If the program stays running, then you should just keep the data structure in a centralized location in active memory. The naive approach would be a global variable, which you should not use in practice; I mention it only for conceptual purposes.
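A tidier way to get that "load once, reuse everywhere" behaviour without a raw global is a memoized loader. This is a sketch assuming the same `dataset.csv` file name as above; `functools.lru_cache` simply caches the return value inside the running process:

```python
from functools import lru_cache

import pandas as pd

@lru_cache(maxsize=1)
def get_dataset(filepath="dataset.csv"):
    """Parse the CSV on the first call only; every later call with the
    same filepath returns the already-loaded DataFrame from memory."""
    return pd.read_csv(filepath)
```

Note this only helps while the process is alive; once it exits, the cache is gone, which is why the pickle approach above is the better fit if you restart the script for every analysis change.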