Is it good practice to load large objects in one file and import them in other Python files?


Question

I recently realized, in a Python project, that I need to load a number of objects that are quite significant in size (perhaps gigabytes...)

It seems I have two options:

  1. Load them in my top-level file and pass them on to the functions that need them. But that could easily get annoyingly tiresome if many functions need access to many large objects.

  2. Load all large objects in one file and import those needed, on demand, in any Python file across the project. This seems neat enough. I would have something like this in a large_objects.py file:

huge_file_1 = load_huge_file_1()
huge_file_2 = load_huge_file_2()
huge_file_3 = load_huge_file_3()

Now, I'm not sure whether the second approach is efficient enough. For example, will the Python interpreter execute load_huge_file_i() every time something is imported?

Also, do you have any other suggestions?

Answer

A couple of considerations. Firstly, anything that sits flat at the top level of a module is executed when the module is imported, which usually happens immediately. So I often do my loading upfront in my files. But it doesn't have to be in one location; you can do this in many different files. For example:

Slow:

from nltk.corpus import words

def check_word(word):
    return word in words.words()    # gets executed on every call

Fast:

from nltk.corpus import words
WORDS = frozenset(words.words())   # Only gets executed once
def check_word(word):
    return word in WORDS
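As for the question in the post: no, the module body is not re-executed on every import. Python caches each module in sys.modules after the first import, so a large_objects.py file is loaded exactly once per process. A small sketch (the module name and its contents are hypothetical stand-ins for the loaders described above):

```python
import sys
import tempfile
import textwrap
from pathlib import Path

# Create a hypothetical large_objects module in a temp directory
# (a stand-in for the file described in the question).
tmp = Path(tempfile.mkdtemp())
(tmp / "large_objects.py").write_text(textwrap.dedent("""\
    print("loading huge objects...")   # runs only on first import
    huge_object = list(range(1000))    # stand-in for load_huge_file_1()
"""))
sys.path.insert(0, str(tmp))

import large_objects                   # module body executes here, once
first = large_objects.huge_object

import large_objects                   # served from the sys.modules cache
assert large_objects.huge_object is first   # same object, not reloaded
```

The second import statement only looks up the already-initialized module object, which is why the expensive loading pattern in the question is safe to import from many files.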

Another consideration, however, is memory. If you load a lot of objects into memory early on, you will carry them around even when they aren't being used. Python's memory management is based on reference counting (with a supplementary garbage collector only for reference cycles), which means that as long as any reference to an object remains, it stays in memory. Consider:

# Inefficient memory usage
import pandas as pd
df_1 = pd.read_csv('first.csv')
df_2 = pd.read_csv('second.csv')
do_something(df_1)
do_something(df_2)
df_1.to_csv('output1.csv')
df_2.to_csv('output2.csv')

# Efficient memory usage
df_1 = pd.read_csv('first.csv')
do_something(df_1)
df_1.to_csv('output1.csv')
del df_1                          # last reference dropped: df_1 can be freed
df_2 = pd.read_csv('second.csv')  # only df_2 in memory now
do_something(df_2)
df_2.to_csv('output2.csv')

In other words, a key question to ask yourself is "how much of my data needs to be in memory at the same time, and how far can I break the problem down into independent sub-problems?" You can then batch your processing to be memory-efficient.
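The batching idea can be sketched with pandas' chunked reading: instead of loading a whole file, process it a few rows at a time so only one chunk is ever in memory. The sample data here is a small in-memory stand-in for a file too large to load at once:

```python
import io
import pandas as pd

# In-memory sample standing in for a CSV too large to load at once.
csv_data = io.StringIO("value\n1\n2\n3\n4\n5\n6\n")

total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):   # at most 2 rows held at a time
    total += int(chunk["value"].sum())             # process, then let the chunk go

print(total)   # prints 21
```

Each chunk is a regular DataFrame; once the loop moves on, the previous chunk has no remaining reference and its memory can be reclaimed.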

Loading large objects in one location is fine, but I don't think it's necessary for efficiency. You just want to make sure you aren't loading anything repeatedly in a loop (for example, executing the same SQL query over and over). Put the load at module level, or, if you are using classes, do it at initialization.
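The class-based variant of this advice looks like the following sketch; the frozenset is a hypothetical stand-in for whatever expensive load the class actually needs:

```python
class WordChecker:
    """Loads its data once, at initialization, never in the hot path."""

    def __init__(self):
        # Stand-in for an expensive load (large file, SQL query, ...).
        self._words = frozenset(["alpha", "beta", "gamma"])

    def check(self, word):
        return word in self._words     # no reloading on each call

checker = WordChecker()                # the load happens exactly once, here
print(checker.check("alpha"))          # True
print(checker.check("delta"))          # False
```

Every instance pays the loading cost once, and each method call afterwards only does the cheap lookup.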

My usual pipeline has an input/output layer, a transform layer, and a controller. The controller usually loads files early on and then forwards them to be processed. But the work I do is usually synchronous, so I can easily have the controller know what's going on. If you have an asynchronous project, then, as you say, that pipeline might get out of hand. Having one file that handles loading is fairly common. But it might not even be necessary; as explained above, just make sure the load happens at module level or at initialization of a class.
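One more option, if some large objects may never be needed in a given run: load lazily on first use and cache the result, rather than at import time. A sketch using functools.lru_cache, with a hypothetical stand-in loader:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_huge_object():
    # Stand-in loader; the body runs only on the first call,
    # after which the cached result is returned.
    return frozenset(range(1_000_000))

def check_value(value):
    return value in get_huge_object()  # cheap after the first call

print(check_value(42))       # True
print(check_value(-1))       # False
assert get_huge_object() is get_huge_object()   # cached: same object
```

This combines the best of both approaches in the question: any file can import get_huge_object, nothing is loaded until it is actually needed, and the load still happens only once per process.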

