How to avoid loading a large file into a Python script repeatedly?


Problem description

I've written a python script to take a large file (a matrix ~50k rows X ~500 cols) and use it as a dataset to train a random forest model.

My script has two functions, one to load the dataset and the other to train the random forest model using said data. These both work fine, but the file upload takes ~45 seconds and it's a pain to do this every time I want to train a subtly different model (testing many models on the same dataset). Here is the file upload code:

import io
import numpy as np

def load_train_data(train_file):
    # Read in the training file
    train_f = io.open(train_file)
    train_id_list = []
    train_val_list = []
    for line in train_f:
        list_line = line.strip().split("\t")
        if list_line[0] != "Domain":  # skip the header row
            train_identifier = list_line[9]
            train_values = list_line[12:]
            train_id_list.append(train_identifier)
            train_val_float = [float(x) for x in train_values]
            train_val_list.append(train_val_float)
    train_f.close()
    train_val_array = np.asarray(train_val_list)

    return train_id_list, train_val_array

This returns the identifiers from col. 9 as labels, plus a numpy array of cols. 12 to the end as the data to train the random forest.

I am going to train many different forms of my model with the same data, so I just want to upload the file one time and have it available to feed into my random forest function. I want the file to be an object I think (I am fairly new to python).

Answer

If I understand you correctly, the data set does not change, but the model parameters do, and you are changing them after each run.

I would put the file-loading code in one file and run it in the Python interpreter. The data then loads once and stays in memory, bound to whatever variables you use.

Then you can import another file containing your model code and run it with the training data as an argument.

If all your model changes can be expressed as parameters in a function call, all you need is to import your model and then call the training function with different parameter settings.
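The workflow above can be sketched as a small runnable example. Everything here is made up for illustration: the file contents mimic the question's shape (header row starting with "Domain", identifier in column 9, float values from column 12 on), and train_model is a hypothetical stand-in for the real random-forest training call.

```python
import os
import tempfile

def load_train_data(train_file):
    # Same parsing logic as the question's loader, returning plain lists
    train_id_list, train_val_list = [], []
    with open(train_file) as train_f:
        for line in train_f:
            list_line = line.strip().split("\t")
            if list_line[0] != "Domain":  # skip the header row
                train_id_list.append(list_line[9])
                train_val_list.append([float(x) for x in list_line[12:]])
    return train_id_list, train_val_list

def train_model(ids, vals, n_estimators):
    # Hypothetical stand-in for the real random-forest training function
    return {"n_estimators": n_estimators, "rows": len(ids)}

# Build a tiny two-line demo file: one header, one 14-column data row
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "train.txt")
    with open(path, "w") as f:
        f.write("\t".join(["Domain"] + ["h"] * 13) + "\n")
        f.write("\t".join(["x"] * 9 + ["id1", "x", "x", "1.0", "2.0"]) + "\n")
    ids, vals = load_train_data(path)  # the slow load happens exactly once
    # Reuse the in-memory data for several differently parameterized models
    models = [train_model(ids, vals, n) for n in (10, 100, 500)]
```

The key point is that `load_train_data` runs once, and every subsequent `train_model` call reuses the in-memory `ids` and `vals` rather than re-reading the file.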

If you need to change the model code between runs, save the modified code under a new filename, import that module, and run it again with the same source data.

If you don't want to save each model modification under a new filename, you might be able to use the reload functionality depending on your Python version, but it is not recommended (see Proper way to reload a python module from the console).
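A minimal sketch of that reload approach, using Python 3's importlib.reload. The module name "model" and its contents are hypothetical; reload re-executes the module's top-level code, so edits made between runs take effect without restarting the interpreter.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # avoid stale .pyc files masking the edit

with tempfile.TemporaryDirectory() as d:
    mod_path = os.path.join(d, "model.py")
    with open(mod_path, "w") as f:
        f.write("VERSION = 1\n")
    sys.path.insert(0, d)
    import model                      # first import: VERSION == 1
    first = model.VERSION
    with open(mod_path, "w") as f:    # simulate editing the model code
        f.write("VERSION = 2\n")
    importlib.reload(model)           # re-executes model.py in place
    second = model.VERSION            # now VERSION == 2
    sys.path.remove(d)
```

As the linked question discusses, reload only rebinds names inside the module itself; objects created from the old code elsewhere keep their old behavior, which is why saving under a new filename is the safer habit.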

