Best practices for turning jupyter notebooks into python scripts

Question

Jupyter (IPython) notebooks are deservedly known as a good tool for prototyping code and doing all kinds of machine learning work interactively. But when I use them, I inevitably run into the following:

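  • the notebook quickly becomes too complex and messy to be maintained and improved further as a notebook, and I have to make python scripts out of it;
  • when it comes to production code (e.g. code that needs to be re-run every day), the notebook again is not the best format.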

Suppose I've developed a whole machine learning pipeline in jupyter that includes fetching raw data from various sources, cleaning the data, feature engineering, and finally training models. Now what's the best way to make scripts from it with efficient and readable code? So far I have tackled it in several ways:

  1. Simply convert .ipynb to .py and, with only slight changes, hard-code the whole pipeline from the notebook into one python script.

  • '+': quick
  • '-': dirty, inflexible, inconvenient to maintain

  2. Make a single script with many functions (approximately one function for every one or two cells), trying to cover the stages of the pipeline with separate functions, and name them accordingly. Then specify all parameters and global constants via argparse.

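  • '+': more flexible usage; more readable code (if you translate the pipeline logic into functions properly)
  • '-': often the pipeline cannot be split into logically complete pieces that could become functions without quirks in the code. All these functions typically need to be called only once in the script, rather than many times inside loops, maps, etc. Furthermore, each function typically takes the output of all the functions called before it, so many arguments have to be passed to each function.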
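To make approach (2) concrete, here is a minimal sketch under stated assumptions. The stage functions (`fetch_data`, `clean_data`, `build_features`, `train_model`), the toy data, and the `--n-rows` and `--scale` options are all hypothetical stand-ins, not code from the question.

```python
import argparse


def fetch_data(n_rows):
    """Stand-in for fetching raw data: a list of numbers with some missing."""
    return [None if i % 4 == 0 else float(i) for i in range(n_rows)]


def clean_data(raw):
    """Drop missing values."""
    return [x for x in raw if x is not None]


def build_features(clean, scale):
    """Toy feature engineering: scale every value."""
    return [x * scale for x in clean]


def train_model(features):
    """Toy 'model': the mean of the features."""
    return sum(features) / len(features)


def parse_args(argv=None):
    """Collect parameters and constants in one place (hypothetical options)."""
    parser = argparse.ArgumentParser(description="Toy pipeline")
    parser.add_argument("--n-rows", type=int, default=8)
    parser.add_argument("--scale", type=float, default=2.0)
    return parser.parse_args(argv)


def run_pipeline(args):
    # Each stage consumes the output of the previous one.
    raw = fetch_data(args.n_rows)
    clean = clean_data(raw)
    features = build_features(clean, args.scale)
    return train_model(features)


if __name__ == "__main__":
    print(run_pipeline(parse_args()))
```

Even in this toy version, every intermediate result has to be threaded through `run_pipeline` by hand, which is the drawback described above once the number of stages grows.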

  3. The same thing as point (2), but now wrap all the functions inside a class. All the global constants, as well as the outputs of each method, can then be stored as class attributes.

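  • '+': you do not have to pass many arguments to each method, since all the previous outputs are already stored as attributes
  • '-': the overall logic of the task is still not captured: it is a data and machine learning pipeline, not just a class. The class exists only to be created, have all its methods called sequentially one by one, and then be discarded. On top of this, classes take quite long to implement.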
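A minimal sketch of approach (3), assuming a toy pipeline: the `Pipeline` class, its method names, and its data are hypothetical and only illustrate storing constants and intermediate outputs as attributes.

```python
class Pipeline:
    """Toy pipeline whose constants and intermediate results live on the instance."""

    def __init__(self, n_rows=8, scale=2.0):
        # "Global constants" become attributes.
        self.n_rows = n_rows
        self.scale = scale
        # Outputs of each stage are filled in as the methods run.
        self.raw = None
        self.clean = None
        self.features = None
        self.model = None

    def fetch(self):
        self.raw = [None if i % 4 == 0 else float(i) for i in range(self.n_rows)]
        return self

    def clean_data(self):
        self.clean = [x for x in self.raw if x is not None]
        return self

    def build_features(self):
        self.features = [x * self.scale for x in self.clean]
        return self

    def train(self):
        # Toy "model": the mean of the features.
        self.model = sum(self.features) / len(self.features)
        return self

    def run(self):
        """Call every stage once, in order, and return the result."""
        return self.fetch().clean_data().build_features().train().model
```

No method takes arguments here, but the instance is created, run once from top to bottom, and then thrown away, which matches the drawback noted above.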

  4. Convert the notebook into a python module with several scripts. I haven't tried this, but I suspect it is the most time-consuming way to deal with the problem.

I suppose this overall setting is very common among data scientists, but surprisingly I cannot find any useful advice about it.

Folks, please share your ideas and experience. Have you ever encountered this issue? How have you tackled it?

Recommended answer

Life saver: as you're writing your notebooks, incrementally refactor your code into functions, writing some minimal assert tests and docstrings.

After that, refactoring from notebook to script is natural. Not only that, but it makes your life easier when writing long notebooks, even if you have no plans to turn them into anything else.

Basic example of a cell's content with "minimal" tests and docstrings:

import os
import zipfile


def zip_count(f):
    """Given zip filename, returns number of files inside.

    str -> int"""
    # ZipFile is itself a context manager, so the archive is closed on exit
    with zipfile.ZipFile(f) as archive:
        num_files = len(archive.infolist())
    return num_files

zip_filename = 'data/myfile.zip'

# Make sure `myfile` always has three files
assert zip_count(zip_filename) == 3
# And total zip size is under 2 MB
assert os.path.getsize(zip_filename) / 1024**2 < 2

print(zip_count(zip_filename))

Once you've exported it to bare .py files, your code will probably not be structured into classes yet. But it is worth the effort to refactor your notebook to the point where it has a set of documented functions, each with a set of simple assert statements that can easily be moved into tests.py for testing with pytest, unittest, or what have you. If it makes sense, bundling these functions into methods for your classes is dead easy after that.
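As an illustration of moving the asserts out of the notebook, a tests.py could look roughly like the sketch below. It assumes pytest-style test functions; `make_zip` and the sample member names are hypothetical helpers, and `zip_count` is repeated here only to keep the example self-contained (in a real project it would be imported from the exported module).

```python
# tests.py: a minimal sketch, not a complete test suite
import os
import tempfile
import zipfile


def zip_count(f):
    """Given zip filename, returns number of files inside."""
    with zipfile.ZipFile(f) as archive:
        return len(archive.infolist())


def make_zip(directory, member_names):
    """Hypothetical helper: write a small zip archive and return its path."""
    path = os.path.join(directory, "sample.zip")
    with zipfile.ZipFile(path, "w") as archive:
        for name in member_names:
            archive.writestr(name, "placeholder content")
    return path


def test_zip_count_three_files():
    with tempfile.TemporaryDirectory() as tmp:
        path = make_zip(tmp, ["a.txt", "b.txt", "c.txt"])
        assert zip_count(path) == 3


def test_zip_size_under_2_mb():
    with tempfile.TemporaryDirectory() as tmp:
        path = make_zip(tmp, ["a.txt"])
        assert os.path.getsize(path) / 1024**2 < 2
```

Building the archive inside the test, rather than relying on a file such as 'data/myfile.zip', keeps the tests independent of whatever data happens to be on disk.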

If all goes well, all you need to do after that is write your if __name__ == '__main__': block and its "hooks": if you're writing a script to be called from the terminal, you'll want to handle command-line arguments; if you're writing a module, you'll want to think about its API with the __init__.py file, etc.
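For the terminal case, the entry point might be sketched as follows. The `main` function, its `paths` argument, and the output format are assumptions made for illustration; only `zip_count` comes from the example earlier in this answer.

```python
import argparse
import zipfile


def zip_count(f):
    """Given zip filename, returns number of files inside."""
    with zipfile.ZipFile(f) as archive:
        return len(archive.infolist())


def main(argv=None):
    """Command-line hook: report the number of files in each archive given."""
    parser = argparse.ArgumentParser(description="Count files inside zip archives.")
    parser.add_argument("paths", nargs="*", help="zip files to inspect")
    args = parser.parse_args(argv)  # argv=None means "read sys.argv"
    counts = {path: zip_count(path) for path in args.paths}
    for path, count in counts.items():
        print(f"{path}: {count}")
    return counts


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when it is imported.
    main()
```

Keeping the argument parsing inside `main(argv=None)` means the same function can be called from a test or another module with an explicit argument list, while importing the file has no side effects.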

It all depends on what the intended use case is, of course: there's quite a difference between converting a notebook to a small script vs. turning it into a full-fledged module or package.

Some thoughts on the notebook-to-script workflow:

  1. Export the Jupyter Notebook to a Python file (.py) through the GUI.
  2. Remove the "helper" lines that don't do the actual work: print statements, plots, etc.
  3. If need be, bundle your logic into classes. The only extra refactoring work required should be to write your class docstrings and attributes.
  4. Write your script's entry point with if __name__ == '__main__'.
  5. Separate your assert statements for each of your functions/methods, and flesh out a minimal test suite in tests.py.
