Best practices for turning Jupyter notebooks into Python scripts


Question

The Jupyter (IPython) notebook is deservedly known as a good tool for prototyping code and doing all kinds of machine learning work interactively. But when I use it, I inevitably run into the following:

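  • the notebook quickly becomes too complex and messy to be maintained and improved further as a notebook, so I have to make Python scripts out of it;
  • when it comes to production code (e.g. code that needs to be re-run every day), the notebook is, again, not the best format.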

Suppose I've developed a whole machine learning pipeline in Jupyter that includes fetching raw data from various sources, cleaning the data, feature engineering, and finally training models. What is the best way to turn it into scripts with efficient and readable code? So far I have tackled it in several ways:

  1. Simply convert .ipynb to .py and, with only slight changes, hard-code the whole pipeline from the notebook into one Python script.

  • '+': quick
  • '-': dirty, inflexible, inconvenient to maintain
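For reference, the conversion in approach 1 is commonly done with `jupyter nbconvert --to script notebook.ipynb`. As a rough illustration of what such an export amounts to, here is a minimal stdlib-only sketch that concatenates the code cells of a notebook. It assumes the nbformat 4 JSON layout, does not handle IPython magics, and `notebook_to_script` plus the two-cell example notebook are hypothetical names and data invented for this illustration, not part of any library.

```python
import json


def notebook_to_script(notebook_json):
    """Return the code cells of a notebook (a JSON string) as one script.

    Assumes the nbformat 4 layout: a top-level "cells" list whose items
    have "cell_type" and "source" (a string or a list of strings).
    """
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            chunks.append(source.rstrip())
    # Separate cells with a blank line, end the script with a newline.
    return "\n\n".join(chunks) + "\n"


# Hypothetical two-cell notebook, used only for illustration.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": 1, "source": ["x = 1\n", "y = x + 1"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": 2, "source": ["print(y)"]},
    ],
})

script = notebook_to_script(example)
print(script)
```

The result is exactly the "one long script" that this approach produces, which is why it inherits the notebook's lack of structure.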

  2. Make a single script with many functions (roughly one function for every one or two cells), trying to map the stages of the pipeline onto separate functions, and name them accordingly. Then specify all parameters and global constants via argparse.

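  • '+': more flexible to use; more readable code (if you translated the pipeline logic into functions properly)
  • '-': often the pipeline cannot be split into logically complete pieces that become functions without quirks in the code. All these functions typically need to be called only once in the script, rather than many times inside loops, maps, etc. Furthermore, each function typically needs the output of all the functions called before it, so many arguments have to be passed to each function.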
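A minimal sketch of approach 2, assuming a toy pipeline: the stage functions, their data, and the `--n-rows` / `--max-x` options are all hypothetical stand-ins for the real fetching, cleaning, and training steps. Only `argparse` itself is a real library.

```python
import argparse


def fetch_data(n_rows):
    """Stand-in for fetching raw data: returns a list of (x, y) pairs."""
    return [(float(i), 2.0 * i + 1.0) for i in range(n_rows)]


def clean_data(rows, max_x):
    """Drop rows whose x value exceeds max_x."""
    return [(x, y) for x, y in rows if x <= max_x]


def train_model(rows):
    """Fit y = a*x + b by ordinary least squares; returns (a, b)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def parse_args(argv=None):
    """Former notebook constants become command-line options."""
    parser = argparse.ArgumentParser(description="Toy pipeline")
    parser.add_argument("--n-rows", type=int, default=10)
    parser.add_argument("--max-x", type=float, default=5.0)
    return parser.parse_args(argv)


def run_pipeline(args):
    """Chain the stages; each one consumes the previous stage's output."""
    rows = fetch_data(args.n_rows)
    rows = clean_data(rows, args.max_x)
    return train_model(rows)


# An explicit argument list is passed here so the sketch does not
# depend on sys.argv; a real script would call parse_args() with none.
args = parse_args(["--n-rows", "8", "--max-x", "4"])
slope, intercept = run_pipeline(args)
print(slope, intercept)
```

Even in this small example the drawback noted above is visible: `run_pipeline` exists only to thread each output into the next call.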

  3. The same thing as point (2), but now wrap all the functions inside a class. All the global constants, as well as the outputs of each method, can then be stored as class attributes.

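  • '+': you don't need to pass many arguments to each method, since all previous outputs are already stored as attributes
  • '-': the overall logic of the task is still not captured: it is a data and machine learning pipeline, not just a class. The only purpose of the class is to be created, have all its methods called one after another, and then be discarded. On top of this, classes take quite long to implement.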
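A minimal sketch of approach 3 under the same kind of toy assumptions: the `Pipeline` class, its stage names, and its data are hypothetical, and the "model" is just a mean so the example stays short.

```python
class Pipeline:
    """Toy pipeline whose stages store their outputs as attributes."""

    def __init__(self, n_rows=10, max_x=5.0):
        # Former global constants become constructor parameters.
        self.n_rows = n_rows
        self.max_x = max_x
        self.raw = None
        self.clean = None
        self.mean_y = None

    def fetch(self):
        self.raw = [(float(i), 2.0 * i + 1.0) for i in range(self.n_rows)]
        return self

    def clean_rows(self):
        self.clean = [(x, y) for x, y in self.raw if x <= self.max_x]
        return self

    def train(self):
        # The "model" here is only the mean of y, to keep the sketch short.
        self.mean_y = sum(y for _, y in self.clean) / len(self.clean)
        return self

    def run(self):
        """Call every stage once, in order."""
        return self.fetch().clean_rows().train()


pipeline = Pipeline(n_rows=8, max_x=4.0).run()
print(len(pipeline.raw), len(pipeline.clean), pipeline.mean_y)
```

No stage takes arguments any more, but the object is created, run once, and thrown away, which is the objection raised above.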

  4. Convert the notebook into a Python module with several scripts. I haven't tried this, but I suspect it is the most time-consuming way to deal with the problem.

I suppose this overall situation is very common among data scientists, but surprisingly I cannot find any useful advice on it.

Folks, please share your ideas and experience. Have you ever encountered this issue? How have you tackled it?

Recommended answer

Life saver: as you're writing your notebooks, incrementally refactor your code into functions, writing some minimal assert tests and docstrings.

After that, refactoring from notebook to script is natural. Not only that, but it makes your life easier when writing long notebooks, even if you have no plans to turn them into anything else.

Basic example of a cell's content with "minimal" tests and docstrings:

import os
import zipfile


def zip_count(f):
    """Given zip filename, returns number of files inside.

    str -> int"""
    # ZipFile is a context manager, so the archive is closed on exit.
    with zipfile.ZipFile(f) as archive:
        num_files = len(archive.infolist())
    return num_files


zip_filename = 'data/myfile.zip'

# Make sure `myfile` always has three files
assert zip_count(zip_filename) == 3
# And total zip size is under 2 MB
assert os.path.getsize(zip_filename) / 1024**2 < 2

print(zip_count(zip_filename))

Once you've exported it to bare .py files, your code will probably not be structured into classes yet. But it is worth the effort to have refactored your notebook to the point where it has a set of documented functions, each with a set of simple assert statements that can easily be moved into tests.py for testing with pytest, unittest, or whatever you prefer. If it makes sense, bundling these functions into methods of your classes is dead easy after that.
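As a sketch of what moving those assert statements into tests.py might look like: the test function names and the `make_zip` helper are hypothetical, and the archive is built in a temporary directory so the tests do not depend on data/myfile.zip existing. `zip_count` is repeated here only to keep the example self-contained; in practice tests.py would import it from the exported script.

```python
import os
import tempfile
import zipfile


def zip_count(f):
    """Given zip filename, returns number of files inside."""
    with zipfile.ZipFile(f) as archive:
        return len(archive.infolist())


def make_zip(path, names):
    """Hypothetical test helper: write a zip with one small file per name."""
    with zipfile.ZipFile(path, "w") as archive:
        for name in names:
            archive.writestr(name, "hello")


def test_zip_count_three_files():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "myfile.zip")
        make_zip(path, ["a.txt", "b.txt", "c.txt"])
        assert zip_count(path) == 3


def test_zip_size_under_2_mb():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "myfile.zip")
        make_zip(path, ["a.txt"])
        assert os.path.getsize(path) / 1024**2 < 2


# pytest would collect the test_* functions automatically; calling them
# directly keeps this sketch runnable without pytest installed.
test_zip_count_three_files()
test_zip_size_under_2_mb()
```

The notebook asserts carry over almost unchanged; the main addition is that each test builds its own input instead of relying on a file that happens to be on disk.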

If all goes well, all you need to do after that is write your if __name__ == '__main__': block and its "hooks": if you're writing a script to be called from the terminal, you'll want to handle command-line arguments; if you're writing a module, you'll want to think about its API with the __init__.py file, etc.
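One possible shape for such an entry point, reusing `zip_count` from the example above: the `main` function and its `archives` argument are assumptions made for this sketch, not something the answer prescribes.

```python
import argparse
import zipfile


def zip_count(f):
    """Given zip filename, returns number of files inside."""
    with zipfile.ZipFile(f) as archive:
        return len(archive.infolist())


def main(argv=None):
    """Parse command-line arguments and print the file count per archive."""
    parser = argparse.ArgumentParser(
        description="Count files inside zip archives.")
    parser.add_argument("archives", nargs="*", help="paths to .zip files")
    # argv=None makes argparse read sys.argv[1:]; tests can pass a list.
    args = parser.parse_args(argv)
    for path in args.archives:
        print(path, zip_count(path))


# Runs only when the file is executed as a script, not when it is imported.
if __name__ == "__main__":
    main()
```

Because the guard keeps the argument handling out of import time, the same file can be run from the terminal and imported by tests.py without side effects.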

It all depends on what the intended use case is, of course: there's quite a difference between converting a notebook to a small script and turning it into a full-fledged module or package.

Here are some ideas for a notebook-to-script workflow:

  1. Export the Jupyter Notebook to a Python file (.py) through the GUI.
  2. Remove the "helper" lines that don't do the actual work: print statements, plots, etc.
  3. If need be, bundle your logic into classes. The only extra refactoring work required should be writing your class docstrings and attributes.
  4. Write your script's entry point with if __name__ == '__main__'.
  5. Separate out the assert statements for each of your functions/methods, and flesh out a minimal test suite in tests.py.
