Scientific Computing & IPython Notebook: How to organize code?


Problem description

I'm using IPython Notebook for my research. As my file grows bigger, I constantly extract code out: things like plot methods, fitting methods, etc.

I think I need a way to organize this. Is there any good way to do it?

Currently, I do this by:

data/
helpers/
my_notebook.ipynb
import_file.py

I store data in data/, extract helper methods into helpers/, and divide them into files like plot_helper.py, app_helper.py, etc.

I gather the imports in import_file.py:

from IPython.display import display

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
from matplotlib import pyplot as plt
import sklearn
import re

And then I can import everything I need in the top cell of the .ipynb with from import_file import *.

The structure can be seen at https://github.com/cqcn1991/Wind-Speed-Analysis

One problem I have right now is that I have too many submodules in helpers/, and it's hard to decide which method should be put into which file.

I think a possible way is to organize them into pre-processing, processing, and post-processing.

Update:

My big Jupyter research notebook: https://cdn.rawgit.com/cqcn1991/Wind-Speed-Analysis/master/output_HTML/marham.html

The top cell contains the standard imports + magics + extensions:

%matplotlib inline
%load_ext autoreload
%autoreload 2

from __future__ import division
from import_file import *
load_libs()

Recommended answer

There are many ways to organise an IPython research project. I manage a team of 5 data scientists and 3 data engineers, and I have found these tips to work well for our use case:

This is a summary of my PyData London talk:

http://www.slideshare.net/vladimirkazantsev/clean-code-in-jupyter-notebook

1. Create a shared (multi-project) utils library

You most likely have to reuse/repeat some code across different research projects. Start refactoring those pieces into a "common utils" package. Write a setup.py file and push the module to GitHub (or similar), so that team members can "pip install" it from VCS.

Examples of what it may contain:

  • Data warehouse or storage access functions
  • Common plotting functions
  • Reusable math/statistics methods
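As a sketch, a minimal layout and setup.py for such a package might look like this (the package name team_utils, the module names, and the repository URL are all hypothetical, not from the answer):

```python
# Hypothetical layout for a shared "common utils" package:
#
#   team_utils/
#       __init__.py
#       storage.py     # data warehouse / storage access functions
#       plotting.py    # common plotting functions
#       stats.py       # reusable math/statistics methods
#   setup.py
#
# A minimal setup.py for that layout:
from setuptools import setup, find_packages

setup(
    name='team_utils',
    version='0.1.0',
    packages=find_packages(),
    install_requires=['pandas', 'matplotlib'],
)
```

Team members can then install it straight from version control, e.g. pip install git+https://github.com/your-org/team_utils.git (URL is a placeholder).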

2. Split your fat master notebook into smaller notebooks

In my experience, a good length for a file of code (in any language) is only a few screens (100-400 lines). A Jupyter Notebook is still a source file, but with output! Reading a notebook with 20+ cells is very hard. I like my notebooks to have 4-10 cells max.

Ideally, each notebook should contain one "hypothesis-data-conclusions" triplet.

An example of splitting notebooks:

1_data_preparation.ipynb

2_data_validation.ipynb

3_exploratory_plotting.ipynb

4_simple_linear_model.ipynb

5_hierarchical_model.ipynb

playground.ipynb

Save the output of 1_data_preparation.ipynb to a pickle with df.to_pickle('clean_data.pkl') (or to csv, or to a fast DB), and use pd.read_pickle("clean_data.pkl") at the top of each subsequent notebook.
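A minimal sketch of that handoff between notebooks (the DataFrame contents here are made up for illustration; the file name is the one used above):

```python
import pandas as pd

# End of 1_data_preparation.ipynb: persist the cleaned data.
df = pd.DataFrame({'speed': [4.2, 5.1, 3.8],
                   'direction': [90, 180, 270]})
df.to_pickle('clean_data.pkl')

# Top of 2_data_validation.ipynb (or any later notebook): reload it.
df = pd.read_pickle('clean_data.pkl')
print(df.shape)  # (3, 2)
```

Each downstream notebook then starts from the same cleaned dataset without re-running the preparation step.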

3. It is not Python - it is IPython Notebook

What makes a notebook unique is cells. Use them well. Each cell should be an "idea-execution-output" triplet. If a cell does not output anything, combine it with the following cell. The import cell should output nothing - that is its expected output.

If a cell has several outputs, it may be worth splitting it up.

Hiding imports may or may not be a good idea:

from myimports import *

Your reader may want to figure out what exactly you are importing, to use the same things for her research. So use this with caution. We do use it for pandas, numpy, matplotlib, sql, however.

Hiding the "secret sauce" in /helpers/model.py is bad:

myutil.fit_model_and_calculate(df)

This may save you typing and remove duplicated code, but your collaborator will have to open another file to figure out what is going on. Unfortunately, the notebook (Jupyter) is a rather inflexible and basic environment, but you still don't want to force your reader to leave it for every piece of code. I hope future IDEs will improve on this, but for now, keep the "secret sauce" inside the notebook, and put "boring and obvious utils" wherever you see fit. DRY still applies - you have to find the balance.

This should not stop you from packaging reusable code into functions or even small classes. But "flat is better than nested".
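As a sketch of that "flat" style (the function names and data are illustrative only, not from the answer): a couple of small, top-level functions that a reader can follow cell by cell, rather than one nested pipeline object:

```python
# Flat: small, independent, top-level functions.
def clean(records):
    """Drop records that contain missing values."""
    return [r for r in records if None not in r.values()]

def summarize(records, key):
    """Average of one numeric field across records."""
    return sum(r[key] for r in records) / len(records)

records = [{'speed': 4.0}, {'speed': 6.0}, {'speed': None}]
print(summarize(clean(records), 'speed'))  # 5.0
```

Each function stays readable inside a notebook cell, and both are trivial to lift into a module later.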

4. Keep notebooks clean

You should be able to "Reset & Run All" at any point in time.

Each re-run should be fast! This means you may have to invest in writing some caching functions. Maybe you even want to put those into your "common utils" module.
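One way such a caching function might look - this is a hedged sketch, with the decorator name, cache file, and example computation all my own, not from the answer:

```python
import functools
import os
import pickle

def disk_cache(path):
    """Cache a function's result in `path`; delete the file to force a re-run."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if os.path.exists(path):
                with open(path, 'rb') as f:
                    return pickle.load(f)  # fast path for "Reset & Run All"
            result = func(*args, **kwargs)
            with open(path, 'wb') as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

@disk_cache('expensive_step.pkl')
def expensive_step():
    # Stand-in for a slow query or simulation.
    return sum(range(1000))

print(expensive_step())  # computed on the first run
print(expensive_step())  # loaded from the cache on every later run
```

The first call pays the full cost; every subsequent "Reset & Run All" reads the pickle instead.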

Each cell should be executable multiple times, without the need to re-initialise the notebook. This saves you time and keeps the code more robust. But a cell may depend on state created by previous cells: making each cell completely independent from the cells above is an anti-pattern, IMO.

After you are done with the research, you are not done with the notebook. Refactor.

5. Create a project module, but be selective

If you keep reusing a plotting or analytics function, do refactor it into this module. But in my experience, people expect to read and understand a notebook without opening multiple util sub-modules. So naming your sub-routines well is even more important here than in normal Python.

"Clean code reads like well written prose" - Grady Booch (developer of UML)

6. Host a Jupyter server in the cloud for the whole team

You will have one shared environment, so everyone can quickly review and validate research without needing to match the environment (even though conda makes this pretty easy).

And you can configure defaults, like the mpl style/colors, and make matplotlib inline by default:

~/.ipython/profile_default/ipython_config.py

Add the line c.InteractiveShellApp.matplotlib = 'inline'
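The relevant fragment of that config file might look as follows (only the matplotlib line comes from the answer; the commented-out style lines are an assumed illustration of further defaults):

```python
# ~/.ipython/profile_default/ipython_config.py
c = get_config()  # provided by IPython when this config file is loaded

# Make matplotlib inline by default for every notebook on the server.
c.InteractiveShellApp.matplotlib = 'inline'

# (Assumed example) run shared setup code in every new kernel:
# c.InteractiveShellApp.exec_lines = [
#     "import matplotlib.pyplot as plt",
#     "plt.style.use('ggplot')",
# ]
```

With this in place, no one on the team needs the %matplotlib inline magic at the top of each notebook.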

7. (Experimental idea) Run one notebook with different parameters from another notebook

Quite often you may want to re-run a whole notebook, but with different input parameters.

To do this, you can structure your research notebook as follows: place a params dictionary in the first cell of the "source notebook".

params = dict(platform='iOS', 
              start_date='2016-05-01', 
              retention=7)
df = get_data(params ..)
do_analysis(params ..)

And in another (higher logical level) notebook, execute it using this function:

import io

import nbformat
from IPython import get_ipython


def run_notebook(nbfile, **kwargs):
    """
    Execute another notebook in the current kernel, overriding its `params`.

    example:
    run_notebook('report.ipynb', platform='google_play', start_date='2016-06-10')
    """

    def read_notebook(nbfile):
        if not nbfile.endswith('.ipynb'):
            nbfile += '.ipynb'

        with io.open(nbfile) as f:
            nb = nbformat.read(f, as_version=4)
        return nb

    ip = get_ipython()
    gl = ip.ns_table['user_global']
    gl['params'] = None
    arguments_in_original_state = True

    # Run each code cell; right after the cell that defines `params`,
    # override its entries with the keyword arguments passed in.
    for cell in read_notebook(nbfile).cells:
        if cell.cell_type != 'code':
            continue
        ip.run_cell(cell.source)

        if arguments_in_original_state and isinstance(gl['params'], dict):
            gl['params'].update(kwargs)
            arguments_in_original_state = False

Whether this "design pattern" proves useful is yet to be seen. We had some success with it - at least we stopped duplicating notebooks only to change a few inputs.

Refactoring the notebook into a class or module breaks the quick "idea-execution-output" feedback loop that cells provide. And, IMHO, it is not "ipythonic".

8. Write (unit) tests for shared libraries in notebooks, and use py.test

There is a plugin for py.test that can discover and run tests inside notebooks!

https://pypi.python.org/pypi/pytest-ipynb
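Even before adopting the plugin, helpers extracted from notebooks can be covered with plain py.test; a minimal sketch (moving_average is a hypothetical helper standing in for your shared utils code, not from the answer):

```python
# test_plot_helper.py - a plain py.test test for a shared helper.

def moving_average(values, window):
    """Simple moving average over a list of numbers."""
    if window <= 0 or window > len(values):
        raise ValueError('bad window size')
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def test_moving_average():
    assert moving_average([1, 2, 3, 4], 2) == [1.5, 2.5, 3.5]
```

Running py.test picks up test_moving_average automatically; pytest-ipynb extends the same discovery to cells inside .ipynb files.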
