How to make good reproducible pandas examples


Question

Having spent a decent amount of time watching both the r and pandas tags on SO, the impression that I get is that pandas questions are less likely to contain reproducible data. This is something that the R community has been pretty good about encouraging, and thanks to guides like this, newcomers are able to get some help on putting together these examples. People who are able to read these guides and come back with reproducible data will often have much better luck getting answers to their questions.

How can we create good reproducible examples for pandas questions? Simple dataframes can be put together, e.g.:

import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'], 
                   'income': [40000, 50000, 42000]})

But many example datasets need more complicated structure, e.g.:

  • datetime indices or data
  • Multiple categorical variables (is there an equivalent to R's expand.grid() function, which produces all possible combinations of some given variables?)
  • MultiIndex or Panel data
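
For what it's worth, here is one possible way to mock up such structures. This is only a minimal sketch and not part of the original question: it assumes `itertools.product` as a rough stand-in for R's expand.grid(), and the names `user`, `product`, `sales` and `value` are made up for illustration.

```python
import itertools

import numpy as np
import pandas as pd

# datetime index: five consecutive days starting from an arbitrary date
dates = pd.date_range("2020-01-01", periods=5, freq="D")
ts = pd.DataFrame({"value": np.arange(5)}, index=dates)

# all combinations of two categorical variables (hypothetical names),
# roughly what R's expand.grid() produces (row order may differ from R)
users = ["Bob", "Jane"]
products = ["apple", "pear", "plum"]
grid = pd.DataFrame(list(itertools.product(users, products)),
                    columns=["user", "product"])

# the same combinations used as a MultiIndex
idx = pd.MultiIndex.from_product([users, products], names=["user", "product"])
mi = pd.DataFrame({"sales": range(len(idx))}, index=idx)
```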

For datasets that are hard to mock up using a few lines of code, is there an equivalent to R's dput() that allows you to generate copy-pasteable code to regenerate your datastructure?
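
One candidate for a dput()-like round trip, offered as a sketch rather than a definitive answer, is `DataFrame.to_dict()`: the printed dict is valid Python that can be pasted back into `pd.DataFrame(...)`. With `orient='list'` the index is not preserved, and dtypes such as datetimes may not survive the trip unchanged, so this mainly suits simple frames.

```python
import pandas as pd

df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})

# to_dict() returns a plain dict whose printed form is valid Python
as_dict = df.to_dict(orient='list')
print(as_dict)

# anyone can paste the printed dict back into pd.DataFrame(...)
rebuilt = pd.DataFrame(as_dict)
```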

Recommended answer

Note: the ideas here are fairly generic for Stack Overflow.

    • do include small* example DataFrame, either as runnable code:

    In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
    

    or make it "copy and pasteable" using pd.read_clipboard(sep=r'\s\s+'). You can format the text for Stack Overflow by highlighting it and pressing Ctrl+K (or by prepending four spaces to each line), or by placing three tildes above and below your code with your code unindented:

    In [2]: df
    Out[2]: 
       A  B
    0  1  2
    1  1  3
    2  4  6
    

    Test pd.read_clipboard(sep=r'\s\s+') yourself.
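
To check how pasted text will parse without touching the clipboard, a rough equivalent is to feed the same text to `pd.read_csv` with the same separator. This is a sketch under the assumption that `read_clipboard` hands the text to the same parser; the `pasted` string is simply the printed frame from above.

```python
import io

import pandas as pd

# text as it would appear after copying the printed frame from a post
pasted = """\
   A  B
0  1  2
1  1  3
2  4  6
"""

# sep=r'\s\s+' splits on runs of two or more whitespace characters;
# a regex separator requires the python parsing engine
df = pd.read_csv(io.StringIO(pasted), sep=r'\s\s+', engine='python')
print(df)
```

Using two or more whitespace characters as the separator means that values containing a single space are not split apart.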

    * I really do mean small: the vast majority of example DataFrames could be fewer than 6 rows [citation needed], and I bet I can do it in 5 rows. Can you reproduce the error with df = df.head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.

    * Every rule has an exception, the obvious one is for performance issues (in which case definitely use %timeit and possibly %prun), where you should generate (consider using np.random.seed so we have the exact same frame): df = pd.DataFrame(np.random.randn(100000000, 10)). Saying that, "make this code fast for me" is not strictly on topic for the site...
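
A minimal sketch of a seeded frame for a performance question follows. The shape here is deliberately far smaller than the one quoted above, purely to keep the example quick to run; scale it up to whatever size actually shows the slowdown.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so everyone generates the same values
# much smaller than the frame in the text, purely to keep the sketch quick
df = pd.DataFrame(np.random.randn(1000, 10))
```

In IPython, the operation in question can then be timed with %timeit on this frame.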

    write out the outcome you desire (similarly to above)

    In [3]: iwantthis
    Out[3]: 
       A  B
    0  1  5
    1  4  6
    

    Explain where the numbers come from: the 5 is the sum of the B column for the rows where A is 1.

    do show the code you've tried:

    In [4]: df.groupby('A').sum()
    Out[4]: 
       B
    A   
    1  5
    4  6
    

    But say what's incorrect: the A column is in the index rather than a column.

    do show you've done some research (search the docs, search StackOverflow), give a summary:

    The docstring for sum simply states "Compute sum of group values"

    The groupby docs don't give any examples for this.

    Aside: the answer here is to use df.groupby('A', as_index=False).sum().
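
Putting the pieces of this running example together, the sketch below rebuilds the small frame, shows the attempt that leaves A in the index, and compares the `as_index=False` version against the desired outcome. The variable names `in_index` and `result` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# default: the grouping column A ends up in the index
in_index = df.groupby('A').sum()

# as_index=False keeps A as an ordinary column
result = df.groupby('A', as_index=False).sum()

# the desired outcome, written out as runnable code
iwantthis = pd.DataFrame([[1, 5], [4, 6]], columns=['A', 'B'])
print(result.equals(iwantthis))
```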

    if it's relevant that you have Timestamp columns, e.g. you're resampling or something, then be explicit and apply pd.to_datetime to them for good measure**.

    df['date'] = pd.to_datetime(df['date']) # this column ought to be date..
    

    ** Sometimes this is the issue itself: they were strings.
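
A quick way to see whether a column really holds timestamps is to check its dtype before and after the conversion. The frame below is hypothetical (made-up `date` and `value` columns), and the resample at the end is just one example of an operation that needs real datetimes.

```python
import pandas as pd

# hypothetical frame whose 'date' column holds strings, not timestamps
df = pd.DataFrame({'date': ['2020-01-01', '2020-01-02', '2020-01-03'],
                   'value': [1, 2, 3]})
was_datetime = pd.api.types.is_datetime64_any_dtype(df['date'])

df['date'] = pd.to_datetime(df['date'])
is_datetime = pd.api.types.is_datetime64_any_dtype(df['date'])

# with real timestamps, datetime operations such as resample become available
daily_sum = df.set_index('date').resample('D').sum()
```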

    • don't include a MultiIndex, which we can't copy and paste (see above), this is kind of a grievance with pandas default display but nonetheless annoying:

    In [11]: df
    Out[11]:
         C
    A B   
    1 2  3
      2  6
    

    The correct way is to include an ordinary DataFrame with a set_index call:

    In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 6]], columns=['A', 'B', 'C']).set_index(['A', 'B'])
    
    In [13]: df
    Out[13]: 
         C
    A B   
    1 2  3
      2  6
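
If the frame already has a MultiIndex, one option (a sketch, not the only way) is to print the `reset_index()` version, which has no blanked-out repeated labels, and state the `set_index` call needed to rebuild it.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 2, 6]],
                  columns=['A', 'B', 'C']).set_index(['A', 'B'])

# reset_index() turns the index levels back into ordinary columns,
# so every row of the printed frame is complete
flat = df.reset_index()
print(flat)

# a reader can rebuild the MultiIndex from the flat version
rebuilt = flat.set_index(['A', 'B'])
```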
    

  • do provide insight to what it is when giving the outcome you want:

       B
    A   
    1  1
    5  0
    

    Be specific about how you got the numbers (what are they?)... and double check they're correct.

    If your code throws an error, do include the entire stack trace (this can be edited out later if it's too noisy). Show the line number (and the corresponding line of your code which it's raising against).

    • don't link to a csv we don't have access to (ideally don't link to an external source at all...)

    df = pd.read_csv('my_secret_file.csv')  # ideally with lots of parsing options
    

    Most data is proprietary, we get that: make up similar data and see if you can reproduce the problem (something small).
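
A minimal sketch of made-up stand-in data follows. Every column name and value here is hypothetical; the idea is only to mirror the dtypes of the private file (integers, categories, floats, dates) in a handful of rows, with a seed so others get the same frame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded so the stand-in data is repeatable

# hypothetical column names and dtypes standing in for the private file
n = 6
fake = pd.DataFrame({
    'customer_id': np.arange(1, n + 1),
    'region': rng.choice(['north', 'south'], size=n),
    'amount': rng.normal(loc=100.0, scale=15.0, size=n).round(2),
    'date': pd.date_range('2021-01-01', periods=n, freq='D'),
})
```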

    don't explain the situation vaguely in words, like you have a DataFrame which is "large", mention some of the column names in passing (be sure not to mention their dtypes). Try and go into lots of detail about something which is completely meaningless without seeing the actual context. Presumably no one is even going to read to the end of this paragraph.

    Essays are bad; it's easier with small examples.

    don't include 10+ (100+??) lines of data munging before getting to your actual question.

    Please, we see enough of this in our day jobs. We want to help, but not like this....
    Cut the intro, and just show the relevant DataFrames (or small versions of them) in the step which is causing you trouble.
