Pandas: How to easily share a sample dataframe using df.to_dict()?

Question

This question was earlier marked as a duplicate of How to make good reproducible pandas examples. That contribution should undoubtedly be the go-to post for anyone seeking to make a reproducible data sample, while this post is meant to clarify a very practical and efficient way to include a given data sample in a question using df.to_dict() in combination with df=pd.DataFrame(<dict>). This was not explicitly covered in either the question or the answers in How to make good reproducible pandas examples. Using df.to_dict() also works very well in tandem with df.to_clipboard(), concisely covered in the post How to provide a reproducible copy of your DataFrame with to_clipboard().

Despite the clear and concise guidance in How do I ask a good question? and How to create a Minimal, Reproducible Example, many people neglect to include a reproducible data sample in their question. So what is a practical and easy way to reproduce a data sample when a simple pd.DataFrame(np.random.random(size=(5, 5))) is not enough? How can you, for example, use df.to_dict() and include the output in a question?

Recommended answer

The answer:

In many situations, using an approach with df.to_dict() will do the job perfectly! Here are two cases that come to mind:

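Case 1: You've got a dataframe built or loaded in Python from a local source

Case 2: You've got a table in another application (like Excel)

Case 1: You've got a dataframe built or loaded from a local source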

Given that you've got a pandas dataframe named df, just

  1. run df.to_dict() in your console or editor,
  2. copy the output, which is formatted as a dictionary, and
  3. paste the content into pd.DataFrame(<output>) and include that chunk in your now reproducible code snippet.
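The three steps above can be sketched as follows. This is only an illustration: the small city/temp dataframe is a hypothetical stand-in for whatever you have loaded locally, and the dictionary passed to pd.DataFrame() is simply the printed output of df.to_dict() pasted back in.

```python
import pandas as pd

# Hypothetical stand-in for a dataframe you built or loaded locally.
df = pd.DataFrame({'city': ['Oslo', 'Bergen', 'Tromso'],
                   'temp': [4.5, 7.1, -1.2]})

# Step 1: produce the dictionary; its printed form is what you copy.
sample = df.to_dict()
print(sample)

# Steps 2 and 3: paste the printed dictionary into pd.DataFrame(...).
# This chunk is what would go into the question.
df_shared = pd.DataFrame({'city': {0: 'Oslo', 1: 'Bergen', 2: 'Tromso'},
                          'temp': {0: 4.5, 1: 7.1, 2: -1.2}})
```

Anyone who copies the last statement gets the same columns, index and values without needing access to the original source.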


Case 2: You've got a table in another application (like Excel)

Depending on the source and its separator, such as ',', ';' or '\\s+' (the latter matches any run of whitespace), you can simply:

  1. Ctrl+C the contents,
  2. run df=pd.read_clipboard(sep='\\s+') in your console or editor,
  3. run df.to_dict(), and
  4. include the output in df=pd.DataFrame(<output>)
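Because the clipboard is not available in every environment, the sketch below simulates the copied table with a string. It assumes, as the pandas documentation describes, that pd.read_clipboard() hands the clipboard text to pd.read_csv(), so reading the string with the same separator should behave the same way. The table contents are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical text as it might arrive from a copied spreadsheet range.
copied = """name  qty  price
apple  3  1.25
pear   5  0.80
"""

# With a real clipboard you would run: df = pd.read_clipboard(sep='\\s+')
# Here the same separator is applied to the simulated clipboard text.
df = pd.read_csv(io.StringIO(copied), sep='\\s+')

# The printed dictionary is what you paste into pd.DataFrame(<output>).
print(df.to_dict())
```

Note that a whitespace separator only works when the cell values themselves contain no spaces; for such data a tab or comma separator is the safer choice.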

In this case, the start of your question would look something like this:

import pandas as pd
df = pd.DataFrame({0: {0: 0.25474768796402636, 1: 0.5792136563952824, 2: 0.5950396800676201},
                   1: {0: 0.9071073567355232, 1: 0.1657288354283053, 2: 0.4962367707789421},
                   2: {0: 0.7440601352930207, 1: 0.7755487356392468, 2: 0.5230707257648775}})

Of course, this gets a little clumsy with larger dataframes. But very often, all that anyone who seeks to answer your question needs is a small sample of your real-world data, so they can take the structure of your data into consideration. There are two ways to keep the output manageable:

  1. run df.head(20).to_dict() to only include the first 20 rows, and
  2. change the format of your dict using, for example, df.to_dict('split') (there are other options besides 'split') to reshape your output to a dict that requires fewer lines.
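As an illustration of the "other options" mentioned in the second point, here is a minimal sketch using the 'list' and 'records' orientations on a made-up dataframe. Both are valid values of the orient parameter of DataFrame.to_dict(), and both can be passed straight back to pd.DataFrame(). The trade-off, as far as I can tell, is that neither keeps the index, so they fit best when the index is just the default 0..n-1.

```python
import pandas as pd

# Made-up dataframe, used only to show the shape of each output.
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# One list per column: compact, but the index is not stored.
as_list = df.head(2).to_dict(orient='list')

# One dict per row: easy to read row by row, index also not stored.
as_records = df.head(2).to_dict(orient='records')

# Both forms can be fed directly to the DataFrame constructor.
df_from_list = pd.DataFrame(as_list)
df_from_records = pd.DataFrame(as_records)
```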

Here's an example using the iris dataset, among other places available from plotly express.

If you just run:

import plotly.express as px
import pandas as pd
df = px.data.iris()
df.to_dict()

This will produce an output of nearly 1000 lines, and won't be very practical as a reproducible sample. But if you include .head(10), you'll get:

{'sepal_length': {0: 5.1, 1: 4.9, 2: 4.7, 3: 4.6, 4: 5.0, 5: 5.4, 6: 4.6, 7: 5.0, 8: 4.4, 9: 4.9},
 'sepal_width': {0: 3.5, 1: 3.0, 2: 3.2, 3: 3.1, 4: 3.6, 5: 3.9, 6: 3.4, 7: 3.4, 8: 2.9, 9: 3.1},
 'petal_length': {0: 1.4, 1: 1.4, 2: 1.3, 3: 1.5, 4: 1.4, 5: 1.7, 6: 1.4, 7: 1.5, 8: 1.4, 9: 1.5},
 'petal_width': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.4, 6: 0.3, 7: 0.2, 8: 0.2, 9: 0.1},
 'species': {0: 'setosa', 1: 'setosa', 2: 'setosa', 3: 'setosa', 4: 'setosa', 5: 'setosa', 6: 'setosa', 7: 'setosa', 8: 'setosa', 9: 'setosa'},
 'species_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}}

And now we're getting somewhere. Depending on the structure and content of the data, though, this may not cover the complexity of the contents in a satisfactory manner. You can include more data on fewer lines by using to_dict('split') like this:

import plotly.express as px
df = px.data.iris().head(10)
df.to_dict('split')

Now your output will look like:

{'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 'columns': ['sepal_length',
  'sepal_width',
  'petal_length',
  'petal_width',
  'species',
  'species_id'],
 'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
  [4.9, 3.0, 1.4, 0.2, 'setosa', 1],
  [4.7, 3.2, 1.3, 0.2, 'setosa', 1],
  [4.6, 3.1, 1.5, 0.2, 'setosa', 1],
  [5.0, 3.6, 1.4, 0.2, 'setosa', 1],
  [5.4, 3.9, 1.7, 0.4, 'setosa', 1],
  [4.6, 3.4, 1.4, 0.3, 'setosa', 1],
  [5.0, 3.4, 1.5, 0.2, 'setosa', 1],
  [4.4, 2.9, 1.4, 0.2, 'setosa', 1],
  [4.9, 3.1, 1.5, 0.1, 'setosa', 1]]}

And now you can easily increase the number in .head(10) without cluttering your question too much. There's one minor drawback: you can no longer pass the output directly to pd.DataFrame as a single argument. But if you specify index, columns, and data explicitly, you'll be just fine. So for this particular dataset, my preferred approach would be:

import pandas as pd

sample = {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
             'columns': ['sepal_length',
              'sepal_width',
              'petal_length',
              'petal_width',
              'species',
              'species_id'],
             'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
              [4.9, 3.0, 1.4, 0.2, 'setosa', 1],
              [4.7, 3.2, 1.3, 0.2, 'setosa', 1],
              [4.6, 3.1, 1.5, 0.2, 'setosa', 1],
              [5.0, 3.6, 1.4, 0.2, 'setosa', 1],
              [5.4, 3.9, 1.7, 0.4, 'setosa', 1],
              [4.6, 3.4, 1.4, 0.3, 'setosa', 1],
              [5.0, 3.4, 1.5, 0.2, 'setosa', 1],
              [4.4, 2.9, 1.4, 0.2, 'setosa', 1],
              [4.9, 3.1, 1.5, 0.1, 'setosa', 1],
              [5.4, 3.7, 1.5, 0.2, 'setosa', 1],
              [4.8, 3.4, 1.6, 0.2, 'setosa', 1],
              [4.8, 3.0, 1.4, 0.1, 'setosa', 1],
              [4.3, 3.0, 1.1, 0.1, 'setosa', 1],
              [5.8, 4.0, 1.2, 0.2, 'setosa', 1]]}

df = pd.DataFrame(index=sample['index'], columns=sample['columns'], data=sample['data'])
df

Now you'll have this dataframe to work with:

    sepal_length  sepal_width  petal_length  petal_width species  species_id
0            5.1          3.5           1.4          0.2  setosa           1
1            4.9          3.0           1.4          0.2  setosa           1
2            4.7          3.2           1.3          0.2  setosa           1
3            4.6          3.1           1.5          0.2  setosa           1
4            5.0          3.6           1.4          0.2  setosa           1
5            5.4          3.9           1.7          0.4  setosa           1
6            4.6          3.4           1.4          0.3  setosa           1
7            5.0          3.4           1.5          0.2  setosa           1
8            4.4          2.9           1.4          0.2  setosa           1
9            4.9          3.1           1.5          0.1  setosa           1
10           5.4          3.7           1.5          0.2  setosa           1
11           4.8          3.4           1.6          0.2  setosa           1
12           4.8          3.0           1.4          0.1  setosa           1
13           4.3          3.0           1.1          0.1  setosa           1
14           5.8          4.0           1.2          0.2  setosa           1

Which will increase your chances of receiving useful answers significantly!
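A side note on the 'split' format: its keys ('index', 'columns', 'data') happen to match keyword arguments of the pd.DataFrame constructor, so the dictionary can also be unpacked with **. This is a shorthand I am suggesting rather than part of the approach above, shown here on a shortened version of the iris sample.

```python
import pandas as pd

# Shortened 'split' output: first three rows, two of the iris columns.
sample = {'index': [0, 1, 2],
          'columns': ['sepal_length', 'species'],
          'data': [[5.1, 'setosa'],
                   [4.9, 'setosa'],
                   [4.7, 'setosa']]}

# Equivalent to pd.DataFrame(index=..., columns=..., data=...).
df = pd.DataFrame(**sample)
print(df)
```

This only works when the dictionary contains exactly those three keys, which is what to_dict('split') produces by default.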

One caveat: the output of df.to_dict() may contain timestamps such as 1: Timestamp('2020-01-02 00:00:00'). pd.DataFrame() will not be able to read these unless the snippet also includes from pandas import Timestamp.
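A minimal sketch of the timestamp caveat, using made-up dates and values. The dictionary below mimics what df.to_dict() prints for a datetime column; without the Timestamp import, pasting it would raise a NameError because the name Timestamp is undefined.

```python
import pandas as pd
from pandas import Timestamp  # lets the pasted Timestamp(...) literals resolve

# Pasted to_dict() output for a hypothetical frame with a datetime column.
df = pd.DataFrame({'date': {0: Timestamp('2020-01-01 00:00:00'),
                            1: Timestamp('2020-01-02 00:00:00')},
                   'value': {0: 1.5, 1: 2.5}})
print(df.dtypes)
```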
