如何融化 pandas 数据框? [英] How do I melt a pandas dataframe?

查看:25
本文介绍了如何融化 pandas 数据框?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是 melt 函数的简单逻辑.

解决方案:

我会解决我自己的问题.

问题 1:

问题 1 可以使用 pd 解决.DataFrame.melt 使用以下代码:

print(df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grades'))

此代码将 id_vars 参数传递给 ['Name', 'Age'],然后 value_vars 将自动设置为其他列(['Math', 'English']),转换为该格式.

您也可以使用 stack 解决问题 1 如下:

打印(df.set_index([姓名", 年龄"]).堆().reset_index(name=等级").重命名(列={level_2":主题"}).sort_values(主题").reset_index(drop=True))

此代码将NameAge 列设置为索引,并将其余的MathEnglish,并重置索引并指定Grade作为列名,然后将另一列level_2重命名为Subject,然后按Subject 列,然后最后再次重置索引.

这两种解决方案的输出:

 姓名年龄科目成绩0 鲍勃 13 英语 C约翰一书 16 年英语 B2 Foo 14 英语 B3 小节 15 英语 A+4 亚历克斯 17 英语 F5 汤姆 12 英语 A6 鲍勃 13 数学 A+7 约翰 16 数学 B8 Foo 14 数学 A9 小节 15 数学 F10 亚历克斯 17 数学 D11 汤姆 12 数学 C

问题 2:

这和我的第一个问题类似,但是这个我只在 Math 列中过滤,这次可以使用 value_vars 参数,就像下面:

打印(df.melt(id_vars=[姓名",年龄"],value_vars=数学",var_name="主题",value_name=成绩",))

或者我们也可以使用stack 列规范:

打印(df.set_index([姓名", 年龄"])[[数学"]].堆().reset_index(name=等级").重命名(列={level_2":主题"}).sort_values(主题").reset_index(drop=True))

这两种解决方案都给出:

 姓名年龄科目成绩0 鲍勃 13 数学 A+约翰一书 16 数学 B2 Foo 16 数学 A3 小节 15 数学 F4 亚历克斯 15 数学 D5 汤姆 13 数学 C

问题 3:

问题 3 可以通过 melt<解决/code>groupby,使用带有', '.joinagg函数,如下所示:

打印(df.melt(id_vars=[姓名", 年龄"]).groupby(值", as_index=False).agg(", ".join))

它融合数据框,然后按成绩分组并聚合它们并用逗号将它们连接起来.

stack 也可以用来解决这个问题,堆栈groupby 如下:

打印(df.set_index([姓名", 年龄"]).堆().reset_index().rename(columns={"level_2": "Subjects", 0: "Grade"}).groupby(成绩", as_index=False).agg(", ".join))

这个stack 函数只是以等效于 的方式转置数据框melt,然后重置索引,重命名列和组以及聚合.

两种解决方案的输出:

 年级名称 科目0 A Foo, Tom Math, 英语1 A+ Bob,酒吧数学,英语2 B John, John, Foo Math, English, English3 C 鲍勃、汤姆英语、数学4 D 亚历克斯数学5 F Bar, Alex Math, 英语

问题 4:

我们首先融化输入数据的数据框:

df = df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grades')



那么现在我们可以开始解决这个问题4了.

问题 4 可以通过 pivot_table<解决/code>,我们必须指定 pivot_table 参数、valuesindexcolumnsaggfunc>.

我们可以用下面的代码解决它:

打印(df.pivot_table(Grades", [Name", Age"], Subject", aggfunc=first").reset_index().rename_axis(列=无))

输出:

 Name Age English Math0 亚历克斯 15 F D1 巴 15 A+ F2 鲍勃 13 C A+3 富 16 BA4 约翰 16 B B5 汤姆 13 A C

融化的数据帧被转换回与原始数据帧完全相同的格式.

我们首先旋转熔化的数据框,然后重置索引并删除列轴名称.

问题 5:

问题 5 可以通过 melt<解决/code>groupby如下:

打印(df.melt(id_vars=[Name", Age"], var_name=Subject", value_name=Grades").groupby(名称", as_index=False).agg(", ".join))

根据 Name 融合和分组.

或者你可以stack:

打印(df.set_index([姓名", 年龄"]).堆().reset_index().groupby(名称", as_index=False).agg(", ".join).rename({"level_2": "Subjects", 0: "Grades"},axis=1))

两个代码输出:

 命名科目成绩0 亚历克斯数学,英语 D,F1 小节数学,英语 F,A+2 鲍勃数学,英语 A+,C3 Foo Math, 英语 A, B4 约翰数学,英语 B,B5 汤姆数学,英语 C,A

问题 6:

问题 6 可以通过 melt<解决/code> 并且不需要指定列,只需指定预期的列名:

print(df.melt(var_name='Column', value_name='Value'))

这融化了整个数据框

或者你可以stack:

打印(df.stack().reset_index(level=1).sort_values(level_1").reset_index(drop=True).set_axis([列",值"],轴=1))

两个代码输出:

 列值0 16 岁1 15 岁2 15 岁3 16 岁4 13 岁5 13 岁6 英语 A+7 英语 B8 英语 B9 英语 A10 英语 F11 英语 C12 数学 C13 数学 A+14 数学 D15 数学 B16 数学 F17 数学 A18 名字亚历克斯19 名称栏20 名字汤姆第21话22 约翰的名字23 鲍勃的名字

结论:

melt 是一个非常方便的功能,通常是必需的,一旦你遇到这些类型的问题,不要忘记尝试 melt,它可以很好地解决您的问题.

请记住在 < 下使用 pandas 版本的用户0.20.0,你必须使用 pd.melt(df, ...) 而不是 df.melt(...).


On the tag, I often see users asking questions about melting dataframes in pandas. I am gonna attempt a cannonical Q&A (self-answer) with this topic.

I am gonna clarify:

  1. What is melt?

  2. How do I use melt?

  3. When do I use melt?

I see some hotter questions about melt, like:

So I am gonna attempt a canonical Q&A for this topic.



Dataset:

I will have all my answers on this dataset of random grades for random people with random ages (easier to explain for the answers :D):

import pandas as pd
df = pd.DataFrame({'Name': ['Bob', 'John', 'Foo', 'Bar', 'Alex', 'Tom'], 
                   'Math': ['A+', 'B', 'A', 'F', 'D', 'C'], 
                   'English': ['C', 'B', 'B', 'A+', 'F', 'A'],
                   'Age': [13, 16, 16, 15, 15, 13]})


>>> df
   Name Math English  Age
0   Bob   A+       C   13
1  John    B       B   16
2   Foo    A       B   16
3   Bar    F      A+   15
4  Alex    D       F   15
5   Tom    C       A   13
>>> 

Problems:

I am gonna have some problems and they will be solved in my self-answer below.

Problem 1:

How do I melt a dataframe so that the original dataframe becomes:

    Name  Age  Subject Grade
0    Bob   13  English     C
1   John   16  English     B
2    Foo   14  English     B
3    Bar   15  English    A+
4   Alex   17  English     F
5    Tom   12  English     A
6    Bob   13     Math    A+
7   John   16     Math     B
8    Foo   14     Math     A
9    Bar   15     Math     F
10  Alex   17     Math     D
11   Tom   12     Math     C

I want to transpose this so that one column would be each subject and the other columns would be the repeated names of the students and there age and score.

Problem 2:

This is similar to Problem 1, but this time I want to make the Problem 1 output Subject column only have Math, I want to filter out the English column:

   Name  Age Subject Grades
0   Bob   13    Math     A+
1  John   16    Math      B
2   Foo   16    Math      A
3   Bar   15    Math      F
4  Alex   15    Math      D
5   Tom   13    Math      C

I want the output to be like the above.

Problem 3:

If I was to group the melt and order the students by there scores, how would I be able to do that, to get the desired output like the below:

  value             Name                Subjects
0     A         Foo, Tom           Math, English
1    A+         Bob, Bar           Math, English
2     B  John, John, Foo  Math, English, English
3     C         Tom, Bob           Math, English
4     D             Alex                    Math
5     F        Bar, Alex           Math, English

I need it to be ordered and the names separated by comma and also the Subjects separated by comma in the same order respectively

Problem 4:

How would I unmelt a melted dataframe? Let's say I already melted this dataframe:

print(df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grades'))

To become:

    Name  Age  Subject Grades
0    Bob   13     Math     A+
1   John   16     Math      B
2    Foo   16     Math      A
3    Bar   15     Math      F
4   Alex   15     Math      D
5    Tom   13     Math      C
6    Bob   13  English      C
7   John   16  English      B
8    Foo   16  English      B
9    Bar   15  English     A+
10  Alex   15  English      F
11   Tom   13  English      A

Then how would I translate this back to the original dataframe, the below:

   Name Math English  Age
0   Bob   A+       C   13
1  John    B       B   16
2   Foo    A       B   16
3   Bar    F      A+   15
4  Alex    D       F   15
5   Tom    C       A   13

How would I go about doing this?

Problem 5:

If I was to group by the names of the students and separate the subjects and grades by comma, how would I do it?

   Name        Subject Grades
0  Alex  Math, English   D, F
1   Bar  Math, English  F, A+
2   Bob  Math, English  A+, C
3   Foo  Math, English   A, B
4  John  Math, English   B, B
5   Tom  Math, English   C, A

I want to have a dataframe like above.

Problem 6:

If I was gonna completely melt my dataframe, all columns as values, how would I do it?

     Column Value
0      Name   Bob
1      Name  John
2      Name   Foo
3      Name   Bar
4      Name  Alex
5      Name   Tom
6      Math    A+
7      Math     B
8      Math     A
9      Math     F
10     Math     D
11     Math     C
12  English     C
13  English     B
14  English     B
15  English    A+
16  English     F
17  English     A
18      Age    13
19      Age    16
20      Age    16
21      Age    15
22      Age    15
23      Age    13

I want to have a dataframe like above. All columns as values.

Please check my self-answer below :)

解决方案

Note for users with pandas version under < 0.20.0, I will be using df.melt(...) for my examples, but your version would be too low for df.melt, you would need to use pd.melt(df, ...) instead.

Documentation references:

Most of the solutions here would be used with melt, so to know the method melt, see the documentaion explanation

Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are "unpivoted" to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

And the parameters are:

Parameters

  • id_vars : tuple, list, or ndarray, optional

    Column(s) to use as identifier variables.

  • value_vars : tuple, list, or ndarray, optional

    Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

  • var_name : scalar

    Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.

  • value_name : scalar, default ‘value’

    Name to use for the ‘value’ column.

  • col_level : int or str, optional

    If columns are a MultiIndex then use this level to melt.

  • ignore_index : bool, default True

    If True, original index is ignored. If False, the original index is retained. Index labels will be repeated as necessary.

    New in version 1.1.0.

Logic to melting:

Melting merges multiple columns and converts the dataframe from wide to long, for the solution to Problem 1 (see below), the steps are:

  1. First we got the original dataframe.

  2. Then the melt firstly merges the Math and English columns and makes the dataframe replicated (longer).

  3. Then finally adds the column Subject which is the subject of the Grades columns value respectively.

This is the simple logic to what the melt function does.

Solutions:

I will solve my own questions.

Problem 1:

Problem 1 could be solve using pd.DataFrame.melt with the following code:

print(df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grades'))

This code passes the id_vars argument to ['Name', 'Age'], then automatically the value_vars would be set to the other columns (['Math', 'English']), which is transposed into that format.

You could also solve Problem 1 using stack like the below:

print(
    df.set_index(["Name", "Age"])
    .stack()
    .reset_index(name="Grade")
    .rename(columns={"level_2": "Subject"})
    .sort_values("Subject")
    .reset_index(drop=True)
)

This code sets the Name and Age columns as the index and stacks the rest of the columns Math and English, and resets the index and assigns Grade as the column name, then renames the other column level_2 to Subject and then sorts by the Subject column, then finally resets the index again.

Both of these solutions output:

    Name  Age  Subject Grade
0    Bob   13  English     C
1   John   16  English     B
2    Foo   14  English     B
3    Bar   15  English    A+
4   Alex   17  English     F
5    Tom   12  English     A
6    Bob   13     Math    A+
7   John   16     Math     B
8    Foo   14     Math     A
9    Bar   15     Math     F
10  Alex   17     Math     D
11   Tom   12     Math     C

Problem 2:

This is similar to my first question, but this one I only one to filter in the Math columns, this time the value_vars argument can come into use, like the below:

print(
    df.melt(
        id_vars=["Name", "Age"],
        value_vars="Math",
        var_name="Subject",
        value_name="Grades",
    )
)

Or we can also use stack with column specification:

print(
    df.set_index(["Name", "Age"])[["Math"]]
    .stack()
    .reset_index(name="Grade")
    .rename(columns={"level_2": "Subject"})
    .sort_values("Subject")
    .reset_index(drop=True)
)

Both of these solutions give:

   Name  Age Subject Grade
0   Bob   13    Math    A+
1  John   16    Math     B
2   Foo   16    Math     A
3   Bar   15    Math     F
4  Alex   15    Math     D
5   Tom   13    Math     C

Problem 3:

Problem 3 could be solved with melt and groupby, using the agg function with ', '.join, like the below:

print(
    df.melt(id_vars=["Name", "Age"])
    .groupby("value", as_index=False)
    .agg(", ".join)
)

It melts the dataframe then groups by the grades and aggregates them and joins them by a comma.

stack could be also used to solve this problem, with stack and groupby like the below:

print(
    df.set_index(["Name", "Age"])
    .stack()
    .reset_index()
    .rename(columns={"level_2": "Subjects", 0: "Grade"})
    .groupby("Grade", as_index=False)
    .agg(", ".join)
)

This stack function just transposes the dataframe in a way that is equivalent to melt, then resets the index, renames the columns and groups and aggregates.

Both solutions output:

  Grade             Name                Subjects
0     A         Foo, Tom           Math, English
1    A+         Bob, Bar           Math, English
2     B  John, John, Foo  Math, English, English
3     C         Bob, Tom           English, Math
4     D             Alex                    Math
5     F        Bar, Alex           Math, English

Problem 4:

We first melt the dataframe for the input data:

df = df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grades')



Then now we can start solving this Problem 4.

Problem 4 could be solved with pivot_table, we would have to specify to the pivot_table arguments, values, index, columns and also aggfunc.

We could solve it with the below code:

print(
    df.pivot_table("Grades", ["Name", "Age"], "Subject", aggfunc="first")
    .reset_index()
    .rename_axis(columns=None)
)

Output:

   Name  Age English Math
0  Alex   15       F    D
1   Bar   15      A+    F
2   Bob   13       C   A+
3   Foo   16       B    A
4  John   16       B    B
5   Tom   13       A    C

The melted dataframe is converted back to the exact same format as the original dataframe.

We first pivot the melted dataframe and then reset the index and remove the column axis name.

Problem 5:

Problem 5 could be solved with melt and groupby like the following:

print(
    df.melt(id_vars=["Name", "Age"], var_name="Subject", value_name="Grades")
    .groupby("Name", as_index=False)
    .agg(", ".join)
)

That melts and groups by Name.

Or you could stack:

print(
    df.set_index(["Name", "Age"])
    .stack()
    .reset_index()
    .groupby("Name", as_index=False)
    .agg(", ".join)
    .rename({"level_2": "Subjects", 0: "Grades"}, axis=1)
)

Both codes output:

   Name       Subjects Grades
0  Alex  Math, English   D, F
1   Bar  Math, English  F, A+
2   Bob  Math, English  A+, C
3   Foo  Math, English   A, B
4  John  Math, English   B, B
5   Tom  Math, English   C, A

Problem 6:

Problem 6 could be solved with melt and no column needed to be specified, just specify the expected column names:

print(df.melt(var_name='Column', value_name='Value'))

That melts the whole dataframe

Or you could stack:

print(
    df.stack()
    .reset_index(level=1)
    .sort_values("level_1")
    .reset_index(drop=True)
    .set_axis(["Column", "Value"], axis=1)
)

Both codes output:

     Column Value
0       Age    16
1       Age    15
2       Age    15
3       Age    16
4       Age    13
5       Age    13
6   English    A+
7   English     B
8   English     B
9   English     A
10  English     F
11  English     C
12     Math     C
13     Math    A+
14     Math     D
15     Math     B
16     Math     F
17     Math     A
18     Name  Alex
19     Name   Bar
20     Name   Tom
21     Name   Foo
22     Name  John
23     Name   Bob

Conclusion:

melt is a really handy function, often it's required, once you meet these types of problems, don't forget to try melt, it may well solve your problem.

Remember for users with pandas versions under < 0.20.0, you would have to use pd.melt(df, ...) instead of df.melt(...).


这篇关于如何融化 pandas 数据框?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆