如何规范 pandas 数据框中的多列字典 [英] How to normalize multiple columns of dicts in a pandas dataframe

查看:70
本文介绍了如何规范 pandas 数据框中的多列字典的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是编码的新手,我可以理解,这是一个非常基本的问题

I am new to coding and I can understand that this is a very basic question

我有一个数据框为:

df

      Unnamed: 0  time                 home_team      away_team       full_time_result                    both_teams_to_score        double_chance
--  ------------  -------------------  -------------  --------------  ----------------------------------  -------------------------  ------------------------------------
 0             0  2021-01-12 18:00:00  Sheff Utd      Newcastle       {'1': 2400, 'X': 3200, '2': 3100}   {'yes': 2000, 'no': 1750}  {'1X': 1360, '12': 1360, '2X': 1530}
 1             1  2021-01-12 20:15:00  Burnley        Man Utd         {'1': 7000, 'X': 4500, '2': 1440}   {'yes': 1900, 'no': 1900}  {'1X': 2620, '12': 1180, '2X': 1100}
 2             2  2021-01-12 20:15:00  Wolverhampton  Everton         {'1': 2450, 'X': 3200, '2': 3000}   {'yes': 1950, 'no': 1800}  {'1X': 1360, '12': 1360, '2X': 1530}
 3             3  2021-01-13 18:00:00  Man City       Brighton        {'1': 1180, 'X': 6500, '2': 14000}  {'yes': 2040, 'no': 1700}  {'1X': 1040, '12': 1110, '2X': 4500}
 4             4  2021-01-13 20:15:00  Aston Villa    Tottenham       {'1': 2620, 'X': 3500, '2': 2500}   {'yes': 1570, 'no': 2250}  {'1X': 1500, '12': 1280, '2X': 1440}
 5             5  2021-01-14 20:00:00  Arsenal        Crystal Palace  {'1': 1500, 'X': 4000, '2': 6500}   {'yes': 1950, 'no': 1800}  {'1X': 1110, '12': 1220, '2X': 2500}
 6             6  2021-01-15 20:00:00  Fulham         Chelsea         {'1': 5750, 'X': 4330, '2': 1530}   {'yes': 1800, 'no': 1950}  {'1X': 2370, '12': 1200, '2X': 1140}
 7             7  2021-01-16 12:30:00  Wolverhampton  West Brom       {'1': 1440, 'X': 4200, '2': 7500}   {'yes': 2250, 'no': 1570}  {'1X': 1100, '12': 1220, '2X': 2620}
 8             8  2021-01-16 15:00:00  Leeds          Brighton        {'1': 2000, 'X': 3600, '2': 3600}   {'yes': 1530, 'no': 2370}  {'1X': 1280, '12': 1280, '2X': 1720}

我希望很好地格式化字典列表,并获取数据框,例如将full_time_result列拆分为full_time_result_1,full_time_result_X,full_time_result_2,并且Both_teams_to_score和double_chance的内容相同,如下所示:

I am looking to format the dictionary list nicely and get the dataframe as e.g. the full_time_result column would be split into full_time_result_1, full_time_result_X, full_time_result_2 and the same for both_teams_to_score and double_chance as below:

      Unnamed: 0  time                 home_team      away_team       full_time_result_1                    full_time_result_x                    full_time_result_2                    both_teams_to_score_yes        both_teams_to_score_no        double_chance_1X
--  ------------  -------------------  -------------  --------------  ----------------------------------  -------------------------  ------------------------------------

我正在按照此示例在此处给出,但我无法使其正常工作.这是我的代码:

I am following this example given here but I am unable to get it to work. Here is my code:

import pandas as pd
from tabulate import tabulate
df = pd.read_csv(r'C:\Users\Harshad\Desktop\re.csv')
df['full_time_result'] = df['full_time_result'].apply(pd.Series)
print(tabulate(df, headers='keys'))

      Unnamed: 0  time                 home_team      away_team       full_time_result                    both_teams_to_score        double_chance
--  ------------  -------------------  -------------  --------------  ----------------------------------  -------------------------  ------------------------------------
 0             0  2021-01-12 18:00:00  Sheff Utd      Newcastle       {'1': 2400, 'X': 3200, '2': 3100}   {'yes': 2000, 'no': 1750}  {'1X': 1360, '12': 1360, '2X': 1530}
 1             1  2021-01-12 20:15:00  Burnley        Man Utd         {'1': 7000, 'X': 4500, '2': 1440}   {'yes': 1900, 'no': 1900}  {'1X': 2620, '12': 1180, '2X': 1100}
 2             2  2021-01-12 20:15:00  Wolverhampton  Everton         {'1': 2450, 'X': 3200, '2': 3000}   {'yes': 1950, 'no': 1800}  {'1X': 1360, '12': 1360, '2X': 1530}
 3             3  2021-01-13 18:00:00  Man City       Brighton        {'1': 1180, 'X': 6500, '2': 14000}  {'yes': 2040, 'no': 1700}  {'1X': 1040, '12': 1110, '2X': 4500}
 4             4  2021-01-13 20:15:00  Aston Villa    Tottenham       {'1': 2620, 'X': 3500, '2': 2500}   {'yes': 1570, 'no': 2250}  {'1X': 1500, '12': 1280, '2X': 1440}

我们将不胜感激.

推荐答案

  • 验证列是 dict 类型,而不是 str 类型.
    • 如果列为 str 类型,请使用 ast.literal_eval 对其进行转换.
      • Verify the columns are dict type, and not str type.
        • If the columns are str type, convert them with ast.literal_eval.
        • import pandas as pd
          from ast import literal_eval
          
          # test dataframe
          data = {'time': ['2021-01-12 18:00:00', '2021-01-12 20:15:00', '2021-01-12 20:15:00', '2021-01-13 18:00:00', '2021-01-13 20:15:00', '2021-01-14 20:00:00', '2021-01-15 20:00:00', '2021-01-16 12:30:00', '2021-01-16 15:00:00'], 'home_team': ['Sheff Utd', 'Burnley', 'Wolverhampton', 'Man City', 'Aston Villa', 'Arsenal', 'Fulham', 'Wolverhampton', 'Leeds'], 'away_team': ['Newcastle', 'Man Utd', 'Everton', 'Brighton', 'Tottenham', 'Crystal Palace', 'Chelsea', 'West Brom', 'Brighton'], 'full_time_result': ["{'1': 2400, 'X': 3200, '2': 3100}", "{'1': 7000, 'X': 4500, '2': 1440}", "{'1': 2450, 'X': 3200, '2': 3000}", "{'1': 1180, 'X': 6500, '2': 14000}", "{'1': 2620, 'X': 3500, '2': 2500}", "{'1': 1500, 'X': 4000, '2': 6500}", "{'1': 5750, 'X': 4330, '2': 1530}", "{'1': 1440, 'X': 4200, '2': 7500}", "{'1': 2000, 'X': 3600, '2': 3600}"], 'both_teams_to_score': ["{'yes': 2000, 'no': 1750}", "{'yes': 1900, 'no': 1900}", "{'yes': 1950, 'no': 1800}", "{'yes': 2040, 'no': 1700}", "{'yes': 1570, 'no': 2250}", "{'yes': 1950, 'no': 1800}", "{'yes': 1800, 'no': 1950}", "{'yes': 2250, 'no': 1570}", "{'yes': 1530, 'no': 2370}"], 'double_chance': ["{'1X': 1360, '12': 1360, '2X': 1530}", "{'1X': 2620, '12': 1180, '2X': 1100}", "{'1X': 1360, '12': 1360, '2X': 1530}", "{'1X': 1040, '12': 1110, '2X': 4500}", "{'1X': 1500, '12': 1280, '2X': 1440}", "{'1X': 1110, '12': 1220, '2X': 2500}", "{'1X': 2370, '12': 1200, '2X': 1140}", "{'1X': 1100, '12': 1220, '2X': 2620}", "{'1X': 1280, '12': 1280, '2X': 1720}"]}
          df = pd.DataFrame(data)
          
          # display(df.head(2))
                            time  home_team  away_team                   full_time_result        both_teams_to_score                         double_chance
          0  2021-01-12 18:00:00  Sheff Utd  Newcastle  {'1': 2400, 'X': 3200, '2': 3100}  {'yes': 2000, 'no': 1750}  {'1X': 1360, '12': 1360, '2X': 1530}
          1  2021-01-12 20:15:00    Burnley    Man Utd  {'1': 7000, 'X': 4500, '2': 1440}  {'yes': 1900, 'no': 1900}  {'1X': 2620, '12': 1180, '2X': 1100}
          
          # convert time to datetime
          df.time = pd.to_datetime(df.time)
          
          # determine if columns are str or dict type
          print(type(df.iloc[0, 3]))
          [out]:
          str
          
          # convert columns from str to dict only if the columns are str type
          df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)
          
          # normalize columns and rename headers
          ftr = pd.json_normalize(df.full_time_result)
          ftr.columns = [f'full_time_result_{col}' for col in ftr.columns]
          
          btts = pd.json_normalize(df.both_teams_to_score)
          btts.columns = [f'both_teams_to_score_{col}' for col in btts.columns]
          
          dc = pd.json_normalize(df.double_chance)
          dc.columns = [f'double_chance_{col}' for col in dc.columns]
          
          # concat the dataframes
          df_normalized = pd.concat([df.iloc[:, :3], ftr, btts, dc], axis=1)
          

          display(df_normalized)

                           time      home_team       away_team  full_time_result_1  full_time_result_X  full_time_result_2  both_teams_to_score_yes  both_teams_to_score_no  double_chance_1X  double_chance_12  double_chance_2X
          0 2021-01-12 18:00:00      Sheff Utd       Newcastle                2400                3200                3100                     2000                    1750              1360              1360              1530
          1 2021-01-12 20:15:00        Burnley         Man Utd                7000                4500                1440                     1900                    1900              2620              1180              1100
          2 2021-01-12 20:15:00  Wolverhampton         Everton                2450                3200                3000                     1950                    1800              1360              1360              1530
          3 2021-01-13 18:00:00       Man City        Brighton                1180                6500               14000                     2040                    1700              1040              1110              4500
          4 2021-01-13 20:15:00    Aston Villa       Tottenham                2620                3500                2500                     1570                    2250              1500              1280              1440
          5 2021-01-14 20:00:00        Arsenal  Crystal Palace                1500                4000                6500                     1950                    1800              1110              1220              2500
          6 2021-01-15 20:00:00         Fulham         Chelsea                5750                4330                1530                     1800                    1950              2370              1200              1140
          7 2021-01-16 12:30:00  Wolverhampton       West Brom                1440                4200                7500                     2250                    1570              1100              1220              2620
          8 2021-01-16 15:00:00          Leeds        Brighton                2000                3600                3600                     1530                    2370              1280              1280              1720
          

          合并代码

          # convert the columns to dict type if they are str type
          df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)
          
          # normalize all columns
          df_list = list()
          
          for col in df.columns[3:]:
              v = pd.json_normalize(df[col])
              v.columns = [f'{col}_{c}' for c in v.columns]
              df_list.append(v)
          
          # combine into one dataframe
          df_normalized = pd.concat([df.iloc[:, :3]] + df_list, axis=1)
          

          这篇关于如何规范 pandas 数据框中的多列字典的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆