How to json_normalize a column with NaNs


Problem description

  • This question is specific to columns of data in a pandas.DataFrame.
  • It depends on whether the values in the column are of str, dict, or list type.
  • It addresses how to deal with NaN values when df.dropna().reset_index(drop=True) isn't a valid option.

Case 1

  • With a column of str type, the values in the column must be converted to dict type with ast.literal_eval before using pandas.json_normalize.

import numpy as np
import pandas as pd
from ast import literal_eval

df = pd.DataFrame({'col_str': ['{"a": "46", "b": "3", "c": "12"}', '{"b": "2", "c": "7"}', '{"c": "11"}', np.nan]})

                            col_str
0  {"a": "46", "b": "3", "c": "12"}
1              {"b": "2", "c": "7"}
2                       {"c": "11"}
3                               NaN

type(df.iloc[0, 0])
[out]: str

df.col_str.apply(literal_eval)

Error:

df.col_str.apply(literal_eval) results in ValueError: malformed node or string: nan
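
The error comes from the type of the missing value rather than from the strings themselves: np.nan is a float, and literal_eval only accepts a str or an AST node. A minimal sketch of that failure, where error_message is just an illustrative variable name:

```python
import numpy as np
from ast import literal_eval

# NaN is a plain Python float, not a str
print(type(np.nan))

try:
    # literal_eval receives a float instead of a str
    literal_eval(np.nan)
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```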

Case 2

  • With a column of dict type, use pandas.json_normalize to convert the keys to column headers and the values to rows.

df = pd.DataFrame({'col_dict': [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}, {"c": "11"}, np.nan]})

                           col_dict
0  {'a': '46', 'b': '3', 'c': '12'}
1              {'b': '2', 'c': '7'}
2                       {'c': '11'}
3                               NaN

type(df.iloc[0, 0])
[out]: dict

pd.json_normalize(df.col_dict)

Error:

pd.json_normalize(df.col_dict) results in AttributeError: 'float' object has no attribute 'items'
      

Case 3

  • In a column of str type, with the dicts inside a list.
  • To normalize the column:
    • apply literal_eval, because explode doesn't work on str type
    • explode the column to separate the dicts into separate rows
    • normalize the column

df = pd.DataFrame({'col_str': ['[{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}]', '[{"b": "2", "c": "7"}, {"c": "11"}]', np.nan]})

                                                    col_str
0  [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}]
1                       [{"b": "2", "c": "7"}, {"c": "11"}]
2                                                       NaN

type(df.iloc[0, 0])
[out]: str

df.col_str.apply(literal_eval)

Error:

df.col_str.apply(literal_eval) results in ValueError: malformed node or string: nan
              

Recommended answer

  • As pointed out in a comment, there is always the option to use:
    • df = df.dropna().reset_index(drop=True)
    • That's fine for the dummy data here, or when dealing with a dataframe where the other columns don't matter.
    • It's not a great option for dataframes with additional columns that are required.
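
To illustrate why dropna can be a poor fit, here is a small sketch with an extra column. The id column and its values are hypothetical and are not part of the original data; they stand in for any additional column that must be kept.

```python
import numpy as np
import pandas as pd

# 'id' is a hypothetical extra column whose rows must all be preserved
df = pd.DataFrame({
    'id': [10, 11, 12, 13],
    'col_str': ['{"a": "46", "b": "3", "c": "12"}', '{"b": "2", "c": "7"}', '{"c": "11"}', np.nan],
})

# dropna removes every row that contains a NaN, including its 'id' value
dropped = df.dropna().reset_index(drop=True)

kept_ids = dropped['id'].tolist()
lost_ids = sorted(set(df['id'].tolist()) - set(kept_ids))

print(kept_ids)
print(lost_ids)
```

The row whose JSON column is missing disappears entirely, which is the situation the fillna-based approaches in this answer try to avoid.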

Case 1

  • Since the column contains str types, fillna with '{}' (a str).

import numpy as np
import pandas as pd
from ast import literal_eval

df = pd.DataFrame({'col_str': ['{"a": "46", "b": "3", "c": "12"}', '{"b": "2", "c": "7"}', '{"c": "11"}', np.nan]})

                            col_str
0  {"a": "46", "b": "3", "c": "12"}
1              {"b": "2", "c": "7"}
2                       {"c": "11"}
3                               NaN

type(df.iloc[0, 0])
[out]: str

# fillna
df.col_str = df.col_str.fillna('{}')

# convert the column to dicts
df.col_str = df.col_str.apply(literal_eval)

# use json_normalize
df = df.join(pd.json_normalize(df.col_str)).drop(columns=['col_str'])

# display(df)
     a    b    c
0   46    3   12
1  NaN    2    7
2  NaN  NaN   11
3  NaN  NaN  NaN
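
One thing worth keeping in mind is that DataFrame.join aligns rows on the index, so the join step relies on the dataframe having a default 0..n-1 index. The following is a sketch of one way to handle a dataframe with a different index, not the answer's own method. The id column, the string index labels, and the names parsed, normalized, and result are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from ast import literal_eval

# hypothetical data: an extra 'id' column and a non-default index
df = pd.DataFrame(
    {
        'id': [10, 11, 12, 13],
        'col_str': ['{"a": "46", "b": "3", "c": "12"}', '{"b": "2", "c": "7"}', '{"c": "11"}', np.nan],
    },
    index=['w', 'x', 'y', 'z'],
)

# fill NaN with an empty-dict string, then parse every string into a dict
parsed = df['col_str'].fillna('{}').apply(literal_eval)

# normalize a plain list of dicts; the result has a fresh 0..n-1 index
normalized = pd.json_normalize(parsed.tolist())

# reuse the original index so that join lines the rows up correctly
normalized.index = df.index

result = df.drop(columns=['col_str']).join(normalized)
print(result)
```

Assigning the index explicitly keeps each normalized row next to the row it came from, whatever the original index looks like.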
                      

Case 2

  • Since the column contains dict types, fillna with {} (not a str).
  • The NaNs need to be filled using a dict comprehension, since fillna({}) does not work.

df = pd.DataFrame({'col_dict': [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}, {"c": "11"}, np.nan]})

                           col_dict
0  {'a': '46', 'b': '3', 'c': '12'}
1              {'b': '2', 'c': '7'}
2                       {'c': '11'}
3                               NaN

type(df.iloc[0, 0])
[out]: dict

# fillna
df.col_dict = df.col_dict.fillna({i: {} for i in df.index})

# use json_normalize
df = df.join(pd.json_normalize(df.col_dict)).drop(columns=['col_dict'])

# display(df)
     a    b    c
0   46    3   12
1  NaN    2    7
2  NaN  NaN   11
3  NaN  NaN  NaN
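
As a possible alternative to passing a dict comprehension to fillna, the missing entries can be replaced with a plain list comprehension and an isinstance check. This is only a sketch of a different technique, not what the answer uses; the names cleaned, normalized, and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_dict': [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}, {"c": "11"}, np.nan]})

# keep real dicts, replace anything else (here: NaN) with an empty dict
cleaned = [x if isinstance(x, dict) else {} for x in df['col_dict']]

# normalize the list of dicts and reuse the original index for the join
normalized = pd.json_normalize(cleaned)
normalized.index = df.index

result = df.drop(columns=['col_dict']).join(normalized)
print(result)
```

This version does not depend on how fillna interprets a dict argument, at the cost of stepping outside the pandas API for the cleaning step.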
                          

Case 3

  1. Fill the NaNs with '[]' (a str).
  2. Now literal_eval will work.
  3. .explode can be used on the column to separate the dict values into rows.
  4. Now the NaNs need to be filled with {} (not a str).
  5. Then the column can be normalized.

  • For the case when the column contains lists of dicts that aren't str type, skip to .explode.

df = pd.DataFrame({'col_str': ['[{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}]', '[{"b": "2", "c": "7"}, {"c": "11"}]', np.nan]})

                                                    col_str
0  [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}]
1                       [{"b": "2", "c": "7"}, {"c": "11"}]
2                                                       NaN

type(df.iloc[0, 0])
[out]: str

# fillna
df.col_str = df.col_str.fillna('[]')

# literal_eval
df.col_str = df.col_str.apply(literal_eval)

# explode
df = df.explode('col_str').reset_index(drop=True)

# fillna again
df.col_str = df.col_str.fillna({i: {} for i in df.index})

# use json_normalize
df = df.join(pd.json_normalize(df.col_str)).drop(columns=['col_str'])

# display(df)
     a    b    c
0   46    3   12
1  NaN    2    7
2  NaN    2    7
3  NaN  NaN   11
4  NaN  NaN  NaN
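
To show how the same steps behave when other columns must be kept, here is a sketch with a hypothetical id column (not in the original data). It swaps the second fillna for an isinstance check in a list comprehension, which is a substitution made here rather than the answer's approach; records, normalized, and result are illustrative names.

```python
import numpy as np
import pandas as pd
from ast import literal_eval

# 'id' is a hypothetical extra column that should survive the reshaping
df = pd.DataFrame({
    'id': [10, 11, 12],
    'col_str': ['[{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}]', '[{"b": "2", "c": "7"}, {"c": "11"}]', np.nan],
})

# steps 1 and 2: fill NaN with an empty-list string, then parse the strings
df['col_str'] = df['col_str'].fillna('[]').apply(literal_eval)

# step 3: one row per dict; an empty list becomes a single NaN row
df = df.explode('col_str').reset_index(drop=True)

# step 4: replace the remaining NaN with an empty dict
records = [x if isinstance(x, dict) else {} for x in df['col_str']]

# step 5: normalize and join back on the (reset) index
normalized = pd.json_normalize(records)
normalized.index = df.index
result = df.drop(columns=['col_str']).join(normalized)
print(result)
```

Each id value is repeated once per dict in its list, and the row that originally held NaN is kept with empty a, b, and c values.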
                              
