How to json_normalize a column with NaNs?


Problem description


  • This question is specific to columns of data in a pandas.DataFrame
  • This question depends on if the values in the columns are str, dict, or list type.
  • This question addresses dealing with the NaN values, when df.dropna().reset_index(drop=True) isn't a valid option.

  • Case 1

    • With a column of str type, the values in the column must be converted to dict type with ast.literal_eval before using .json_normalize.

import numpy as np
import pandas as pd
from ast import literal_eval

df = pd.DataFrame({'col_str': ['{"a": "46", "b": "3", "c": "12"}', '{"b": "2", "c": "7"}', '{"c": "11"}', np.nan]})

                            col_str
0  {"a": "46", "b": "3", "c": "12"}
1              {"b": "2", "c": "7"}
2                       {"c": "11"}
3                               NaN

type(df.iloc[0, 0])
[out]: str

df.col_str.apply(literal_eval)

Error:

df.col_str.apply(literal_eval) results in ValueError: malformed node or string: nan
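
The error comes from literal_eval itself rather than from pandas: a missing value in an object column is a float, not a str. A minimal sketch that reproduces this without a DataFrame (the names value, raised and message are illustrative only):

```python
from ast import literal_eval

# a missing entry in an object column is a plain float NaN
value = float("nan")

try:
    literal_eval(value)
    raised = False
    message = ""
except ValueError as err:
    # literal_eval expects a str (or an AST node), so the float is rejected
    raised = True
    message = str(err)

print(raised, message)
```

Any fix therefore has to make sure that only str values reach literal_eval.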


  • Case 2

    • With a column of dict type, use pandas.json_normalize to convert keys to column headers and values to rows.

df = pd.DataFrame({'col_dict': [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}, {"c": "11"}, np.nan]})

                           col_dict
0  {'a': '46', 'b': '3', 'c': '12'}
1              {'b': '2', 'c': '7'}
2                       {'c': '11'}
3                               NaN

type(df.iloc[0, 0])
[out]: dict

pd.json_normalize(df.col_dict)

Error:

pd.json_normalize(df.col_dict) results in AttributeError: 'float' object has no attribute 'items'
      


  • Case 3

    • In a column of str type, with the dict inside a list.
    • To normalize the column:
      • apply literal_eval, because explode doesn't work on str type
      • explode the column to separate the dicts to separate rows
      • normalize the column

df = pd.DataFrame({'col_str': ['[{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}]', '[{"b": "2", "c": "7"}, {"c": "11"}]', np.nan]})

                                                    col_str
0  [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}]
1                       [{"b": "2", "c": "7"}, {"c": "11"}]
2                                                       NaN

type(df.iloc[0, 0])
[out]: str

df.col_str.apply(literal_eval)

Error:

df.col_str.apply(literal_eval) results in ValueError: malformed node or string: nan
              


Recommended answer


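  • As pointed out in the comments, there is always the option to use:

    • df = df.dropna().reset_index(drop=True)

    • That is fine for the dummy data here, or when dealing with a dataframe where the other columns don't matter.

    • It is not a great option for dataframes with additional columns that are required.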
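
To make the drawback of dropna concrete, here is a small sketch with an extra column. The id column and the names df_extra and dropped are assumptions made up for this illustration, they are not part of the original data:

```python
import numpy as np
import pandas as pd

# 'id' is a hypothetical extra column that should be kept for every row
df_extra = pd.DataFrame({
    'id': [10, 11, 12, 13],
    'col_str': ['{"a": "46"}', '{"b": "2"}', '{"c": "11"}', np.nan],
})

# dropna removes the whole row, including the valid 'id' value
dropped = df_extra.dropna().reset_index(drop=True)

print(dropped['id'].tolist())
```

The row with id 13 is lost, which is why the cases below fill the NaNs instead of dropping them.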


  • Case 1

    • Since the column contains str types, fillna with '{}' (a str)

import numpy as np
import pandas as pd
from ast import literal_eval

df = pd.DataFrame({'col_str': ['{"a": "46", "b": "3", "c": "12"}', '{"b": "2", "c": "7"}', '{"c": "11"}', np.nan]})

                            col_str
0  {"a": "46", "b": "3", "c": "12"}
1              {"b": "2", "c": "7"}
2                       {"c": "11"}
3                               NaN

type(df.iloc[0, 0])
[out]: str

# fillna with the string '{}' so every row can be parsed
df.col_str = df.col_str.fillna('{}')

# convert the column to dicts
df.col_str = df.col_str.apply(literal_eval)

# use json_normalize, then replace the original column with the new columns
df = df.join(pd.json_normalize(df.col_str)).drop(columns=['col_str'])

# display(df)
     a    b    c
0   46    3   12
1  NaN    2    7
2  NaN  NaN   11
3  NaN  NaN  NaN
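
As a variation on the steps above, the fill and the parse can be combined into one guarded function. This is only a sketch: parse_or_empty, df1, records, normalized1 and result1 are hypothetical names introduced here, and passing a plain list to json_normalize is a choice made for this example.

```python
import numpy as np
import pandas as pd
from ast import literal_eval


def parse_or_empty(value):
    """Hypothetical helper: parse a str with literal_eval, return {} for anything else."""
    return literal_eval(value) if isinstance(value, str) else {}


df1 = pd.DataFrame({'col_str': ['{"a": "46", "b": "3", "c": "12"}', '{"b": "2", "c": "7"}', '{"c": "11"}', np.nan]})

# parse each row, then hand json_normalize a plain list of dicts
records = df1['col_str'].apply(parse_or_empty).tolist()
normalized1 = pd.json_normalize(records)

# drop the source column and attach the normalized columns by position
result1 = df1.drop(columns=['col_str']).join(normalized1)
print(result1)
```

The join relies on df1 having a default RangeIndex; with a custom index, the index would need to be reset first.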
                  


  • Case 2

    • Since the column contains dict types, fillna with {} (not a str)
    • This needs to be filled using a dict-comprehension, since fillna({}) does not work

df = pd.DataFrame({'col_dict': [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}, {"c": "11"}, np.nan]})

                           col_dict
0  {'a': '46', 'b': '3', 'c': '12'}
1              {'b': '2', 'c': '7'}
2                       {'c': '11'}
3                               NaN

type(df.iloc[0, 0])
[out]: dict

# fillna: map every index label to an empty dict, so only the NaN rows are replaced
df.col_dict = df.col_dict.fillna({i: {} for i in df.index})

# use json_normalize, then replace the original column with the new columns
df = df.join(pd.json_normalize(df.col_dict)).drop(columns=['col_dict'])

# display(df)
     a    b    c
0   46    3   12
1  NaN    2    7
2  NaN  NaN   11
3  NaN  NaN  NaN
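
One possible alternative to the dict-comprehension fill, shown only as a sketch and not as the method from the answer, is an isinstance check inside apply. The names df2, filled, normalized2 and result2 are made up for this example.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'col_dict': [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}, {"c": "11"}, np.nan]})

# keep real dicts, replace everything else (here: NaN) with an empty dict
filled = df2['col_dict'].apply(lambda x: x if isinstance(x, dict) else {})

# normalize a plain list of dicts and attach the result by position
normalized2 = pd.json_normalize(filled.tolist())
result2 = df2.drop(columns=['col_dict']).join(normalized2)
print(result2)
```

This variant does not depend on how fillna treats dict arguments, at the cost of a Python-level function call per row.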
                      


  • Case 3

    1. Fill the NaNs with '[]' (a str)
    2. Now literal_eval will work
    3. .explode can be used on the column to separate the dict values to rows
    4. Now the NaNs need to be filled with {} (not a str)
    5. Then the column can be normalized



    • For the case when the column is lists of dicts that aren't str type, skip to .explode.

df = pd.DataFrame({'col_str': ['[{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}]', '[{"b": "2", "c": "7"}, {"c": "11"}]', np.nan]})

                                                    col_str
0  [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}]
1                       [{"b": "2", "c": "7"}, {"c": "11"}]
2                                                       NaN

type(df.iloc[0, 0])
[out]: str

# fillna with the string '[]' so every row can be parsed
df.col_str = df.col_str.fillna('[]')

# literal_eval: convert each str to a list of dicts
df.col_str = df.col_str.apply(literal_eval)

# explode: one dict per row (an empty list becomes a NaN row)
df = df.explode('col_str').reset_index(drop=True)

# fillna again, this time with empty dicts
df.col_str = df.col_str.fillna({i: {} for i in df.index})

# use json_normalize, then replace the original column with the new columns
df = df.join(pd.json_normalize(df.col_str)).drop(columns=['col_str'])

# display(df)
     a    b    c
0   46    3   12
1  NaN    2    7
2  NaN    2    7
3  NaN  NaN   11
4  NaN  NaN  NaN
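
The "skip to .explode" variant mentioned above, for a column that already holds lists of dicts, could look like the following sketch. The column name col_list and the names df3, exploded, dicts and result3 are assumptions for illustration, and the NaN fill uses an isinstance check rather than the fillna call from the answer.

```python
import numpy as np
import pandas as pd

# the column already contains list objects, so no literal_eval is needed
df3 = pd.DataFrame({'col_list': [
    [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}],
    [{"b": "2", "c": "7"}, {"c": "11"}],
    np.nan,
]})

# explode: one dict per row; the NaN entry stays as a single NaN row
exploded = df3.explode('col_list').reset_index(drop=True)

# replace the remaining NaN with an empty dict
dicts = exploded['col_list'].apply(lambda x: x if isinstance(x, dict) else {})

# normalize a plain list of dicts and attach the result by position
result3 = exploded.drop(columns=['col_list']).join(pd.json_normalize(dicts.tolist()))
print(result3)
```

The reset_index after explode matters here, because explode repeats the original index labels and the join is positional on a fresh RangeIndex.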
                          
