Split / Explode a column of dictionaries into separate columns with pandas

Question

I have data saved in a PostgreSQL database. I am querying this data using Python 2.7 and turning it into a pandas DataFrame. However, the last column of this DataFrame holds a dictionary of values. The DataFrame df looks like this:

Station ID     Pollutants
8809           {"a": "46", "b": "3", "c": "12"}
8810           {"a": "36", "b": "5", "c": "8"}
8811           {"b": "2", "c": "7"}
8812           {"c": "11"}
8813           {"a": "82", "c": "15"}

I need to split this column into separate columns, so that the DataFrame df2 looks like this:

Station ID     a      b       c
8809           46     3       12
8810           36     5       8
8811           NaN    2       7
8812           NaN    NaN     11
8813           82     NaN     15

The major issue I'm having is that the dictionaries are not all the same length. But each one contains at most the same 3 keys: 'a', 'b', and 'c'. And they always appear in the same order ('a' first, 'b' second, 'c' third).

The following code used to work and returned exactly what I wanted (df2):

objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
print(df2)

I was running this code just last week and it was working fine. But now my code is broken and I get this error from line [4]:

IndexError: out-of-bounds on slice (end) 

I made no changes to the code but am now getting the error. I suspect this is because my method is not robust or proper.

Any suggestions or guidance on how to split this column of dictionaries into separate columns would be much appreciated!

I think the .tolist() and .apply methods are not working on my data because each value is a single Unicode string, i.e.:

#My data format 
u{'a': '1', 'b': '2', 'c': '3'}

#and not
{u'a': '1', u'b': '2', u'c': '3'}

The data is imported from the PostgreSQL database in this format. Any help or ideas with this issue? Is there a way to convert the Unicode string?

Recommended answer

To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval). Afterwards, the solution below can be used to convert the dict to different columns.
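Because eval executes whatever expression the string contains, a more cautious variant of the same step is sketched below using ast.literal_eval from the standard library, which only accepts Python literals. The sample frame is a made-up stand-in for the asker's data, and the single-quoted string format is an assumption based on the example in the question. If the strings are valid JSON (double quotes, as in the table at the top of the question), json.loads would be another option.

```python
import ast
import json

import pandas as pd

# Hypothetical sample: each cell is a *string* that only looks like a dict.
df = pd.DataFrame({
    "Station ID": [8809, 8811],
    "Pollutant Levels": [
        "{'a': '46', 'b': '3', 'c': '12'}",
        "{'b': '2', 'c': '7'}",
    ],
})

# literal_eval parses Python literals (dicts, lists, strings, numbers, ...)
# and raises an error on anything else, instead of executing it.
df["Pollutant Levels"] = df["Pollutant Levels"].map(ast.literal_eval)

# Each cell is now a real dict and can be expanded into columns.
print(type(df.loc[0, "Pollutant Levels"]))

# For JSON-formatted strings (double-quoted keys and values), json.loads works too.
parsed = json.loads('{"a": "46", "b": "3", "c": "12"}')
```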

Using a small example, you can use .apply(pd.Series):

In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

In [3]: df
Out[3]:
   a                   b
0  1           {u'c': 1}
1  2           {u'd': 3}
2  3  {u'c': 5, u'd': 6}

In [4]: df['b'].apply(pd.Series)
Out[4]:
     c    d
0  1.0  NaN
1  NaN  3.0
2  5.0  6.0

To combine it with the rest of the dataframe, you can concat the other columns with the above result:

In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
Out[7]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0

Using your code, this also works if I leave out the iloc part:

In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
Out[15]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0
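Applied to the sample data from the question, the same idea might look like the sketch below. It assumes the column is named 'Pollutant Levels' (as in the asker's code) and already holds real dicts. The reindex and pd.to_numeric steps are additions for illustration, not part of the original answer: the first pins the column order to a, b, c, and the second converts the string values to numbers. Passing index=df.index is a precaution, since a frame built from .tolist() otherwise gets a fresh 0..n-1 index and would be misaligned if df had a different one.

```python
import pandas as pd

# Sample data copied from the question; values are strings, some keys are missing.
df = pd.DataFrame({
    "Station ID": [8809, 8810, 8811, 8812, 8813],
    "Pollutant Levels": [
        {"a": "46", "b": "3", "c": "12"},
        {"a": "36", "b": "5", "c": "8"},
        {"b": "2", "c": "7"},
        {"c": "11"},
        {"a": "82", "c": "15"},
    ],
})

# Expand the dicts into columns; missing keys become NaN.
# index=df.index keeps the rows aligned with the original frame.
levels = pd.DataFrame(df["Pollutant Levels"].tolist(), index=df.index)

# Fix the column order; a key absent from every row would become an all-NaN column.
levels = levels.reindex(columns=["a", "b", "c"])

# Convert the string values to numeric dtypes, column by column.
levels = levels.apply(pd.to_numeric)

# Replace the dict column with the expanded columns.
df2 = df.drop(columns="Pollutant Levels").join(levels)
print(df2)
```

Columns that contain NaN come out as floats, which matches the 1.0 / NaN output shown in the answer's small example.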
