Splitting dictionary/list inside a Pandas Column into Separate Columns


Problem description


I have data saved in a PostgreSQL database. I am querying this data using Python 2.7 and turning it into a Pandas DataFrame. However, the last column of this DataFrame holds a dictionary (or list?) of values in each row. The DataFrame looks like this:

[1] df
Station ID     Pollutant Levels
8809           {"a": "46", "b": "3", "c": "12"}
8810           {"a": "36", "b": "5", "c": "8"}
8811           {"b": "2", "c": "7"}
8812           {"c": "11"}
8813           {"a": "82", "c": "15"}

I need to split this column into separate columns so that the DataFrame looks like this:

[2] df2
Station ID     a      b       c
8809           46     3       12
8810           36     5       8
8811           NaN    2       7
8812           NaN    NaN     11
8813           82     NaN     15

The major issue I'm having is that the entries are not all the same length. But every entry contains at most the same 3 keys: a, b, and c, and they always appear in the same order (a first, b second, c third).

The following code USED to work and return exactly what I wanted (df2).

[3] df 
[4] objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
[5] df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
[6] print(df2)

I was running this code just last week and it was working fine. But now my code is broken and I get this error from line [4]:

IndexError: out-of-bounds on slice (end) 

I made no changes to the code but am now getting the error. I feel this is due to my method not being robust or proper.

Any suggestions or guidance on how to split this column of lists into separate columns would be super appreciated!

EDIT: I think the .tolist() and .apply methods are not working on my code because it is one unicode string, i.e.:

#My data format 
u{'a': '1', 'b': '2', 'c': '3'}

#and not
{u'a': '1', u'b': '2', u'c': '3'}

The data is imported from the PostgreSQL database in this format. Any help or ideas with this issue? Is there a way to convert the unicode?

Solution

To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval). Afterwards, the solution below can be used to convert the dict to different columns.
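Since eval runs whatever code a string contains, a more cautious variant of the same step is ast.literal_eval from the standard library, which only accepts Python literals. The following is a minimal sketch, assuming each cell holds the text of a dict as described in the question's edit; the two sample rows are made up for illustration.

```python
import ast

import pandas as pd

# Hypothetical sample: each cell is a *string* that looks like a dict,
# mimicking the format described in the question's edit.
df = pd.DataFrame({
    "Station ID": [8809, 8811],
    "Pollutant Levels": [
        "{'a': '46', 'b': '3', 'c': '12'}",
        "{'b': '2', 'c': '7'}",
    ],
})

# literal_eval parses Python literals only, so no arbitrary code is run.
parsed = df["Pollutant Levels"].map(ast.literal_eval)

print(type(parsed.iloc[0]))
```

If the strings turn out to be valid JSON (double-quoted keys and values), json.loads could be mapped over the column in the same way.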


Using a small example, you can use .apply(pd.Series):

In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

In [3]: df
Out[3]:
   a                   b
0  1           {u'c': 1}
1  2           {u'd': 3}
2  3  {u'c': 5, u'd': 6}

In [4]: df['b'].apply(pd.Series)
Out[4]:
     c    d
0  1.0  NaN
1  NaN  3.0
2  5.0  6.0

To combine it with the rest of the dataframe, you can concat the other columns with the above result:

In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
Out[7]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0


Using your code, this also works if I leave out the iloc part:

In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
Out[15]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0
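Applied to the data from the question, the same idea might look like the sketch below. It assumes the column already holds real dicts (that is, any string parsing has been done), and the reindex and pd.to_numeric steps are additions for illustration, not part of the original answer: they fix the column order to a, b, c and turn the string values into numbers.

```python
import pandas as pd

# Sample data rebuilt from the question; the values are strings, as shown there.
df = pd.DataFrame({
    "Station ID": [8809, 8810, 8811, 8812, 8813],
    "Pollutant Levels": [
        {"a": "46", "b": "3", "c": "12"},
        {"a": "36", "b": "5", "c": "8"},
        {"b": "2", "c": "7"},
        {"c": "11"},
        {"a": "82", "c": "15"},
    ],
})

# Expand the dicts into columns; keys missing from a row become NaN.
levels = pd.DataFrame(df["Pollutant Levels"].tolist(), index=df.index)

# Force a fixed column order and convert the string values to numbers.
levels = levels.reindex(columns=["a", "b", "c"]).apply(pd.to_numeric)

# Attach the new columns in place of the original dict column.
df2 = pd.concat([df.drop("Pollutant Levels", axis=1), levels], axis=1)
print(df2)
```

Passing index=df.index keeps the expanded rows aligned with the original frame even when its index is not the default 0..n-1, which matters for the concat along axis=1.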
